根据多个条件合并两个 CSV 文件
Merging Two CSV Files Based on Many Criteria
我有两个 CSV 文件。它们有相同的列,但同一列中的每一行都不是唯一的,如下所示:
gpo_full.csv:
Date hearing_sub_type topic Specific_Date
January,1997 Oversight weather January 12,1997
June,2000 General life June 5,2000
January,1997 General forest January 1,1997
April,2001 Oversight people NaN
June,2000 Oversight depressed June 6,2000
January,1997 General weather January 1,1997
June,2000 Oversight depressed June 5,2000
CAP_cols.csv:
majortopic id Chamber topic Date Specific_Date
21 79846 1 many forest January,1997 January 1,1997
4 79847 2 emotion June,2000 June 6,2000
13 79848 1 NaN May,2001 NaN
7 79849 2 good life June,2000 June 5,2000
21 79850 1 good weather January,1997 January 1,1997
25 79851 1 rain & cloudy January,1997 January 12,1997
6 79852 2 sad & depressed June,2000 June 5,2000
我想使用三个条件来匹配这些数据:Specific_Date、日期和主题。
首先,我想使用“日期”列来对这些数据进行分组。接下来,我尝试使用“Specific_Date”列来缩小范围,因为该列中丢失了一些数据。最后,我想使用类似词嵌入的“主题”列来确保 gpo_full 中的哪些行可以与 CAP_cols 中的唯一行对应。
我尝试使用“日期”列对数据进行分组并将它们合并到 JSON 文件中。但是,我被困在实现下一步以按特定日期和主题缩小范围的过程中。
我对这个输出的想法是:
{
"Date": "January,1997",
"Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather", "CAP": "rain & cloudy"}
"GPO": {
"hearing_sub_type": "Oversight",
and other columns
}
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest", "CAP": "many forest"}
"GPO": {
"hearing_sub_type": "General",
and other columns
}
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
and similar for others}
我想了三天了,一点头绪也没有。任何实现这一目标的想法都会非常有帮助!非常感谢!
主题匹配有几个问题,所以你需要扩展我使用的 match_topic()
方法,但我添加了一些逻辑以查看最后没有匹配的内容。
results
变量包含一个字典列表,您可以轻松地将其保存为 JSON 文件。
检查内联注释以了解我使用的逻辑推理。
旁注:
如果我是你,我会稍微重组 JSON。将 topic
作为 key/value 对放在 GPO
和 CAP
键下对我来说比 Topic
键和单独的 GPO
和 CAP
key/value 对...
import csv
from pprint import pprint
import json
# load gpo_full.csv into a list of dict using
# csv.DictReader & list comprehension
with open("path/to/file/gpo_full.csv") as infile:
gpo_full = [item for item in csv.DictReader(infile)]
# do the same for CAP_cols.csv
with open("path/to/file/CAP_cols.csv") as infile:
cap_cols = [item for item in csv.DictReader(infile)]
def match_topic(gpo_topic: str, cap_topic: str) -> bool:
"""We need a function as some of the mapping is not simple
Args:
gpo_topic (str): gpo topic
cap_topic (str): CAP topic
Returns:
bool: True if topics match
"""
# this one is simple
if gpo_topic in cap_topic:
return True
# you need to repeat the below conditional check
# for each custom topic matching
elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
return True
# example secondary topic matching
elif gpo_topic == "foo" and cap_topic == "bar":
return True
# finally return false for no matches
return False
# we need this later
gpo_length = len(gpo_full)
results = []
cap_left_over = []
# do the actual mapping
# this could've been done above, but I separated it intentionally
for cap in cap_cols:
found = False
# first find the corresponding gpo
for index, gpo in enumerate(gpo_full):
if (
gpo["Specific_Date"] == cap["Specific_Date"] # check by date
and match_topic(gpo["topic"], cap["topic"]) # check if topics match
):
results.append({
"Date": gpo["Date"],
"Specific_Date": gpo["Specific_Date"],
"Topic": {
"GPO": gpo["topic"],
"CAP": cap["topic"]
},
"GPO": {
"hearing_sub_type": gpo["hearing_sub_type"]
},
"CAP": {
"majortopic": cap["majortopic"],
"id": cap["id"],
"Chamber": cap["Chamber"]
}
})
# pop & break to remove the gpo item
# this is so you're left over with a list of
# gpo items that didn't match
# it also speeds up further matches
gpo_full.pop(index)
found = True
break
# this is to check if there's stuff left over
if not found:
cap_left_over.append(cap)
with open('path/to/file/combined_json.json', 'w') as outfile:
json.dump(results, outfile, indent=4)
pprint(results)
print(f'\nLength:\n Results: {len(results)}\n CAP: {len(cap)}\n GPO: {gpo_length}')
print('\nLeftover GPO:')
pprint(gpo_full)
print('\nLeftover CAP:')
pprint(cap_left_over)
输出
我已经从输出中删除了 pprint(results)
,请参阅 JSON 进一步向下
Length:
Results: 5
CAP: 6
GPO: 7
Leftover GPO:
[{'Date': 'April,2001',
'Specific_Date': 'NaN ',
'hearing_sub_type': 'Oversight',
'topic': 'people'},
{'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'hearing_sub_type': 'Oversight',
'topic': 'depressed'}]
Leftover CAP:
[{'Chamber': '2',
'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'id': '79847',
'majortopic': '4',
'topic': 'emotion'},
{'Chamber': '1',
'Date': 'May,2001',
'Specific_Date': 'NaN',
'id': '79848',
'majortopic': '13',
'topic': 'NaN'}]
path/to/file/gpo_full.csv
Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,1997",General,forest,"January 1,1997"
"April,2001",Oversight,people,NaN
"June,2000",Oversight,depressed,"June 6,2000"
"January,1997",General,weather,"January 1,1997"
"June,2000",Oversight,depressed,"June 5,2000"
path/to/file/CAP_cols.csv
majortopic,id,Chamber,topic,Date,Specific_Date
21,79846,1,many forest,"January,1997","January 1,1997"
4,79847,2,emotion,"June,2000","June 6,2000"
13,79848,1,NaN,"May,2001","NaN"
7,79849,2,good life,"June,2000","June 5,2000"
21,79850,1,good weather,"January,1997","January 1,1997"
25,79851,1,rain & cloudy,"January,1997","January 12,1997"
6,79852,2,sad & depressed,"June,2000","June 5,2000"
path/to/file/combined_json.json
[
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "forest",
"CAP": "many forest"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "life",
"CAP": "good life"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "7",
"id": "79849",
"Chamber": "2"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "weather",
"CAP": "good weather"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79850",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 12,1997",
"Topic": {
"GPO": "weather",
"CAP": "rain & cloudy"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "depressed",
"CAP": "sad & depressed"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "6",
"id": "79852",
"Chamber": "2"
}
}
]
我有两个 CSV 文件。它们有相同的列,但同一列中的每一行都不是唯一的,如下所示:
gpo_full.csv:
Date hearing_sub_type topic Specific_Date
January,1997 Oversight weather January 12,1997
June,2000 General life June 5,2000
January,1997 General forest January 1,1997
April,2001 Oversight people NaN
June,2000 Oversight depressed June 6,2000
January,1997 General weather January 1,1997
June,2000 Oversight depressed June 5,2000
CAP_cols.csv:
majortopic id Chamber topic Date Specific_Date
21 79846 1 many forest January,1997 January 1,1997
4 79847 2 emotion June,2000 June 6,2000
13 79848 1 NaN May,2001 NaN
7 79849 2 good life June,2000 June 5,2000
21 79850 1 good weather January,1997 January 1,1997
25 79851 1 rain & cloudy January,1997 January 12,1997
6 79852 2 sad & depressed June,2000 June 5,2000
我想使用三个条件来匹配这些数据:Specific_Date、日期和主题。
首先,我想使用“日期”列来对这些数据进行分组。接下来,我尝试使用“Specific_Date”列来缩小范围,因为该列中丢失了一些数据。最后,我想使用类似词嵌入的“主题”列来确保 gpo_full 中的哪些行可以与 CAP_cols 中的唯一行对应。
我尝试使用“日期”列对数据进行分组并将它们合并到 JSON 文件中。但是,我被困在实现下一步以按特定日期和主题缩小范围的过程中。
我对这个输出的想法是:
{
"Date": "January,1997",
"Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather", "CAP": "rain & cloudy"}
"GPO": {
"hearing_sub_type": "Oversight",
and other columns
}
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest", "CAP": "many forest"}
"GPO": {
"hearing_sub_type": "General",
and other columns
}
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
and similar for others}
我想了三天了,一点头绪也没有。任何实现这一目标的想法都会非常有帮助!非常感谢!
主题匹配有几个问题,所以你需要扩展我使用的 match_topic()
方法,但我添加了一些逻辑以查看最后没有匹配的内容。
results
变量包含一个字典列表,您可以轻松地将其保存为 JSON 文件。
检查内联注释以了解我使用的逻辑推理。
旁注:
如果我是你,我会稍微重组 JSON。将 topic
作为 key/value 对放在 GPO
和 CAP
键下对我来说比 Topic
键和单独的 GPO
和 CAP
key/value 对...
import csv
from pprint import pprint
import json
# load gpo_full.csv into a list of dict using
# csv.DictReader & list comprehension
with open("path/to/file/gpo_full.csv") as infile:
gpo_full = [item for item in csv.DictReader(infile)]
# do the same for CAP_cols.csv
with open("path/to/file/CAP_cols.csv") as infile:
cap_cols = [item for item in csv.DictReader(infile)]
def match_topic(gpo_topic: str, cap_topic: str) -> bool:
"""We need a function as some of the mapping is not simple
Args:
gpo_topic (str): gpo topic
cap_topic (str): CAP topic
Returns:
bool: True if topics match
"""
# this one is simple
if gpo_topic in cap_topic:
return True
# you need to repeat the below conditional check
# for each custom topic matching
elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
return True
# example secondary topic matching
elif gpo_topic == "foo" and cap_topic == "bar":
return True
# finally return false for no matches
return False
# we need this later
gpo_length = len(gpo_full)
results = []
cap_left_over = []
# do the actual mapping
# this could've been done above, but I separated it intentionally
for cap in cap_cols:
found = False
# first find the corresponding gpo
for index, gpo in enumerate(gpo_full):
if (
gpo["Specific_Date"] == cap["Specific_Date"] # check by date
and match_topic(gpo["topic"], cap["topic"]) # check if topics match
):
results.append({
"Date": gpo["Date"],
"Specific_Date": gpo["Specific_Date"],
"Topic": {
"GPO": gpo["topic"],
"CAP": cap["topic"]
},
"GPO": {
"hearing_sub_type": gpo["hearing_sub_type"]
},
"CAP": {
"majortopic": cap["majortopic"],
"id": cap["id"],
"Chamber": cap["Chamber"]
}
})
# pop & break to remove the gpo item
# this is so you're left over with a list of
# gpo items that didn't match
# it also speeds up further matches
gpo_full.pop(index)
found = True
break
# this is to check if there's stuff left over
if not found:
cap_left_over.append(cap)
with open('path/to/file/combined_json.json', 'w') as outfile:
json.dump(results, outfile, indent=4)
pprint(results)
print(f'\nLength:\n Results: {len(results)}\n CAP: {len(cap)}\n GPO: {gpo_length}')
print('\nLeftover GPO:')
pprint(gpo_full)
print('\nLeftover CAP:')
pprint(cap_left_over)
输出
我已经从输出中删除了 pprint(results)
,请参阅 JSON 进一步向下
Length:
Results: 5
CAP: 6
GPO: 7
Leftover GPO:
[{'Date': 'April,2001',
'Specific_Date': 'NaN ',
'hearing_sub_type': 'Oversight',
'topic': 'people'},
{'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'hearing_sub_type': 'Oversight',
'topic': 'depressed'}]
Leftover CAP:
[{'Chamber': '2',
'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'id': '79847',
'majortopic': '4',
'topic': 'emotion'},
{'Chamber': '1',
'Date': 'May,2001',
'Specific_Date': 'NaN',
'id': '79848',
'majortopic': '13',
'topic': 'NaN'}]
path/to/file/gpo_full.csv
Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,1997",General,forest,"January 1,1997"
"April,2001",Oversight,people,NaN
"June,2000",Oversight,depressed,"June 6,2000"
"January,1997",General,weather,"January 1,1997"
"June,2000",Oversight,depressed,"June 5,2000"
path/to/file/CAP_cols.csv
majortopic,id,Chamber,topic,Date,Specific_Date
21,79846,1,many forest,"January,1997","January 1,1997"
4,79847,2,emotion,"June,2000","June 6,2000"
13,79848,1,NaN,"May,2001","NaN"
7,79849,2,good life,"June,2000","June 5,2000"
21,79850,1,good weather,"January,1997","January 1,1997"
25,79851,1,rain & cloudy,"January,1997","January 12,1997"
6,79852,2,sad & depressed,"June,2000","June 5,2000"
path/to/file/combined_json.json
[
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "forest",
"CAP": "many forest"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "life",
"CAP": "good life"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "7",
"id": "79849",
"Chamber": "2"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "weather",
"CAP": "good weather"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79850",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 12,1997",
"Topic": {
"GPO": "weather",
"CAP": "rain & cloudy"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "depressed",
"CAP": "sad & depressed"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "6",
"id": "79852",
"Chamber": "2"
}
}
]