Merge Pandas DataFrames under certain conditions
I have two dataframes. One is a "gold" dataframe, meaning I need to keep every row of it after the merge. The other is a reference.
Here is a sneak peek at the two dataframes.
gold
doc_name mention id
0 doc_1 US United States
0 doc_1 Georgia Atl
0 doc_1 Bama Selma
0 doc_1 Europe UK
0 doc_2 HSBC HK Bank Central
0 doc_2 NC Charlotte
: : :
: : :
0 doc_n CA San Jose
reference
doc_name text
0 doc_1 The US
0 doc_1 Georgia's Fried Chicken
0 doc_1 Bama Football
0 doc_1 HSBC
0 doc_1 Bank of America
0 doc_1 NC Panthers
0 doc_1 MI Packers
0 doc_1 NC Panthers
: :
: :
0 doc_n CA's apt
I tried merging the two dataframes with an outer join, df = pd.merge(gold, reference, right_on=['doc_name'], left_on=['doc_name'], how='outer'), and then filtering the rows under the 'text' column by whether they contain the string in the 'mention' column. But if I do that, I lose rows from the gold dataframe, which I don't want.
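To illustrate the problem, here is a minimal sketch with tiny made-up frames (the column names match the question, the data is hypothetical) showing why filtering after an outer join drops gold rows:

```python
import pandas as pd

# Hypothetical miniature versions of the two frames.
gold = pd.DataFrame({
    "doc_name": ["doc_1", "doc_1"],
    "mention": ["US", "Europe"],
    "id": ["United States", "UK"],
})
reference = pd.DataFrame({
    "doc_name": ["doc_1"],
    "text": ["The US"],
})

# Outer merge on doc_name, then keep only rows whose text contains the mention.
df = pd.merge(gold, reference, on="doc_name", how="outer")
filtered = df[df.apply(lambda r: isinstance(r["text"], str)
                       and r["mention"] in r["text"], axis=1)]
print(filtered)
# The "Europe" gold row disappears entirely instead of surviving with NaN text.
```

This is exactly the row loss described above: the filter cannot tell "no match, keep with NaN" apart from "no match, drop".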
The output I want looks like this:
doc_name mention id text
0 doc_1 US United States The US
0 doc_1 Georgia Atl Georgia's Fried Chicken
0 doc_1 Bama Selma Bama Football
0 doc_1 Europe UK NaN
0 doc_2 HSBC HK Bank Central HSBC
0 doc_2 NC Charlotte NC Panthers
: : : :
: : : :
0 doc_n CA San Jose CA's apt
Basically I want to keep all the rows of the gold dataframe, but also pull in the 'text' column from the reference dataframe wherever the text contains the string in gold's 'mention' column. I have been trying to do this but still can't find a good way. If you all have ideas or suggestions, that would be great. Thanks a lot!
Gold raw csv:
doc_name,mention,id
chtb_165.en,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui
chtb_165.en,Pudong,Pudong
chtb_165.en,US,United States
chtb_165.en,Citibank,Citibank
chtb_165.en,Hong Kong,Hong Kong
chtb_165.en,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ
VOA20001129.2000.036,Washington,"Washington, D.C."
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman
VOA20001129.2000.036,Baltimore,Baltimore
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe
VOA20001129.2000.036,Gore,Al Gore
VOA20001129.2000.036,legislature,Florida Legislature
VOA20001129.2000.036,Congress,United States Congress
Reference raw csv:
doc_name,text
VOA20001129.2000.036,the Bush
VOA20001129.2000.036,American election
VOA20001129.2000.036,Congress
VOA20001129.2000.036,George W Bush
chtb_165.en,Xinhua News Agency
chtb_165.en,Shanghai
chtb_165.en,HSBC
chtb_165.en,China Shipping
chtb_165.en,Mansion
chtb_165.en,RMB
chtb_165.en,the US
chtb_165.en,"Citibank , Hong Kong"
chtb_165.en,Japan
chtb_165.en,Tokyo Mitsubishi Bank
chtb_165.en,Industrial Bank
chtb_165.en,Branch
chtb_165.en,Chartered Bank
chtb_165.en,BNP
chtb_165.en,Paris
chtb_165.en,Bank
chtb_165.en,Dai-Ichi Kangyo Bank
chtb_165.en,Sanwa Bank
chtb_165.en,Financial Trading
chtb_165.en,District
chtb_165.en,Franklin Templeton
chtb_165.en,Company
chtb_165.en,California
chtb_165.en,US dollars
chtb_165.en,China
chtb_165.en,Asian
chtb_165.en,Securities
chtb_165.en,Building
chtb_165.en,Hong Kong
chtb_165.en,Japan Industrial Bank
chtb_165.en,Holland
chtb_165.en,Belgium
chtb_165.en,Credit Bank
chtb_165.en,Waitan
Here is the answer you want. It generates an "output.csv" which you can read back as a dataframe with pandas, giving you the expected result.
This is my "output.csv". The result looks odd only because your sample inputs (reference.csv and gold.csv) are a small subset. If you test on your full, large input CSVs you will get the correct output:
doc_name,mention,id,text
VOA20001129.2000.036,Washington,Washington D.C.,
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States,
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman,
VOA20001129.2000.036,Baltimore,Baltimore,
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University,
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe,
VOA20001129.2000.036,Gore,Al Gore,
VOA20001129.2000.036,legislature,Florida Legislature,
VOA20001129.2000.036,Congress,United States Congress,Congress
chtb_165.en,Xinhua News Agency,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building,
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui,
chtb_165.en,Pudong,Pudong,
chtb_165.en,US,United States,the US
chtb_165.en,Citibank,Citibank,Citibank Hong Kong
chtb_165.en,Hong Kong,Hong Kong,Citibank Hong Kong
chtb_165.en,Japan,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ,Tokyo Mitsubishi Bank
Finally, here is the code:
from collections import OrderedDict
import inspect

"""
Note: Only works on Python 3.6+
"""

class GoldClass:
    def __init__(self):
        self.mention = []
        self.id = []

def retrieve_name(var):
    # Look up the variable name an object is bound to in the caller's frame.
    callers_local_vars = inspect.currentframe().f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var][0]

def get_nth_key(dictionary, n):
    if n < 0:
        n += len(dictionary)
    for i, key in enumerate(dictionary.keys()):
        if i == n:
            return key
    raise IndexError("dictionary index out of range")

with open("reference.csv") as reference_file:
    reference_list = reference_file.readlines()
with open("gold.csv") as gold_file:
    gold_list = gold_file.readlines()

# Build doc_name -> list of texts, flattening quoted fields with embedded commas.
reference_dict = OrderedDict()
for x in range(len(reference_list)):
    if x == 0:
        continue  # skip the header row
    reference_list[x] = reference_list[x].strip()
    if reference_list[x].count(',') > 1:
        temp1 = reference_list[x].split(",")[0]
        temp2 = reference_list[x][len(temp1) + 1:]
        temp2 = temp2.replace(",", "").replace('"', "")
        reference_list[x] = temp1 + "," + temp2
    try:
        reference_dict[reference_list[x].split(",")[0]]
    except KeyError:
        reference_dict[reference_list[x].split(",")[0]] = []
    reference_dict[reference_list[x].split(",")[0]].append(reference_list[x].split(",")[1])

# Group the gold rows per document into dynamically created GoldClass objects.
for x in range(len(gold_list)):
    if x == 0:
        continue  # skip the header row
    gold_list[x] = gold_list[x].strip()
    if gold_list[x].count(',') > 2:
        temp1 = gold_list[x].split(",")[0]
        temp2 = gold_list[x].split(",")[1]
        temp3 = gold_list[x][len(temp1) + len(temp2) + 2:]
        temp3 = temp3.replace(",", "").replace('"', "")
        gold_list[x] = temp1 + "," + temp2 + "," + temp3
    temp_doc_name = gold_list[x].split(",")[0]
    temp_mention = gold_list[x].split(",")[1]
    temp_id = gold_list[x].split(",")[2]
    temp_index = list(reference_dict.keys()).index(temp_doc_name)
    try:
        exec("goldclass_" + str(temp_index))
    except NameError:
        exec("goldclass_" + str(temp_index) + " = GoldClass()")
    exec("goldclass_" + str(temp_index) + ".mention.append(temp_mention)")
    exec("goldclass_" + str(temp_index) + ".id.append(temp_id)")

goldclass_objectlist = []
goldclass_iterator = 0
while True:
    try:
        exec("goldclass_objectlist.append(goldclass_" + str(goldclass_iterator) + ")")
        goldclass_iterator = goldclass_iterator + 1
    except NameError:
        break

# For each gold row, take the first reference text containing the mention.
final_lines = []
final_lines.append("doc_name,mention,id,text")
for temp4 in goldclass_objectlist:
    final_doc_name = get_nth_key(reference_dict, int(retrieve_name(temp4).split("_")[1]))
    for x in range(len(temp4.id)):
        final_mention = temp4.mention[x]
        final_id = temp4.id[x]
        final_text = ""
        for y in reference_dict[final_doc_name]:
            if final_mention in y:
                final_text = y
                break
        final_lines.append(final_doc_name + "," + final_mention + "," + final_id + "," + final_text)

f = open("output.csv", "w")
for x in final_lines:
    f.write(x + "\n")
f.close()
How do you want to handle the case where multiple texts in the reference match the same mention in gold? Those create duplicate rows.
Given:
gold.csv
doc_name,mention,id
doc_1,US,United States
doc_1,Georgia,Atl
doc_1,Bama,Selma
doc_1,Europe,UK
doc_2,HSBC,HK Bank Central
doc_2,NC,Charlotte
chtb_165.en,Xinhua News Agency,Xinhua News Agency
chtb_165.en,Shanghai,Shanghai
chtb_165.en,HSBC,HSBC
chtb_165.en,China Shipping Mansion,International Ocean Shipping Building
chtb_165.en,Pudong Lujiazui financial trading district,Lujaizui
chtb_165.en,Pudong,Pudong
chtb_165.en,US,United States
chtb_165.en,Citibank,Citibank
chtb_165.en,Hong Kong,Hong Kong
chtb_165.en,Japan,Japan
chtb_165.en,Tokyo Mitsubishi Bank,The Bank of Tokyo-Mitsubishi UFJ
VOA20001129.2000.036,Washington,"Washington, D.C."
VOA20001129.2000.036,Supreme Court,Supreme Court of the United States
VOA20001129.2000.036,Joe O'Grossman,Joel Grossman
VOA20001129.2000.036,Baltimore,Baltimore
VOA20001129.2000.036,Johns Hopkins University,Johns Hopkins University
VOA20001129.2000.036,Lawrence Tribe,Laurence Tribe
VOA20001129.2000.036,Gore,Al Gore
VOA20001129.2000.036,legislature,Florida Legislature
VOA20001129.2000.036,Congress,United States Congress
reference.csv
doc_name,text
doc_1,The US
doc_1,Georgia's Fried Chicken
doc_1,Bama Football
doc_1,HSBC
doc_1,Bank of America
doc_1,NC Panthers
doc_1,MI Packers
doc_1,NC Panthers
VOA20001129.2000.036,the Bush
VOA20001129.2000.036,American election
VOA20001129.2000.036,Congress
VOA20001129.2000.036,George W Bush
chtb_165.en,Xinhua News Agency
chtb_165.en,Shanghai
chtb_165.en,HSBC
chtb_165.en,China Shipping
chtb_165.en,Mansion
chtb_165.en,RMB
chtb_165.en,the US
chtb_165.en,"Citibank , Hong Kong"
chtb_165.en,Japan
chtb_165.en,Tokyo Mitsubishi Bank
chtb_165.en,Industrial Bank
chtb_165.en,Branch
chtb_165.en,Chartered Bank
chtb_165.en,BNP
chtb_165.en,Paris
chtb_165.en,Bank
chtb_165.en,Dai-Ichi Kangyo Bank
chtb_165.en,Sanwa Bank
chtb_165.en,Financial Trading
chtb_165.en,District
chtb_165.en,Franklin Templeton
chtb_165.en,Company
chtb_165.en,California
chtb_165.en,US dollars
chtb_165.en,China
chtb_165.en,Asian
chtb_165.en,Securities
chtb_165.en,Building
chtb_165.en,Hong Kong
chtb_165.en,Japan Industrial Bank
chtb_165.en,Holland
chtb_165.en,Belgium
chtb_165.en,Credit Bank
chtb_165.en,Waitan
Create a helper column that looks for the mentions inside the text using the regex or operator (|). Once a text is matched to its mention, the merge can happen.
import pandas as pd
import re

gold = pd.read_csv('C:/test/gold.csv')
reference = pd.read_csv('C:/test/reference.csv')

# Alternation pattern of all mentions; re.escape guards against
# regex metacharacters inside a mention.
pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('(' + pat + ')', expand=False)
df = pd.merge(gold, reference, how='left',
              left_on=['doc_name', 'mention'],
              right_on=['doc_name', 'mention_test']).drop('mention_test', axis=1)
df.to_csv('output.csv', index=False)
Output:
print(df.to_string())
doc_name mention id text
0 doc_1 US United States The US
1 doc_1 Georgia Atl Georgia's Fried Chicken
2 doc_1 Bama Selma Bama Football
3 doc_1 Europe UK NaN
4 doc_2 HSBC HK Bank Central NaN
5 doc_2 NC Charlotte NaN
6 chtb_165.en Xinhua News Agency Xinhua News Agency Xinhua News Agency
7 chtb_165.en Shanghai Shanghai Shanghai
8 chtb_165.en HSBC HSBC HSBC
9 chtb_165.en China Shipping Mansion International Ocean Shipping Building NaN
10 chtb_165.en Pudong Lujiazui financial trading district Lujaizui NaN
11 chtb_165.en Pudong Pudong NaN
12 chtb_165.en US United States the US
13 chtb_165.en US United States US dollars
14 chtb_165.en Citibank Citibank Citibank , Hong Kong
15 chtb_165.en Hong Kong Hong Kong Hong Kong
16 chtb_165.en Japan Japan Japan
17 chtb_165.en Japan Japan Japan Industrial Bank
18 chtb_165.en Tokyo Mitsubishi Bank The Bank of Tokyo-Mitsubishi UFJ Tokyo Mitsubishi Bank
19 VOA20001129.2000.036 Washington Washington, D.C. NaN
20 VOA20001129.2000.036 Supreme Court Supreme Court of the United States NaN
21 VOA20001129.2000.036 Joe O'Grossman Joel Grossman NaN
22 VOA20001129.2000.036 Baltimore Baltimore NaN
23 VOA20001129.2000.036 Johns Hopkins University Johns Hopkins University NaN
24 VOA20001129.2000.036 Lawrence Tribe Laurence Tribe NaN
25 VOA20001129.2000.036 Gore Al Gore NaN
26 VOA20001129.2000.036 legislature Florida Legislature NaN
27 VOA20001129.2000.036 Congress United States Congress Congress
Addition:
Merging those extra rows into one row (keeping the same number of rows as the original gold.csv):
import pandas as pd
import re

# gold and reference are already loaded as above.
pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('(' + pat + ')', expand=False)
df = pd.merge(gold, reference, how='left',
              left_on=['doc_name', 'mention'],
              right_on=['doc_name', 'mention_test']).drop('mention_test', axis=1)

# Collapse duplicate matches into a single row, joining the texts with '; '.
duplicates = df[df.duplicated(subset=['doc_name', 'mention', 'id'], keep=False)]
aux = duplicates.groupby(['doc_name', 'mention', 'id'])['text'].apply('; '.join).reset_index()
df = df.drop(duplicates.index)
df = pd.concat([df, aux]).reset_index(drop=True)  # df.append was removed in pandas 2.0
df.to_csv('output.csv', index=False)
Output:
print(df.to_string())
doc_name mention id text
0 doc_1 US United States The US
1 doc_1 Georgia Atl Georgia's Fried Chicken
2 doc_1 Bama Selma Bama Football
3 doc_1 Europe UK NaN
4 doc_2 HSBC HK Bank Central NaN
5 doc_2 NC Charlotte NaN
6 chtb_165.en Xinhua News Agency Xinhua News Agency Xinhua News Agency
7 chtb_165.en Shanghai Shanghai Shanghai
8 chtb_165.en HSBC HSBC HSBC
9 chtb_165.en China Shipping Mansion International Ocean Shipping Building NaN
10 chtb_165.en Pudong Lujiazui financial trading district Lujaizui NaN
11 chtb_165.en Pudong Pudong NaN
12 chtb_165.en Citibank Citibank Citibank , Hong Kong
13 chtb_165.en Hong Kong Hong Kong Hong Kong
14 chtb_165.en Tokyo Mitsubishi Bank The Bank of Tokyo-Mitsubishi UFJ Tokyo Mitsubishi Bank
15 VOA20001129.2000.036 Washington Washington, D.C. NaN
16 VOA20001129.2000.036 Supreme Court Supreme Court of the United States NaN
17 VOA20001129.2000.036 Joe O'Grossman Joel Grossman NaN
18 VOA20001129.2000.036 Baltimore Baltimore NaN
19 VOA20001129.2000.036 Johns Hopkins University Johns Hopkins University NaN
20 VOA20001129.2000.036 Lawrence Tribe Laurence Tribe NaN
21 VOA20001129.2000.036 Gore Al Gore NaN
22 VOA20001129.2000.036 legislature Florida Legislature NaN
23 VOA20001129.2000.036 Congress United States Congress Congress
24 chtb_165.en Japan Japan Japan; Japan Industrial Bank
25 chtb_165.en US United States the US; US dollars
Addition 2:
Finally, to keep only the first match, we drop the duplicates but keep the first instance:
import pandas as pd
import re

gold = pd.read_csv('C:/test/gold.csv')
reference = pd.read_csv('C:/test/reference.csv')

pat = '|'.join(re.escape(x) for x in gold.mention)
reference['mention_test'] = reference.text.str.extract('(' + pat + ')', expand=False)
df = pd.merge(gold, reference, how='left',
              left_on=['doc_name', 'mention'],
              right_on=['doc_name', 'mention_test']).drop('mention_test', axis=1)

# Keep only the first matching text per gold row.
df = df.drop_duplicates(subset=['doc_name', 'mention', 'id'], keep='first')
df.to_csv('output.csv', index=False)