How can I compare two CSV files, check whether the values in the second column match, and count the occurrences of each value when they match?
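I want to iterate over two CSV files, check where the values in the two files match, and count how often each value occurs when it matches. The output should be a dictionary.
I have two aligned CSV files. Each has 2 columns: "WORD" and "POS" (part-of-speech tag).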
Click to see example of file 1
Click to see example of file 2
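In some cases a word is tagged the same way in both files, but in many others it is not. I want to count how many times the two files tag a word the same way.
For example, if file1 has WORD "human" with POS "PERS", and file2 also has WORD "human" with POS "PERS", I want the output to be: {PERS: 2}
This means PERS matched twice across the two files. I want this for every tag:
{TAG1: number of times it occurs and matches in both, TAG2: number of times it occurs and matches in both, etc.}
I have only figured out how to read a single CSV file and count how many times each POS tag is used, with the following code: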
import csv
from collections import defaultdict

def count_NER_tags(filename):
    """
    Obtains the counts of each tag for the determined csv file
    """
    dict_NER_counts = defaultdict(int)
    with open(filename, "r") as csvfile:
        read_csv = csv.reader(csvfile, delimiter="\t")
        next(read_csv)  # skip the header
        for row in read_csv:
            dict_NER_counts[row[2]] += 1
    return dict_NER_counts
output:
{'O': 42123, 'ORG': 2092, 'LOC': 2094, 'MISC': 1268, 'PERS': 3145}
I don't know how to implement "if POS in file1 == POS in file2" after reading the two CSV files, and then add the counts to a dictionary as in the code above.
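- Read both files with pandas.read_csv()
- Merge the two data frames
- Group by POS and count the rows
- Create a dictionary from the aggregated data frame
The code would look like this: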
import pandas as pd

# sep="\t" assumes the tab-delimited files used in the question's code
df1 = pd.read_csv('path to file 1', sep="\t")
df2 = pd.read_csv('path to file 2', sep="\t")
# The column names are the same, so the merge uses both columns as keys
df_merged = df1.merge(df2)
# Count occurrences of `WORD` grouped by `POS`
df_merged = df_merged.groupby(['POS']).count().reset_index()
tags_dict = dict(zip(df_merged['POS'], df_merged['WORD']))
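One caveat worth noting: merge() joins rows by value rather than by position, so if the same WORD/POS pair occurs several times in the files, the merged frame can contain more rows than there are aligned matches. If the two files really are aligned line by line, a positional comparison may be closer to what was asked. A minimal sketch, assuming tab-delimited files with WORD and POS columns and the same number of rows; the in-memory sample data is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the two tab-delimited files.
file1 = io.StringIO("WORD\tPOS\nhuman\tPERS\nwalks\tO\nParis\tLOC\nhuman\tPERS\n")
file2 = io.StringIO("WORD\tPOS\nhuman\tPERS\nwalks\tO\nParis\tORG\nhuman\tO\n")

df1 = pd.read_csv(file1, sep="\t")
df2 = pd.read_csv(file2, sep="\t")

# Compare the POS columns row by row (rows are assumed to be aligned).
same_tag = df1["POS"].reset_index(drop=True) == df2["POS"].reset_index(drop=True)

# Keep only the rows where the tags agree and count each tag once per row.
tags_dict = df1.loc[same_tag, "POS"].value_counts().to_dict()
print(tags_dict)
```

This counts each matching row once; multiply the values by 2 if a match should count twice, as in the question.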
Hope this helps!
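I find it a little odd that, when the same WORD has the same POS in both files, you call that two matches rather than one. To me that is just one match.
Anyway... I think the following will do what you want (if I have understood you correctly).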
import csv
from collections import defaultdict

def count_tag_matches(filename1, filename2):
    """
    Counts the number of tags that had the same value in both CSV files.
    """
    dict_counts = defaultdict(int)
    with open(filename1, "r", newline='') as csvfile1, \
         open(filename2, "r", newline='') as csvfile2:
        reader1 = csv.DictReader(csvfile1, delimiter="\t")
        reader2 = csv.DictReader(csvfile2, delimiter="\t")
        for row1, row2 in zip(reader1, reader2):
            if row1['POS'] == row2['POS']:
                dict_counts[row1['POS']] += 2
    return dict(dict_counts)  # Return a regular dictionary.

counts = count_tag_matches('cmp_file1.csv', 'cmp_file2.csv')
print(counts)
Output from processing the sample files:
{'A': 2, 'O': 2, 'PERS': 2}
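If you would rather count each agreement once, the increment can be made a parameter. Below is a rough sketch of that variant using collections.Counter. The weight parameter and the in-memory sample data are my own additions for illustration, and the function takes open file objects instead of filenames so it can be tried without files on disk:

```python
import csv
import io
from collections import Counter

def count_tag_matches(file1, file2, weight=1):
    """Count rows whose POS tag is identical in both (aligned) files."""
    reader1 = csv.DictReader(file1, delimiter="\t")
    reader2 = csv.DictReader(file2, delimiter="\t")
    counts = Counter()
    # zip() pairs the rows by position and stops at the shorter file.
    for row1, row2 in zip(reader1, reader2):
        if row1["POS"] == row2["POS"]:
            counts[row1["POS"]] += weight
    return dict(counts)

# Hypothetical sample data standing in for the two files.
sample1 = "WORD\tPOS\nhuman\tPERS\nwalks\tO\nParis\tLOC\n"
sample2 = "WORD\tPOS\nhuman\tPERS\nwalks\tO\nParis\tORG\n"

once = count_tag_matches(io.StringIO(sample1), io.StringIO(sample2))
twice = count_tag_matches(io.StringIO(sample1), io.StringIO(sample2), weight=2)
print(once)   # {'PERS': 1, 'O': 1}
print(twice)  # {'PERS': 2, 'O': 2}
```

With weight=2 this reproduces the "two matches per agreeing row" convention from the question.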