How can I compare 2 CSV files, check whether the values in the second column match, and count the number of occurrences of each value when they match?

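I want to iterate over 2 CSV files, check where the values in the two files match, and count how many times each value occurs when it matches. The output should be a dictionary.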

So I have two aligned CSV files. Each has 2 columns: "WORD" and "POS" (part-of-speech tag).

In some cases both files tag a word the same way, but in many other cases they differ. I want to count how many times the two files tag a word the same way.

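For example, if file1 has WORD "human" with POS "PERS" and file2 also has WORD "human" with POS "PERS", I want the output to be: {PERS: 2}, meaning that PERS matched twice across the two files. I want this for every tag: {TAG1: number of times it occurs and matches in both files, TAG2: number of times it occurs and matches in both files, etc.}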
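To make the expected result concrete, here is a minimal sketch on made-up data. The rows in `rows1` and `rows2` and the variable names are assumptions for illustration only, not the contents of the real files.

```python
from collections import Counter

# Hypothetical aligned (WORD, POS) rows standing in for file1 and file2.
rows1 = [("human", "PERS"), ("Paris", "LOC"), ("the", "O")]
rows2 = [("human", "PERS"), ("Paris", "ORG"), ("the", "O")]

matches = Counter()
for (word1, pos1), (word2, pos2) in zip(rows1, rows2):
    if pos1 == pos2:
        # Counted twice: once for each file in which the tag appears.
        matches[pos1] += 2

expected = dict(matches)
print(expected)  # {'PERS': 2, 'O': 2}
```

"Paris" is tagged LOC in one list and ORG in the other, so neither tag is counted for that row.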

I have only figured out how to read one CSV file and count how many times each POS tag is used, with the following code:

import csv
from collections import defaultdict


def count_NER_tags(filename):
    """
    Obtains the counts of each tag for the given csv file.
    """
    dict_NER_counts = defaultdict(int)

    with open(filename, "r") as csvfile:
        read_csv = csv.reader(csvfile, delimiter="\t")
        next(read_csv)  # skip the header
        for row in read_csv:
            # Index of the tag column; if the file has only the WORD and
            # POS columns, this should be row[1].
            dict_NER_counts[row[2]] += 1

    return dict_NER_counts


output: 
{'O': 42123, 'ORG': 2092, 'LOC': 2094, 'MISC': 1268, 'PERS': 3145}

I don't know how to implement "if POS in file1 == POS in file2" after reading both CSV files and then add the counts to a dictionary, as in the code above.

  1. Use pandas.read_csv()
  2. Read both files as CSV
  3. Merge the two dataframes
  4. Group by POS and count the rows
  5. Create a dictionary from the aggregated dataframe

The code would look like this:

import pandas as pd

# The question's files are tab-delimited, so pass sep='\t'
df1 = pd.read_csv('path to file 1', sep='\t')
df2 = pd.read_csv('path to file 2', sep='\t')

# The column names are the same, so the merge uses both columns as keys
# (an inner join on WORD and POS values, not a row-by-row comparison)
df_merged = df1.merge(df2)

# Count occurrences of `WORD` grouped by `POS`
df_merged = df_merged.groupby(['POS']).count().reset_index()

tags_dict = dict(zip(df_merged['POS'], df_merged['WORD']))
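Because `merge` joins on values, a (WORD, POS) pair that occurs several times in both files is multiplied in the merged result. If the files are aligned line by line, a positional comparison may be closer to what was asked. The following is a minimal sketch under that assumption; the in-memory sample data and the names `file1`, `file2`, and `same_tag` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated contents standing in for the two files.
file1 = io.StringIO("WORD\tPOS\nhuman\tPERS\nParis\tLOC\nthe\tO\n")
file2 = io.StringIO("WORD\tPOS\nhuman\tPERS\nParis\tORG\nthe\tO\n")

df1 = pd.read_csv(file1, sep="\t")
df2 = pd.read_csv(file2, sep="\t")

# Compare the POS columns row by row; this needs both frames to have the
# same number of rows, otherwise pandas raises a ValueError.
same_tag = df1["POS"] == df2["POS"]

# Count each tag among the rows where both files agree.
tags_dict = df1.loc[same_tag, "POS"].value_counts().to_dict()
print(tags_dict)
```

This counts each agreeing row once; multiply the values by 2 if every agreement should count once per file.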

Hope this helps!

I find it a bit odd that when the same WORD has the same POS in both files you call it two matches rather than one; to me that is just one match.

Anyway... I think the following does what you want (if I have understood correctly what you want to do).

import csv
from collections import defaultdict

def count_tag_matches(filename1, filename2):
    """
    Counts the number of tags that had the same value in both CSV files.
    """
    dict_counts = defaultdict(int)

    with open(filename1, "r", newline='') as csvfile1, \
         open(filename2, "r", newline='') as csvfile2:

        reader1 = csv.DictReader(csvfile1, delimiter="\t")
        reader2 = csv.DictReader(csvfile2, delimiter="\t")

        for row1, row2 in zip(reader1, reader2):
            if row1['POS'] == row2['POS']:
                dict_counts[row1['POS']] += 2

    return dict(dict_counts)  # Return a regular dictionary.

counts = count_tag_matches('cmp_file1.csv', 'cmp_file2.csv')
print(counts)

Output from processing the sample files:

{'A': 2, 'O': 2, 'PERS': 2}
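If each agreement should count as one match rather than two, a variant along these lines might work. It is only a sketch: the function name `count_single_matches`, the alignment check on WORD, and the in-memory sample data are assumptions added for illustration. Note that `zip` stops at the end of the shorter file, so trailing rows in a longer file are silently ignored.

```python
import csv
import io
from collections import Counter


def count_single_matches(file1, file2, delimiter="\t"):
    """Count each tag once per row where both files agree on POS."""
    reader1 = csv.DictReader(file1, delimiter=delimiter)
    reader2 = csv.DictReader(file2, delimiter=delimiter)
    counts = Counter()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, (row1, row2) in enumerate(zip(reader1, reader2), start=2):
        if row1["WORD"] != row2["WORD"]:
            raise ValueError(f"files are not aligned at line {line_no}")
        if row1["POS"] == row2["POS"]:
            counts[row1["POS"]] += 1
    return dict(counts)


# Hypothetical sample data; open real files with newline='' instead.
sample1 = io.StringIO("WORD\tPOS\nhuman\tPERS\nParis\tLOC\nthe\tO\n")
sample2 = io.StringIO("WORD\tPOS\nhuman\tPERS\nParis\tORG\nthe\tO\n")

single_counts = count_single_matches(sample1, sample2)
print(single_counts)  # {'PERS': 1, 'O': 1}
```

The WORD check is optional; it simply fails early if the two files drift out of alignment instead of comparing tags of different words.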