Create code in Python to get the most frequent tag and value pair from a list

Question

I have a .txt file with 3 columns: word position, word, and tag (NN, VB, JJ, etc.).

Example of the txt file:

1   i   PRP

2   want    VBP

3   to  TO

4   go  VB

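I want to find how often each word and tag occur together as a pair in the list, so that I can find the tag most commonly assigned to each word. Example result: 3 (food, NN), 2 (brave, ADJ)

My idea is to start by opening the file from the folder, read it line by line and split each line, set up a counter using a dictionary, and print the results in descending order from most to least common.

My code is extremely rough (I'm almost embarrassed to post it):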

file=open("/Users/Desktop/Folder1/trained.txt")
wordcount={}
for word in file.read().split():
    from collections import Counter
    c = Counter()
    for d in dicts.values():
        c += Counter(d)

print(c.most_common())

file.close()

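Obviously, I'm not getting any results. Anything would help. Thanks.

Update:

I got the following code, which was posted here, to work, but my results are a bit odd. Here is the code (the author deleted it, so I don't know whom to credit):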

file=open("/Users/Desktop/Folder1/trained.txt").read().split('\n')

d = {}
for i in file:
    if i[1:] in d.keys():
        d[i[1:]] += 1
    else:
        d[i[1:]] = 1

print (sorted(d.items(), key=lambda x: x[1], reverse=True))

Here are my results:

[('', 15866), ('\t.\t.', 9479), ('\ti\tPRP', 7234), ('\tto\tTO', 4329), ('\tlike\tVB', 2533), ('\tabout\tIN', 2518), ('\tthe\tDT', 2389), ('\tfood\tNN', 2092), ('\ta\tDT', 2053), ('\tme\tPRP', 1870), ('\twant\tVBP', 1713), ('\twould\tMD', 1507), ('0\t.\t.', 1427), ('\teat\tVB', 1390), ('\trestaurant\tNN', 1371), ('\tuh\tUH', 1356), ('1\t.\t.', 1265), ('\ton\tIN', 1237), ("\t'd\tMD", 1221), ('\tyou\tPRP', 1145), ('\thave\tVB', 1127), ('\tis\tVBZ', 1098), ('\ttell\tVB', 1030), ('\tfor\tIN', 987), ('\tdollars\tNNS', 959), ('\tdo\tVBP', 956), ('\tgo\tVB', 931), ('2\t.\t.', 912), ('\trestaurants\tNNS', 899),

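The output seems to mix good results that contain words with other results that contain spaces or random numbers. Does anyone know a way to remove the entries that aren't real words? Also, I know \t is supposed to represent a tab; is there a way to remove it? You all have been a great help.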

Answer 1

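You need a separate collections.Counter for each word. This code uses defaultdict to create a dictionary of counters, so there is no need to check whether each word has already been seen.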

from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("/Users/Desktop/Folder1/trained.txt") as file:
    for row in file:           # read one line into `row`
        if not row.strip():
            continue           # ignore empty lines
        pos, word, tag = row.split()   # split() also drops the tabs
        counts[word.lower()][tag] += 1

That's it; you can now check the most common tag for any word:

print(counts["food"].most_common(1))
# Prints [("NN", 3)] or whatever
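A note on the odd entries in the updated question: `i[1:]` only drops the first character of each line, so a multi-digit position leaves a digit behind (hence keys like '0\t.\t.'), the tabs stay in the key, and blank lines are counted as ''. Splitting each row on whitespace avoids all three. Below is a minimal sketch that lists the most frequent tag for every word in the "count (word, tag)" form the question asks for. The sample rows and the helper name `count_tags` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows in the same "position<TAB>word<TAB>tag" layout.
SAMPLE = [
    "1\ti\tPRP",
    "2\twant\tVBP",
    "3\tfood\tNN",
    "",
    "1\tFood\tNN",
    "2\tfood\tVB",
    "10\tfood\tNN",
]


def count_tags(rows):
    """Return {word: Counter({tag: n})} for whitespace-separated rows."""
    counts = defaultdict(Counter)
    for row in rows:
        fields = row.split()      # split() discards tabs, spaces and newlines
        if len(fields) != 3:
            continue              # skip blank or malformed lines
        _pos, word, tag = fields
        counts[word.lower()][tag] += 1
    return counts


counts = count_tags(SAMPLE)

# Collect the single most frequent tag of each word as (count, word, tag).
best = []
for word, tag_counts in counts.items():
    tag, n = tag_counts.most_common(1)[0]
    best.append((n, word, tag))

# Most common pairs first.
best.sort(key=lambda item: item[0], reverse=True)

for n, word, tag in best:
    print(f"{n} ({word}, {tag})")
```

To run this on the real file, pass the open file object instead of `SAMPLE`, since iterating over a file yields one line at a time.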

Answer 2

If you don't mind using pandas, which is a great library for tabular data, I would do the following:

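import pandas as pd
# The columns are separated by tabs/whitespace, so split on any run of whitespace
df = pd.read_csv("/Users/Desktop/Folder1/trained.txt", sep=r"\s+", header=None, names=["position", "word", "tag"])
# Attach to every row the number of times its (word, tag) pair occurs
df["word_tag_counts"] = df.groupby(["word", "tag"])["position"].transform("count")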

Then, if you only want the maximum for each group, you can do:

df.groupby(["word", "tag"]).max()["word_tag_counts"]

This should give you a table with the values you want.
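The table above has one row per (word, tag) pair. To reduce it to a single most frequent tag per word, one possible approach is sketched below, assuming pandas is installed. The in-memory sample, the column name `count`, and the `quoting` and `keep_default_na` settings are illustrative assumptions rather than part of the original answer.

```python
import csv
import io

import pandas as pd

# Hypothetical sample standing in for trained.txt (tab-separated, with a blank line).
sample = "1\ti\tPRP\n2\twant\tVBP\n3\tfood\tNN\n\n1\tfood\tNN\n2\tfood\tVB\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["position", "word", "tag"],
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary tokens
    keep_default_na=False,    # keep words such as "null" or "NA" as strings
)

# One row per (word, tag) pair with the number of times it occurs.
pair_counts = df.groupby(["word", "tag"]).size().reset_index(name="count")

# Sort by count and keep the first (largest) row for each word.
top = (
    pair_counts.sort_values("count", ascending=False)
    .drop_duplicates("word")
    .reset_index(drop=True)
)
print(top)
```

If two tags tie for a word, this keeps whichever row happens to sort first, so add a secondary sort key if a deterministic choice matters.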

python

nlp

pos-tagger

training-data