Creating signatures for txt files

Question

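I am writing code that counts all the words in a text file. It should rank the 25 most frequent words in the file (called the signature) and store them in a list. The signatures should then be compared using the Jaccard similarity measure. I have code for Jaccard similarity, but I need to adapt it for my program because I took it from another example. The code that creates the signature gives me this error: Column 'Prophet' has dtype object, cannot use method 'nlargest' with this dtype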

I was doing research on how I can sort all the words by the top 25 most common ones and this is the most efficient way but it's giving me this error. Is there another way that I can sort this out? Also, how would I then add the 25 words into a new list for each text? Any feedback is greatly appreciated. Please show the change in code. Thanks in advance!

Answer 1

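Cause of the error:

nlargest cannot operate on an object column, such as a column of strings. At least one column in your df (other than "Word") has dtype object, so the error was raised on that column.

Fixing the error:

Run print(df.dtypes) to check the data type of each column. Then either avoid applying nlargest to object columns, or convert the object column to a numeric type (e.g. int or float).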
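As a minimal sketch of that check and conversion: the frame below is hypothetical (the asker's real data was not shown) and assumes the counts were read in as strings, which is one common way a count column ends up non-numeric. pd.to_numeric is used for the conversion.

```python
import pandas as pd

# Hypothetical frame: the counts are strings, so 'Count' is not a numeric column.
df = pd.DataFrame({
    'Word': ['happy', 'hello', 'you'],
    'Count': ['27', '32', '19'],
})

# Inspect the dtype of every column before calling nlargest.
print(df.dtypes)

# Convert to a numeric dtype; values that cannot be parsed become NaN
# instead of raising an exception.
df['Count'] = pd.to_numeric(df['Count'], errors='coerce')

# nlargest now works because 'Count' is numeric.
top2 = df.nlargest(2, 'Count')
print(top2)
```

With errors='coerce', unparseable entries silently turn into NaN, so it is worth checking df['Count'].isna().sum() afterwards if the source data may be dirty.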

how I can sort all the words by the top 25 most common ones and this is the most efficient way ... Is there another way that I can sort this out?

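I think nlargest is the right choice; just make sure the column you apply nlargest to holds the counts of the corresponding words. Also, if you have such a column, you don't actually need a loop over all the columns, because you only need to call nlargest on that one column.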

Also, how would I then add the 25 words into a new list for each text?

See this example:

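import pandas as pd

# One row per word, with its count in a numeric column
df = pd.DataFrame({
    'Word': ['happy', 'hello', 'you', 'he', 'she', 'it'],
    'Count': [27, 32, 19, 6, 80, 5]
})

# Index by word, then take the 3 largest counts
largest = df.set_index('Word')['Count'].nlargest(3)
print(largest)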

Output:

Word
she      80
hello    32
happy    27
Name: Count, dtype: int64

You get a pd.Series of the 3 largest counts, with the corresponding words as the index. You can then extract the index and convert it to a list with

largest.index.tolist()
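To build such a list for each text, one possible sketch is shown here. The signature helper, the regex tokenizer, and the placeholder texts are all assumptions for illustration and are not taken from the asker's code; the word counting is done with collections.Counter.

```python
import re
from collections import Counter

import pandas as pd


def signature(text, n=25):
    """Return the n most frequent words in text as a list (hypothetical helper)."""
    # Simple tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # A Series indexed by word; the explicit integer dtype keeps nlargest
    # usable even when the text contains no words at all.
    series = pd.Series(counts, dtype='int64')
    return series.nlargest(n).index.tolist()


# Placeholder texts; in practice these would be read from the .txt files.
texts = {
    'a': "the cat and the dog and the bird",
    'b': "the sun and the moon",
}
signatures = {name: signature(text, n=2) for name, text in texts.items()}
print(signatures)
```

Words tied at the cutoff are resolved by nlargest's default keep='first' setting, so which tied word makes it into the signature depends on the order in which words appear in the text.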

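Finally, if your dataframe's columns are laid out differently from my example above, you should reshape your dataframe to match it. If you are unsure how to do that, share a subset of your dataframe here for us to look at. You can export the first 10 rows with print(df.head(10).to_dict('list')) and paste the result as text into your question.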
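Once each text has a signature list, the Jaccard comparison mentioned in the question could look roughly like this. The jaccard function below is a generic sketch, not the asker's code (which was not shown), and the two example signatures are made up.

```python
def jaccard(sig_a, sig_b):
    """Jaccard similarity |A & B| / |A | B| of two word lists."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        # Convention chosen here: two empty signatures count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Made-up signatures for illustration.
sig_1 = ['she', 'hello', 'happy']
sig_2 = ['hello', 'happy', 'you']
print(jaccard(sig_1, sig_2))  # 2 shared words / 4 distinct words = 0.5
```

Converting the lists to sets discards the ranking within each signature, so two texts with the same top words in a different order score 1.0.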

Tags: python, similarity, dataframe, pandas