Is it possible to get a detailed list of word frequencies from Pandas Profiling?

Question

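I'm currently working through a large batch of files and need to check how often certain strings appear in them. My first idea was to import all the files into a single dataset and use a for loop to check every file for the strings, using the following code.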

import pandas as pd

# Define an empty dataframe to append all imported files to
df = pd.DataFrame()
new_list = []

# If a text file is imported successfully, append the resulting dataframe to df.
# If an exception occurs, append "None" instead.
# "`" was chosen as the delimiter to ensure that each file is saved to a single row.
for i in file_list:
    path = f"D:/Admin/3. OCR files/OCR_Translations/{i}"
    try:
        df_1 = pd.read_csv(path, delimiter="`")
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, df_1])
        new_list.append(path)
    except Exception:
        df = pd.concat([df, pd.DataFrame(["None"])])
        new_list.append("None")

df = df.T.reset_index()

# Search the dataset for the required keyword
count = 0

for i in df["index"]:
    # str() guards against non-string labels such as the 0 produced by the "None" rows
    if "Keyword1" in str(i):
        count += 1

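This ultimately failed because there is absolutely no guarantee that the strings in these files are spelled correctly: the files in question were generated by an OCR program (both the program and the files are in Thai).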

Pandas Profiling produces exactly what I need for the task at hand, except that it doesn't give the full list, as shown at this link (https://imgur.com/xxf1Qnx). Is there a way to get the full list of word frequencies from Pandas Profiling? I've checked the pandas_profiling documentation (https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html) to see if there is anything I can do, and so far I haven't seen anything there related to my use case.

Answer 1

You ~~probably don't~~ don't need Pandas to count word occurrences in files.

import collections

word_counter = collections.Counter()

for i in file_list:
    with open(f"D:/Admin/3. OCR files/OCR_Translations/{i}") as f:
        for line in f:
            words = line.strip().split()  # Split line by whitespaces.
            word_counter.update(words)  # Update counter with occurrences.


print(word_counter)

You may also be interested in the .most_common() method on the counter.
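As a rough sketch of how that might look, the snippet below builds a small counter from made-up words and asks for the two most frequent entries. The `sample_words` list is purely illustrative and stands in for words read from your files.

```python
import collections

# Hypothetical sample data standing in for words read from the OCR files.
sample_words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

word_counter = collections.Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs,
# ordered from highest count to lowest.
top_two = word_counter.most_common(2)
print(top_two)  # [('apple', 3), ('banana', 2)]
```

Calling .most_common() with no argument returns every entry in the same order, which is the full frequency list.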

Also, if you really need to, you can turn the Counter into a dataframe; it's just a dict with special powers.
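One possible way to do that conversion is sketched below, assuming pandas is installed. The counter contents and the column names `word` and `count` are made up for illustration.

```python
import collections

import pandas as pd

# Hypothetical counter standing in for the one built from the files.
sample_counter = collections.Counter(["apple", "banana", "apple"])

# most_common() yields (word, count) pairs sorted by count, so each pair
# becomes one row; the column names here are arbitrary choices.
freq_df = pd.DataFrame(sample_counter.most_common(), columns=["word", "count"])
print(freq_df)
```

From there the usual dataframe tools apply, for example filtering rows by count or writing the table out with to_csv.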

python

pandas

pandas-profiling