Count the frequency of 2-word combinations in all the rows of a column

I want to count the frequency of 2-word combinations across all the rows of a column.

I have a table with two columns - the first column is a sentence and the other column is the bigram tokenization of that sentence.

Sentence words
'beautiful day suffered through ' 'beautiful day'
'beautiful day suffered through ' 'day suffered'
'beautiful day suffered through ' 'suffered through'
'cannot hold back tears ' 'cannot hold'
'cannot hold back tears ' 'hold back'
'cannot hold back tears ' 'back tears'
'ash back tears beautiful day ' 'ash back'
'ash back tears beautiful day ' 'back tears'
'ash back tears beautiful day ' 'tears beautiful'
'ash back tears beautiful day ' 'beautiful day'
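
For reference, the sample table above can be rebuilt with something like the following sketch (pandas assumed; the column names Sentence and words are taken from the samples):

import pandas as pd

# Rebuild the example table shown above: one row per (sentence, bigram) pair.
df = pd.DataFrame({
    'Sentence': ['beautiful day suffered through '] * 3
              + ['cannot hold back tears '] * 3
              + ['ash back tears beautiful day '] * 4,
    'words': ['beautiful day', 'day suffered', 'suffered through',
              'cannot hold', 'hold back', 'back tears',
              'ash back', 'back tears', 'tears beautiful', 'beautiful day'],
})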

The output I want is a column that counts how often each word pair occurs across all the sentences of the whole df['Sentence'] column, like this:

Sentence Words Total
'beautiful day suffered through ' 'beautiful day' 2
'beautiful day suffered through ' 'day suffered' 1
'beautiful day suffered through ' 'suffered through' 1
'cannot hold back tears ' 'cannot hold' 1
'cannot hold back tears ' 'hold back' 1
'cannot hold back tears ' 'back tears' 2
'ash back tears beautiful day ' 'ash back' 1
'ash back tears beautiful day ' 'back tears' 2
'ash back tears beautiful day ' 'tears beautiful' 1
'ash back tears beautiful day ' 'beautiful day' 2

And so on.

The code I tried repeats the same frequency as the first bigram for every row until the sentence ends:

df.Sentence.str.count('|'.join(df.words.tolist()))

So that is not what I am looking for, and it also takes very long, because my original df is much bigger.

Is there an alternative, or a function in NLTK or any other library, for this?

My understanding is that you want the bi-gram counts over every unique sentence. The answer is already present in the words column; value_counts() delivers it.

df.merge(df['words'].value_counts(), how='left', left_on='words', right_index=True, suffixes=(None,'_total')) 

                           Sentence             words  words_total
0  beautiful day suffered through       beautiful day            2
1  beautiful day suffered through        day suffered            1
2  beautiful day suffered through    suffered through            1
3          cannot hold back tears         cannot hold            1
4          cannot hold back tears           hold back            1
5          cannot hold back tears          back tears            2
6    ash back tears beautiful day            ash back            1
7    ash back tears beautiful day          back tears            2
8    ash back tears beautiful day     tears beautiful            1
9    ash back tears beautiful day       beautiful day            2
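
If you only need the extra column, a shorter equivalent is to map the counts straight back onto the column (a minimal sketch, again assuming the bigram column is named words). Note that on newer pandas releases (2.0+) value_counts() names its result count, so the merged column from the snippet above comes out as count rather than words_total; map() sidesteps the naming entirely.

# Broadcast each bigram's total back onto its rows without a merge.
df['Total'] = df['words'].map(df['words'].value_counts())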

I would suggest:

  • First, strip the quotes and the leading/trailing spaces from Sentence and words:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
  • Then cast Sentence and words to string objects:
data = data.astype({"Sentence":str, "words": str})
print(data)

#Output
                          Sentence            words
0   beautiful day suffered through     beautiful day
1   beautiful day suffered through      day suffered
2   beautiful day suffered through  suffered through
3           cannot hold back tears       cannot hold
4           cannot hold back tears         hold back
5           cannot hold back tears        back tears
6     ash back tears beautiful day          ash back
7     ash back tears beautiful day        back tears
8     ash back tears beautiful day   tears beautiful
9     ash back tears beautiful day     beautiful day
  • Count how many times the given words occur in the Sentence of the same row and store the result in a column, e.g. words_occur:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])
data["words_occur"] = data.apply(words_in_sent, axis=1)
  • Finally, group by words and sum up their occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)

Result

                          Sentence          words    words_occur total
0   beautiful day suffered through     beautiful day           1     2
1   beautiful day suffered through      day suffered           1     1
2   beautiful day suffered through  suffered through           1     1
3           cannot hold back tears       cannot hold           1     1
4           cannot hold back tears         hold back           1     1
5           cannot hold back tears        back tears           1     2
6     ash back tears beautiful day          ash back           1     1
7     ash back tears beautiful day        back tears           1     2
8     ash back tears beautiful day   tears beautiful           1     1
9     ash back tears beautiful day     beautiful day           1     2
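
As for the original question about NLTK or another library: since every bigram already sits in its own row, a plain collections.Counter (or nltk.FreqDist, which is a Counter subclass) over the words column gives the same totals, assuming each bigram occurs at most once inside its own sentence, as in the sample above (words_occur is 1 everywhere). A rough sketch, with total_alt as an illustrative column name:

from collections import Counter

# Count each bigram once per row, then look the totals up row by row.
totals = Counter(data["words"])
data["total_alt"] = data["words"].map(totals)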