Count the frequency of 2-word combinations across all rows of a column
I want to count the frequency of 2-word combinations across all rows of a column.
I have a table with two columns - the first holds a sentence, and the other holds the bigram tokenization of that sentence.
Sentence | words |
---|---|
'beautiful day suffered through ' | 'beautiful day' |
'beautiful day suffered through ' | 'day suffered' |
'beautiful day suffered through ' | 'suffered through' |
'cannot hold back tears ' | 'cannot hold' |
'cannot hold back tears ' | 'hold back' |
'cannot hold back tears ' | 'back tears' |
'ash back tears beautiful day ' | 'ash back' |
'ash back tears beautiful day ' | 'back tears' |
'ash back tears beautiful day ' | 'tears beautiful' |
'ash back tears beautiful day ' | 'beautiful day' |
The output I want is a column counting how often each bigram occurs across all sentences in the entire df['Sentence'] column. Like this:
Sentence | Words | Total |
---|---|---|
'beautiful day suffered through ' | 'beautiful day' | 2 |
'beautiful day suffered through ' | 'day suffered' | 1 |
'beautiful day suffered through ' | 'suffered through' | 1 |
'cannot hold back tears ' | 'cannot hold' | 1 |
'cannot hold back tears ' | 'hold back' | 1 |
'cannot hold back tears ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'ash back' | 1 |
'ash back tears beautiful day ' | 'back tears' | 2 |
'ash back tears beautiful day ' | 'tears beautiful' | 1 |
'ash back tears beautiful day ' | 'beautiful day' | 2 |
And so on.
The code I tried repeats the same per-sentence frequency for every row of that sentence:
df.Sentence.str.count('|'.join(df.words.tolist()))
So that is not what I'm looking for, and it also takes a long time, since my real df is much larger.
Is there an alternative, or a function in NLTK or any other library?
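For reference, the bigrams can also be regenerated and counted directly from the sentences with the standard library, avoiding the regex alternation entirely. This is only a sketch: it assumes whitespace tokenization and uses a hand-built copy of the example sentences.

```python
from collections import Counter

# Hand-built reconstruction of the example sentences (assumption)
sentences = [
    "beautiful day suffered through",
    "cannot hold back tears",
    "ash back tears beautiful day",
]

def sentence_bigrams(sentence):
    """Return the bigrams of a sentence as 'word1 word2' strings."""
    tokens = sentence.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# One Counter pass over all bigrams gives the corpus-wide frequency
counts = Counter(bg for s in sentences for bg in sentence_bigrams(s))
print(counts["beautiful day"])  # 2: first and third sentence
```

Each bigram is counted once per row it appears on, which matches the desired Total column.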
My understanding is that you want a bigram count per unique sentence. The answer is already present in the words column; value_counts() delivers it.
df.merge(df['words'].value_counts(), how='left', left_on='words', right_index=True, suffixes=(None,'_total'))
Sentence words words_total
0 beautiful day suffered through beautiful day 2
1 beautiful day suffered through day suffered 1
2 beautiful day suffered through suffered through 1
3 cannot hold back tears cannot hold 1
4 cannot hold back tears hold back 1
5 cannot hold back tears back tears 2
6 ash back tears beautiful day ash back 1
7 ash back tears beautiful day back tears 2
8 ash back tears beautiful day tears beautiful 1
9 ash back tears beautiful day beautiful day 2
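An equivalent result can be had without a merge by mapping each bigram onto its own value_counts() table. A sketch, where the frame is a hand-built reconstruction of the example (quotes and padding already stripped):

```python
import pandas as pd

# Hand-built reconstruction of the example frame (assumption)
df = pd.DataFrame({
    "Sentence": ["beautiful day suffered through"] * 3
                + ["cannot hold back tears"] * 3
                + ["ash back tears beautiful day"] * 4,
    "words": ["beautiful day", "day suffered", "suffered through",
              "cannot hold", "hold back", "back tears",
              "ash back", "back tears", "tears beautiful", "beautiful day"],
})

# map() looks each bigram up in the frequency table built by
# value_counts(), producing the same totals as the merge, row-aligned
df["words_total"] = df["words"].map(df["words"].value_counts())
print(df)
```

Since value_counts() is indexed by bigram, map() is a straight lookup and avoids the join machinery.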
I suggest:
- First remove the quotes and the leading/trailing whitespace from Sentence and words:
data = data.apply(lambda x: x.str.replace("'", ""))
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
- Then cast Sentence and words to string objects:
data = data.astype({"Sentence":str, "words": str})
print(data)
#Output
Sentence words
0 beautiful day suffered through beautiful day
1 beautiful day suffered through day suffered
2 beautiful day suffered through suffered through
3 cannot hold back tears cannot hold
4 cannot hold back tears hold back
5 cannot hold back tears back tears
6 ash back tears beautiful day ash back
7 ash back tears beautiful day back tears
8 ash back tears beautiful day tears beautiful
9 ash back tears beautiful day beautiful day
- Count how many times the given bigram occurs in the sentence on the same row, and store it in a column, e.g. words_occur:
def words_in_sent(row):
    return row["Sentence"].count(row["words"])

data["words_occur"] = data.apply(words_in_sent, axis=1)
- Finally group by words and sum their occurrences:
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)
Result:
Sentence words words_occur total
0 beautiful day suffered through beautiful day 1 2
1 beautiful day suffered through day suffered 1 1
2 beautiful day suffered through suffered through 1 1
3 cannot hold back tears cannot hold 1 1
4 cannot hold back tears hold back 1 1
5 cannot hold back tears back tears 1 2
6 ash back tears beautiful day ash back 1 1
7 ash back tears beautiful day back tears 1 2
8 ash back tears beautiful day tears beautiful 1 1
9 ash back tears beautiful day beautiful day 1 2
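Putting those steps together, a self-contained sketch of the whole pipeline might look like the following; the raw frame is reconstructed by hand, quotes and trailing spaces included, as an assumption about the original data.

```python
import pandas as pd

# Hand-built reconstruction of the raw frame (assumption)
data = pd.DataFrame({
    "Sentence": ["'beautiful day suffered through '"] * 3
                + ["'cannot hold back tears '"] * 3
                + ["'ash back tears beautiful day '"] * 4,
    "words": ["'beautiful day'", "'day suffered'", "'suffered through'",
              "'cannot hold'", "'hold back'", "'back tears'",
              "'ash back'", "'back tears'", "'tears beautiful'", "'beautiful day'"],
})

# 1) Strip quotes and surrounding whitespace from both columns
data = data.apply(lambda col: col.str.replace("'", "").str.strip())

# 2) Count each row's bigram inside its own sentence
data["words_occur"] = data.apply(
    lambda row: row["Sentence"].count(row["words"]), axis=1
)

# 3) Sum those per-row counts over identical bigrams
data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)
```

Unlike the plain value_counts() approach, words_occur would also catch a bigram occurring more than once within a single sentence.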