Get the Most Popular Trigrams for Each Row in a Pandas Dataframe
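I'm new to Python and am trying to get a list of the most popular trigrams for each row in a Pandas dataframe, from a column named ['Question'].
I've come close to what I need, but I can't get the popularity counts at the row level. Ideally, I would like to keep only n-grams at or above a minimum frequency of about 1.
Minimal reproducible example: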
import pandas as pd
import nltk
data = {
"question": [
"The quick brown fox jumps over the lazy dog",
"Waltz, bad nymph, for quick jigs vex.",
"Glib jocks quiz nymph to vex dwarf.",
"Sphinx of black quartz, judge my vow.",
"How vexingly quick daft zebras jump!",
]
}
df = pd.DataFrame(data)
df["bigrams"] = df['question'].apply(lambda row: list(nltk.bigrams(row.split(' '))))
print(df)
Current output:
question bigrams
0 The quick brown fox jumps over the lazy dog [(The, quick), (quick, brown), (brown, fox), (...
1 Waltz, bad nymph, for quick jigs vex. [(Waltz,, bad), (bad, nymph,), (nymph,, for), ...
2 Glib jocks quiz nymph to vex dwarf. [(Glib, jocks), (jocks, quiz), (quiz, nymph), ...
3 Sphinx of black quartz, judge my vow. [(Sphinx, of), (of, black), (black, quartz,), ...
4 How vexingly quick daft zebras jump! [(How, vexingly), (vexingly, quick), (quick, d...
Desired output (or something close to it; I'm not sure how best to represent the frequency counts!):
question bigrams
0 The quick brown fox jumps over the lazy dog [(The, quick,1), (quick, brown,1), (brown, fox), (...
1 Waltz, bad nymph, for quick jigs vex. [(Waltz,, bad,1), (bad, nymph,2), (nymph,, for), ...
2 Glib jocks quiz nymph to vex dwarf. [(Glib, jocks,1), (jocks,quiz,2),
3 Sphinx of black quartz, judge my vow. [(Sphinx, of,1), (of, black,2), (black, quartz,), ...
4 How vexingly quick daft zebras jump! [(How, vexingly,1), (vexingly, quick,1), (quick, d...
Input data (for demonstration purposes, all strings have already been cleaned):
data = ["she wants to sing she wants to act she wants to dance",
"if you sing I will smile if you laugh I will smile if you love I will smile"]
df = pd.DataFrame({"question": data})
Use nltk.FreqDist to compute the frequency distribution of the bigrams in each row:
bigram_freq = lambda s: list(nltk.FreqDist(nltk.bigrams(s.split(" "))).items())
out = df['question'].apply(bigram_freq).explode()
out = pd.DataFrame(out.to_list(), index=out.index, columns=["question", "bigrams"])
Output:
>>> out
question bigrams
0 (she, wants) 3
0 (wants, to) 3
0 (to, sing) 1
0 (sing, she) 1
0 (to, act) 1
0 (act, she) 1
0 (to, dance) 1
1 (if, you) 3
1 (you, sing) 1
1 (sing, I) 1
1 (I, will) 3
1 (will, smile) 3
1 (smile, if) 2
1 (you, laugh) 1
1 (laugh, I) 1
1 (you, love) 1
1 (love, I) 1
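The question asks for trigrams and a minimum frequency, while the answer above counts bigrams. The same idea can be sketched for any n-gram size. The helper below is a hypothetical illustration, not part of the original answer: the name top_ngrams and its parameters n, top_k and min_count are assumptions, and it uses collections.Counter from the standard library in place of nltk.FreqDist.

```python
from collections import Counter


def top_ngrams(text, n=3, top_k=None, min_count=1):
    """Return [(ngram, count), ...] for one string, most frequent first.

    Hypothetical helper: n is the n-gram size, top_k limits how many
    n-grams are kept (None keeps all), min_count drops rarer n-grams.
    """
    tokens = text.split()
    # Zip n staggered views of the token list to form the n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    # most_common(None) returns every n-gram sorted by descending count.
    return [(gram, c) for gram, c in counts.most_common(top_k) if c >= min_count]


rows = [
    "she wants to sing she wants to act she wants to dance",
    "if you sing I will smile if you laugh I will smile if you love I will smile",
]

# One list of (trigram, count) pairs per input row, keeping counts >= 2.
result = [top_ngrams(row, n=3, min_count=2) for row in rows]
for row_index, grams in enumerate(result):
    print(row_index, grams)
```

With a dataframe, the same function could likely be applied per row with df["question"].apply(top_ngrams, n=3, min_count=2), which keeps one list of (trigram, count) pairs in each cell, close to the layout sketched in the question.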