Ngrams from pandas column
I have a pandas dataframe with the following columns:
Column 1
['if', 'you', 'think', 'she', "'s", 'cute', 'now', ',', 'you', 'should', 'have', 'see', 'her', 'a', 'couple', 'of', 'year', 'ago', '.']
['uh', ',', 'yeah', '.', 'just', 'a', 'fax', '.']
Column 2
if you think she 's cute now , you should have see her a couple of year ago .
uh , yeah . just a fax .
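and so on.
My goal is to compute bigrams, trigrams, and four-grams for the dataframe (specifically for Column 2, which is already lemmatized).
I tried the following: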
import nltk
from nltk import bigrams
from nltk import trigrams
trig = trigrams(df ["Column2"])
print (trig)
However, I get the following error:
<generator object trigrams at 0x0000013C757F1C48>
My final goal is to be able to print the top X bigrams, trigrams, etc.
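trigrams returns a generator, so printing it directly only shows the generator object, and it expects a sequence of tokens rather than a whole column of strings. Use a list comprehension with split, flattening the trigrams of all rows into one list first: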
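import pandas as pd
from nltk import trigrams

df = pd.DataFrame({'Column2': ["if you think she cute now you if uh yeah just",
                               'you think she uh yeah just a fax']})

# Split each row into tokens, then flatten the trigrams of all rows into one list
L = [tri for text in df['Column2'] for tri in trigrams(text.split())]
print(L)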
[('if', 'you', 'think'), ('you', 'think', 'she'), ('think', 'she', 'cute'),
('she', 'cute', 'now'), ('cute', 'now', 'you'), ('now', 'you', 'if'),
('you', 'if', 'uh'), ('if', 'uh', 'yeah'), ('uh', 'yeah', 'just'),
('you', 'think', 'she'), ('think', 'she', 'uh'), ('she', 'uh', 'yeah'),
('uh', 'yeah', 'just'), ('yeah', 'just', 'a'), ('just', 'a', 'fax')]
Then count the tuples with collections.Counter:
from collections import Counter
c = Counter(L)
print (c)
Counter({('you', 'think', 'she'): 2, ('uh', 'yeah', 'just'): 2, ('if', 'you', 'think'): 1,
('think', 'she', 'cute'): 1, ('she', 'cute', 'now'): 1, ('cute', 'now', 'you'): 1,
('now', 'you', 'if'): 1, ('you', 'if', 'uh'): 1, ('if', 'uh', 'yeah'): 1,
('think', 'she', 'uh'): 1, ('she', 'uh', 'yeah'): 1,
('yeah', 'just', 'a'): 1, ('just', 'a', 'fax'): 1})
For the top values use collections.Counter.most_common:
top = c.most_common(5)
print (top)
[(('you', 'think', 'she'), 2), (('uh', 'yeah', 'just'), 2),
(('if', 'you', 'think'), 1), (('think', 'she', 'cute'), 1),
(('she', 'cute', 'now'), 1)]
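The same split, count, and most_common steps can be generalized to bigrams and four-grams. The following is only a minimal sketch: the helper name top_ngrams and its parameters are assumptions made for illustration and appear in neither nltk nor pandas. It builds the n-grams with plain zip so that it runs without extra dependencies, and it accepts any iterable of strings, so a column such as df['Column2'] could be passed in place of the list.

```python
from collections import Counter

def top_ngrams(texts, n, k):
    """Return the k most common n-grams over an iterable of strings.

    Hypothetical helper for illustration; not part of nltk or pandas.
    """
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Zipping n staggered slices yields every run of n consecutive tokens
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)

# Same sample rows as in the answer; df['Column2'] would work the same way
rows = ["if you think she cute now you if uh yeah just",
        "you think she uh yeah just a fax"]

top_bigrams = top_ngrams(rows, 2, 4)
top_trigrams = top_ngrams(rows, 3, 2)
top_fourgrams = top_ngrams(rows, 4, 5)
print(top_bigrams)
print(top_trigrams)
print(top_fourgrams)
```

Rows shorter than n tokens simply contribute no n-grams, because the staggered slices run out before zip can form a tuple.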