Counting tokenized words in a data frame with pandas (Python)
I have created tokenized data (text) in a data frame in Python.
I just want to count the tokenized data and get an output that shows how often each element in the tokenized data repeats.
This is the code I used to create the tokenized data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
def tokenize(txt):
    tokens = re.split(r'\W+', txt)
    return tokens
Complains['clean_text_tokenized'] = Complains['clean text'].apply(lambda x: tokenize(x.lower()))
# Complains['clean text'] is the original file of the data
Complains['clean_text_tokenized'].head(10)
This is the output of the tokenized data:
0 [comcast, cable, internet, speeds]
1 [payment, disappear, service, got, disconnected]
2 [speed, and, service]
3 [comcast, imposed, a, new, usage, cap, of, 300...
4 [comcast, not, working, and, no, service, to, ...
5 [isp, charging, for, arbitrary, data, limits, ...
6 [throttling, service, and, unreasonable, data,...
7 [comcast, refuses, to, help, troubleshoot, and...
8 [comcast, extended, outages]
9 [comcast, raising, prices, and, not, being, av...
Name: clean_text_tokenized, dtype: object
Any suggestions would be helpful.
You can use Counter:
from collections import Counter
# ... and then
def tokenize(txt):
    return Counter(re.split(r'\W+', txt))
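With this version, each row holds its own Counter. A minimal sketch (using a small made-up `clean text` column, not your real data) showing that the per-row Counters can then be summed into one overall frequency table, since Counter supports addition:

```python
import re
from collections import Counter

import pandas as pd

def tokenize(txt):
    # One Counter per row of word frequencies
    return Counter(re.split(r'\W+', txt))

# Hypothetical sample data standing in for the real Complains frame
Complains = pd.DataFrame({'clean text': [
    'comcast cable internet speeds',
    'payment disappear service got disconnected',
    'speed and service',
]})

per_row = Complains['clean text'].apply(lambda x: tokenize(x.lower()))

# Merge the row-level Counters into one global Counter
total = sum(per_row, Counter())
print(total.most_common(3))
```

Summing Counters this way is O(rows × vocabulary), which is fine at this scale; for very large frames the flattened-list approach below is usually faster.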
See this Python test:
from collections import Counter
import pandas as pd
import re
Complains = pd.DataFrame({'clean text':['comcast, cable, internet, speeds', 'payment, disappear, service, got, disconnected']})
Complains['clean_text_tokenized'] = Complains['clean text'].str.findall(r'\w+')
freq = Counter([item for sublist in Complains['clean_text_tokenized'].to_list() for item in sublist])
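As an alternative to flattening into a Counter, you can stay inside pandas: explode the token lists into one token per row and count with `value_counts()`. A sketch on the same two sample rows:

```python
import pandas as pd

Complains = pd.DataFrame({'clean text': [
    'comcast, cable, internet, speeds',
    'payment, disappear, service, got, disconnected',
]})

# One list of tokens per row
Complains['clean_text_tokenized'] = Complains['clean text'].str.findall(r'\w+')

# explode() turns each list element into its own row;
# value_counts() then tallies the repetitions of each token
freq = Complains['clean_text_tokenized'].explode().value_counts()
print(freq)
```

This gives a pandas Series indexed by token, which sorts by frequency and plugs directly into `.plot()` or further frame operations.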