How to count words in a DataFrame using a word list?
I have a question about counting words with Python.
The DataFrame has three columns (id, text, word).
First, here is the sample table.
[DataFrame]
import pandas as pd

df = pd.DataFrame({
    "id": [
        "100",
        "200",
        "300"
    ],
    "text": [
        "The best part of Zillow is you can search/view thousands of home within a click of a button without even stepping out of your door.At the comfort of your home you can get all the details such as the floor plan, tax history, neighborhood, mortgage calculator, school ratings etc. and also getting in touch with the contact realtor is just a click away and you are scheduled for the home tour!As a first time home buyer, this website greatly helped me to study the market before making the right choice.",
        "I love all of the features of the Zillow app, especially the filtering options and the feature that allows you to save customized searches.",
        "Data is not updated spontaneously. Listings are still shown as active while the Mls shows pending or closed."
    ],
    "word": [
        "[best, word, door, subway, rain]",
        "[item, best, school, store, hospital]",
        "[gym, mall, pool, playground]",
    ]
})
I have already split the text into a dictionary.
So, for each row, I want to check the word list against the text.
This is the result I want:
| id | word dict |
| -- | ----------------------------------------------- |
| 100| {best: 1, word: 0, door: 1, subway: 0 , rain: 0} |
| 200| {item: 0, best: 0, school: 0, store: 0, hospital: 0} |
| 300| {gym: 0, mall: 0, pool: 0, playground: 0} |
Please take a look at this problem.
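We can use re to extract all the words from the list string. Note that the pattern \w+ matches runs of letters, digits, and underscores, so it picks up each entry while skipping the brackets, commas, and spaces. We then write a function that returns a dict with the count of each word in the list, and apply that function row by row to create a new column in df.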
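import re

def count_words(row):
    # Pull every word out of the bracketed list string, e.g. "[best, word]".
    words = re.findall(r'\w+', row['word'])
    # str.count counts substring occurrences and is case-sensitive.
    return {word: row['text'].count(word) for word in words}

df['word_counts'] = df.apply(count_words, axis=1)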
Output:
id ... word_counts
0 100 ... {'best': 1, 'word': 0, 'door': 1, 'subway': 0,...
1 200 ... {'item': 0, 'best': 0, 'school': 0, 'store': 0...
2 300 ... {'gym': 0, 'mall': 0, 'pool': 0, 'playground': 0}
[3 rows x 4 columns]
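One caveat with the approach above: `str.count` counts substrings, so a listed word such as `best` would also be counted inside `bestseller`, and `Best` would be missed because the match is case-sensitive. If whole-word, case-insensitive counts are what you need, a minimal sketch could look like the following. The helper name `count_whole_words` and the tiny sample frame are made up for illustration and are not part of the original question.

```python
import re
import pandas as pd

def count_whole_words(text, word_list_str):
    """Count whole-word, case-insensitive matches of each listed word in text."""
    # Extract the entries from a string such as "[best, door, rain]".
    words = re.findall(r"\w+", word_list_str)
    counts = {}
    for word in words:
        # \b anchors the match at word boundaries; re.escape guards against
        # regex metacharacters in the listed word.
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical sample data, chosen to show the difference from str.count.
df = pd.DataFrame({
    "id": ["100", "200"],
    "text": ["The best bestseller is at the door.", "Best of the best"],
    "word": ["[best, door, rain]", "[best, item]"],
})

df["word_counts"] = df.apply(
    lambda row: count_whole_words(row["text"], row["word"]), axis=1
)
print(df["word_counts"].tolist())
```

With plain `str.count`, the first row would report `best` twice because of `bestseller`; the word-boundary pattern counts it once.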
Since your word column holds strings, first convert it to a list:
df['word'] = df['word'].str[1:-1].str.split(',')
Now you can use apply with axis=1 together with the logic that counts each word:
df[['text', 'word']].apply(lambda row: {item:row['text'].count(item) for item in row['word']}, axis=1)
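Output (note that splitting on ',' alone leaves a leading space on every item after the first, so the keys below include it):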
Out[32]:
0 {'best': 1, ' word': 0, ' door': 1, ' subway':...
1 {'item': 0, ' best': 0, ' school': 0, ' store'...
2 {'gym': 0, ' mall': 0, ' pool': 0, ' playgroun...
dtype: object
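If the leading spaces in the keys are unwanted, one option is to strip each item after splitting. The sketch below assumes the word column always looks like `"[a, b, c]"`; the `word_list` column name and the shortened sample texts are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical, shortened sample data in the same shape as the question's.
df = pd.DataFrame({
    "text": ["The best part is near the door.", "I love the app."],
    "word": ["[best, word, door]", "[item, best]"],
})

# Drop the surrounding brackets, split on commas, and strip each item
# so that no key carries a leading or trailing space.
df["word_list"] = df["word"].str.strip("[]").apply(
    lambda s: [item.strip() for item in s.split(",")]
)

# Count substring occurrences of each cleaned word in the row's text.
df["word_counts"] = df.apply(
    lambda row: {w: row["text"].count(w) for w in row["word_list"]}, axis=1
)
print(df["word_counts"].tolist())
```

Keeping the parsed list in a separate column leaves the original `word` strings untouched, which makes it easier to re-run the parsing step.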