Pre-process text string with NLTK
I have a dataframe A with columns docid (document ID), title (article title), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (the number of words, counting spaces):
docid title lineid text tokencount
0 0 A 0 shopping and orders have become more com... 66
1 0 A 1 people wrote to the postal service online... 67
2 0 A 2 text updates really from the U.S. Postal... 43
...
I want to create a new dataframe based on A that contains title, lineid, count, and query. query is a text string of one or more words, such as "data analysis", "text message", or "shopping and orders". count is the number of occurrences of each word of query in the text.
The new dataframe should look like this:
title lemma count lineid
A "data" 0 0
A "data" 1 1
A "data" 4 2
A "shop" 2 0
A "shop" 1 1
A "shop" 2 2
B "data" 4 0
B "data" 0 1
B "data" 2 2
B "shop" 9 0
B "shop" 3 1
B "shop" 1 2
...
How can I write a function that generates this new dataframe?
I created a new dataframe df from A with a count column:
df = A[['title','lineid']]
df['count'] = 0
df.set_index(['title','lineid'], inplace=True)
I also wrote a function that counts the occurrences of the query words:
from collections import Counter

def occurrence_counter(target_string, query):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count
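As a side note, the helper above expects query to be an iterable of words; passing a bare string would iterate over its characters instead. A minimal self-contained check (the sample sentence is made up for illustration):

```python
from collections import Counter

def occurrence_counter(target_string, query):
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

# pass the query as a list of words, not as one string
print(occurrence_counter("shopping and orders and shopping", ["shopping", "orders"]))  # 3
```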
But how do I use these to build a function that produces the new dataframe?
This should handle your case:
import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split(" ")
#print(wordlist)

# row-wise frequency count
df['text_new'] = df.text.str.split().apply(Counter)

# collect one record per (row, query word); building a list and creating
# the frame once avoids the deprecated DataFrame.append API
records = []
for index, row in df.iterrows():
    for word in wordlist:
        records.append({
            'title': row['title'],
            'lemma': word,
            'count': row['text_new'][word],  # Counter returns 0 for absent words
            'lineid': row['lineid'],
        })
output = pd.DataFrame(records)
#print(output)
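The lookup row['text_new'][word] works even for words that never occur in a line, because Counter returns 0 for missing keys where a plain dict would raise KeyError. A minimal illustration:

```python
from collections import Counter

# count the words of one sample line
c = Counter("shopping and orders".split())

print(c["orders"])  # 1
print(c["data"])    # 0 -- missing keys default to 0 instead of raising KeyError
```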
If I understand correctly, you can do this with built-in pandas functions: Series.str.count() to count the queries, and melt() to reshape into the final column structure.
Given a sample df:
df = pd.DataFrame({'docid': {0: 0, 1: 0, 2: 0}, 'title': {0: 'A', 1: 'A', 2: 'A'}, 'lineid': {0: 0, 1: 1, 2: 2}, 'text': {0: 'shopping and orders have become more com...', 1: 'people wrote to the postal service online...', 2: 'text updates really from the U.S. Postal...'}, 'tokencount': {0: 66, 1: 67, 2: 43}})
# docid title lineid text
# 0 0 A 0 shopping and orders have become more com...
# 1 0 A 1 people wrote to the postal service online...
# 2 0 A 2 text updates really from the U.S. Postal...
First, count() the queries:
queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})
# docid title lineid text tokencount query_order query_shop query_text
# 0 0 A 0 shopping and orders have become more com... 66 1 1 0
# 1 0 A 1 people wrote to the postal service online... 67 0 0 0
# 2 0 A 2 text updates really from the U.S. Postal... 43 0 0 1
Then melt() into the final column structure:
df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)
# title lineid lemma count
# 0 A 0 order 1
# 1 A 1 order 0
# 2 A 2 order 0
# 3 A 0 shop 1
# 4 A 1 shop 0
# 5 A 2 shop 0
# 6 A 0 text 0
# 7 A 1 text 0
# 8 A 2 text 1
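One caveat: Series.str.count() treats each query as a regular expression and counts substring matches, which is why query_shop is 1 in the first row above even though the text contains "shopping", not "shop". If whole-word counts are wanted, the pattern can be anchored with word boundaries (a sketch, not part of the original answer):

```python
import pandas as pd

s = pd.Series(['shopping and orders have become more common'])

# plain pattern: 'shop' matches inside 'shopping'
print(s.str.count('shop').iloc[0])        # 1

# \b word boundaries restrict the match to the whole word 'shop'
print(s.str.count(r'\bshop\b').iloc[0])   # 0
```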