Pre-process text string with NLTK

I have a dataframe A containing docid (document ID), title (article title), lineid (line ID, i.e. the position of the paragraph), text, and tokencount (the number of words, including spaces):

  docid   title  lineid                                         text        tokencount
0     0     A        0   shopping and orders have become more com...                66
1     0     A        1  people wrote to the postal service online...                67
2     0     A        2   text updates really from the U.S. Postal...                43
...

I want to create a new dataframe based on A that includes title, lineid, count, and query.

query is a text string containing one or more words, such as "data analysis", "text message", or "shopping and orders".

count is the number of occurrences of each word of query in text.

The new dataframe should look like this:

title  lemma   count   lineid
  A    "data"    0        0
  A    "data"    1        1
  A    "data"    4        2
  A    "shop"    2        0
  A    "shop"    1        1
  A    "shop"    2        2
  B    "data"    4        0
  B    "data"    0        1
  B    "data"    2        2
  B    "shop"    9        0
  B    "shop"    3        1
  B    "shop"    1        2
...

How can I create a function that generates this new dataframe?


I created a new dataframe df from A with a count column:

df = A[['title', 'lineid']].copy()  # copy to avoid a SettingWithCopyWarning
df['count'] = 0
df.set_index(['title', 'lineid'], inplace=True)

I also created a function that counts the occurrences of the query words:

from collections import Counter

def occurrence_counter(target_string, query):
    # count each whitespace-separated token in the string
    data = dict(Counter(target_string.split()))
    count = 0
    # sum the counts of every query word that appears
    for key in query:
        if key in data:
            count += data[key]
    return count
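Note that query here needs to be a list (or other iterable) of words; passing a plain string would iterate over its characters. A quick sanity check, with the function repeated so the snippet runs standalone:

```python
from collections import Counter

def occurrence_counter(target_string, query):
    # count each whitespace-separated token in the string
    data = dict(Counter(target_string.split()))
    count = 0
    for key in query:
        if key in data:
            count += data[key]
    return count

# "data" occurs twice; "shop" does not match the token "shopping"
print(occurrence_counter("data analysis of shopping data", ["data", "shop"]))  # prints 2
```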

But how do I use them in a function that generates the new dataframe?

This should handle your case:

import pandas as pd
from collections import Counter

query = "data analysis"
wordlist = query.split(" ")

# row-wise frequency count: map each text to a Counter of its tokens
df['text_new'] = df.text.str.split().apply(Counter)

records = []
# iterate row by row, emitting one record per (row, query word)
for index, row in df.iterrows():
    for word in wordlist:
        records.append({
            'title': row['title'],
            'lemma': word,
            'count': row['text_new'][word],  # a Counter returns 0 for missing words
            'lineid': row['lineid'],
        })

output = pd.DataFrame(records)
#print(output)

If I understand correctly, you can do this with built-in pandas functions: Series.str.count() to count the queries, and melt() to reshape into the final column structure.

Given the sample df:

df = pd.DataFrame({
    'docid': {0: 0, 1: 0, 2: 0},
    'title': {0: 'A', 1: 'A', 2: 'A'},
    'lineid': {0: 0, 1: 1, 2: 2},
    'text': {0: 'shopping and orders have become more com...',
             1: 'people wrote to the postal service online...',
             2: 'text updates really from the U.S. Postal...'},
    'tokencount': {0: 66, 1: 67, 2: 43},
})

#   docid  title  lineid                                          text
# 0     0      A       0   shopping and orders have become more com...
# 1     0      A       1  people wrote to the postal service online...
# 2     0      A       2   text updates really from the U.S. Postal...

First, count() the queries:

queries = ['order', 'shop', 'text']
df = df.assign(**{f'query_{query}': df.text.str.count(query) for query in queries})

#   docid  title  lineid                                          text  tokencount  query_order  query_shop  query_text
# 0     0      A       0   shopping and orders have become more com...          66            1           1           0
# 1     0      A       1  people wrote to the postal service online...          67            0           0           0
# 2     0      A       2   text updates really from the U.S. Postal...          43            0           0           1
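One caveat worth noting: Series.str.count() treats its argument as a regular expression and counts substring matches, which is why 'shop' matches inside 'shopping' above. A quick illustration:

```python
import pandas as pd

s = pd.Series(['shopping and orders', 'no match here'])
# substring/regex counting: 'shop' matches inside 'shopping'
print(s.str.count('shop').tolist())  # [1, 0]
```

To match whole words only, anchor the pattern with word boundaries, e.g. s.str.count(r'\bshop\b').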

Then melt() into the final column structure:

df.melt(
    id_vars=['title', 'lineid'],
    value_vars=[f'query_{query}' for query in queries],
    var_name='lemma',
    value_name='count',
).replace(r'^query_', '', regex=True)

#   title  lineid  lemma  count
# 0     A       0  order      1
# 1     A       1  order      0
# 2     A       2  order      0
# 3     A       0   shop      1
# 4     A       1   shop      0
# 5     A       2   shop      0
# 6     A       0   text      0
# 7     A       1   text      0
# 8     A       2   text      1
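To wrap this up as the function the question asks for, the two steps above can be combined into one helper (the name make_query_counts is my own):

```python
import pandas as pd

def make_query_counts(df, queries):
    """Return one row per (title, lineid, query word) with its occurrence count."""
    counted = df.assign(**{f'query_{q}': df.text.str.count(q) for q in queries})
    out = counted.melt(
        id_vars=['title', 'lineid'],
        value_vars=[f'query_{q}' for q in queries],
        var_name='lemma',
        value_name='count',
    )
    # strip the 'query_' prefix left over from the helper columns
    out['lemma'] = out['lemma'].str.replace(r'^query_', '', regex=True)
    return out[['title', 'lineid', 'lemma', 'count']]

df = pd.DataFrame({
    'docid': [0, 0, 0],
    'title': ['A', 'A', 'A'],
    'lineid': [0, 1, 2],
    'text': ['shopping and orders have become more com...',
             'people wrote to the postal service online...',
             'text updates really from the U.S. Postal...'],
})
result = make_query_counts(df, ['order', 'shop', 'text'])
```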