How to use spaCy to count the number of words in each row of a column?

I have a DataFrame like this:

import pandas as pd
data = [['A', '</b>          A 1960\'s era\n\n<b>'], ['B', '</b>           Yeah, I know.\n'], ['C', '</b>           This is a cute cat\n\n<b>']]
df = pd.DataFrame(data, columns = ['id', 'text'])
>>> df
  id                                      text
0  A         </b>          A 1960's era\n\n<b>
1  B            </b>           Yeah, I know.\n
2  C  </b>           This is a cute cat\n\n<b>

Any suggestions on how to count the words using spaCy?

The ideal output is:

data = [['A', '</b>          A 1960\'s era\n\n<b>', 3], ['B', '</b>           Yeah, I know.\n' , 3], ['C', '</b>           This is a cute cat\n\n<b>',5]]
df = pd.DataFrame(data, columns = ['id', 'text','number_of_words'])
>>> df
  id                                      text  number_of_words
0  A         </b>          A 1960's era\n\n<b>                3
1  B            </b>           Yeah, I know.\n                3
2  C  </b>           This is a cute cat\n\n<b>                5

Not spaCy, but maybe...

Try:

df['text2'] = df['text'].str.replace(r'<[^>]*>', '', regex=True).str.strip()
df['len'] = df['text2'].str.split().str.len()
print(df)

Output:

    id  text                            text2               len
0   A   </b> A 1960's era\n\n<b>        A 1960's era        3
1   B   </b> Yeah, I know.\n            Yeah, I know.       3
2   C   </b> This is a cute cat\n\n<b>  This is a cute cat  5
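
For reference, the same two steps (strip anything that looks like a tag, then split on whitespace) can be sketched in plain Python without pandas. The helper name `count_words` is made up for this illustration and is not part of any library:

```python
import re

# Same pattern as in the pandas snippet: anything between "<" and ">".
TAG_RE = re.compile(r"<[^>]*>")

def count_words(text):
    """Strip tag-like substrings, then count whitespace-separated chunks."""
    cleaned = TAG_RE.sub("", text)
    # str.split() with no argument splits on any run of whitespace
    # (spaces, newlines) and ignores leading/trailing whitespace.
    return len(cleaned.split())

rows = [
    "</b>          A 1960's era\n\n<b>",
    "</b>           Yeah, I know.\n",
    "</b>           This is a cute cat\n\n<b>",
]
print([count_words(r) for r in rows])
```

Note that punctuation stays attached to the neighbouring word here ("Yeah," is one chunk), which is why the counts match the desired output for this data.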

If you want to use the spaCy tokenizer, I'd suggest a list comprehension:

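import re
import spacy

nlp = spacy.load('en_core_web_sm')
myvals = df['text'].values.tolist()

# Strip the <b> and </b> tags, tokenize with spaCy, and count tokens that are neither whitespace nor punctuation
df['number_of_words'] = [len([i for i in nlp(re.sub('</b>|<b>', '', i2)) if not (i.is_space or i.is_punct)]) for i2 in myvals]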

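The one-liner basically says: for each element in the list, strip the HTML tags, process the result with spaCy, and count every token that is neither whitespace nor punctuation.

Output: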

    id  text                               number_of_words
0   A   </b> A 1960's era\n\n<b>           4
1   B   </b> Yeah, I know.\n               3
2   C   </b> This is a cute cat\n\n<b>     5

But as you can see, spaCy counts "1960's" as two tokens here.

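So unless spaCy gives you a big advantage, or you really need a specific tokenizer, counting words by simply splitting on whitespace is definitely the preferred way, as suggested by @MDR.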
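
If you do go with spaCy, the counting logic can be pulled out of the one-liner into a small function, which is easier to read and lets spaCy process the texts as a stream via `nlp.pipe`. This is only a sketch: the name `count_tokens` is hypothetical, and the pipeline is passed in as a parameter so that the function itself does not depend on a particular model.

```python
import re

# Matches any tag-like substring such as <b> or </b>.
TAG_RE = re.compile(r"<[^>]*>")

def count_tokens(texts, nlp):
    """Return, per text, the number of tokens that are neither whitespace nor punctuation.

    `nlp` is assumed to be a loaded spaCy pipeline (anything that offers
    .pipe() and yields tokens with .is_space and .is_punct).
    """
    cleaned = (TAG_RE.sub("", text) for text in texts)
    return [
        sum(1 for tok in doc if not (tok.is_space or tok.is_punct))
        for doc in nlp.pipe(cleaned)
    ]

# Usage (requires spaCy and the en_core_web_sm model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   df["number_of_words"] = count_tokens(df["text"], nlp)
```

The counts still follow spaCy's tokenization rules, so "1960's" is counted the same way as in the list comprehension above.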