How to use spacy to count the number of words in each row of a column?
I have a dataframe like this:
import pandas as pd
data = [['A', '</b> A 1960\'s era\n\n<b>'], ['B', '</b> Yeah, I know.\n'], ['C', '</b> This is a cute cat\n\n<b>']]
df = pd.DataFrame(data, columns = ['id', 'text'])
>>> df
  id                            text
0  A        </b> A 1960's era\n\n<b>
1  B            </b> Yeah, I know.\n
2  C  </b> This is a cute cat\n\n<b>
Any suggestions on how to use spacy to count the number of words?
The ideal output would be:
data = [['A', '</b> A 1960\'s era\n\n<b>', 3], ['B', '</b> Yeah, I know.\n' , 3], ['C', '</b> This is a cute cat\n\n<b>',5]]
df = pd.DataFrame(data, columns = ['id', 'text','number_of_words'])
>>> df
  id                            text  number_of_words
0  A        </b> A 1960's era\n\n<b>                3
1  B            </b> Yeah, I know.\n                3
2  C  </b> This is a cute cat\n\n<b>                5
Not spacy, but maybe...
Try:
# strip anything that looks like an HTML tag, then count whitespace-separated tokens
df['text2'] = df['text'].str.replace(r'<[^>]*>', '', regex=True).str.strip()
df['len'] = df['text2'].str.split().str.len()
print(df)
Output:
  id                            text               text2  len
0  A        </b> A 1960's era\n\n<b>        A 1960's era    3
1  B            </b> Yeah, I know.\n       Yeah, I know.    3
2  C  </b> This is a cute cat\n\n<b>  This is a cute cat    5
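If you only need the requested number_of_words column (and don't want to keep the intermediate text2 column), the same idea collapses into a single assignment. This is just a sketch of the approach above, assuming the only markup to remove is the <...> tags:

# strip anything that looks like an HTML tag, then count whitespace-separated tokens
df['number_of_words'] = (
    df['text']
      .str.replace(r'<[^>]*>', '', regex=True)
      .str.split()
      .str.len()
)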
If you want to use the spacy tokenizer, I would suggest a list comprehension:
import re
import spacy

nlp = spacy.load('en_core_web_sm')
myvals = df['text'].values.tolist()
df['number_of_words'] = [len([i for i in nlp(re.sub('</b>|<b>', '', i2)) if not (i.is_space or i.is_punct)]) for i2 in myvals]
The one-liner basically says: for each element in the list, strip the HTML markup, process it with spacy, and count the tokens that are neither whitespace nor punctuation.
Output:
  id                            text  number_of_words
0  A        </b> A 1960's era\n\n<b>                4
1  B            </b> Yeah, I know.\n                3
2  C  </b> This is a cute cat\n\n<b>                5
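If readability matters more than a one-liner, the same spacy-based count can be written with a small helper and nlp.pipe (which also batches the documents). This is only a sketch; count_words is an illustrative name, not part of spacy:

import re
import spacy

nlp = spacy.load('en_core_web_sm')

def count_words(doc):
    # count tokens that are neither whitespace nor punctuation
    return sum(not (tok.is_space or tok.is_punct) for tok in doc)

# remove the <b>/</b> markers, then process all rows in one pass
cleaned = [re.sub('</b>|<b>', '', t) for t in df['text']]
df['number_of_words'] = [count_words(doc) for doc in nlp.pipe(cleaned)]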
But as you can see, '1960's' is counted as two tokens by spacy here.
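If you want to see where the extra token comes from, printing the tokens makes it obvious (a quick check, assuming en_core_web_sm is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
print([tok.text for tok in nlp("A 1960's era")])
# expected to print something like: ['A', '1960', "'s", 'era']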
So unless spacy gives you a big advantage, or you really need that specific tokenizer, counting words by simply splitting on whitespace is definitely the preferred way, as suggested by @MDR.