spaCy and text cleaning, getting rid of '<br /><br />'

Question

I'm using spaCy with Python and trying to clean some text for sklearn. I run this loop:

cleaned_text = []
for text in df.text_all:
    doc = nlp(str(text))
    # keep the lemma of every token that is neither punctuation nor a stop word
    cleaned = [token.lemma_ for token in doc if not token.is_punct and not token.is_stop]
    cleaned_text.append(' '.join(cleaned))

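It works pretty well, but it leaves '<br /><br />' in some of the texts. I thought the token.is_punct filter would remove it, but it doesn't. I looked for something like an HTML-tag filter but couldn't find anything. Does anyone know what I can do?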

Answer 1

You can use a regular expression:

import re

# ...
cleaned = [token.lemma_...

clean_regex = re.compile(r'<.*?>')  # non-greedy: matches one <...> span at a time
cleantext = clean_regex.sub('', ' '.join(cleaned))

cleaned_text.append(cleantext)

Note: this approach will not work correctly if your text contains any '<' characters other than those in the <br /> tags.
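If stray '<' characters are a concern, one option is a narrower pattern that matches only <br> variants and is applied to the raw string before it goes to nlp(). The following is a minimal sketch, not part of the original answer; the names BR_TAG and strip_br are made up for illustration, and it assumes the only markup in the data is line-break tags.

```python
import re

# Matches <br>, <br/>, <br /> in any letter case; other '<' characters are left alone.
BR_TAG = re.compile(r'<\s*br\s*/?\s*>', re.IGNORECASE)

def strip_br(text):
    """Replace each <br> variant with a space, then collapse runs of whitespace."""
    without_tags = BR_TAG.sub(' ', text)
    return ' '.join(without_tags.split())

print(strip_br('Great film.<br /><br />Score: 3 < 5'))
```

In the loop from the question this would presumably be called as nlp(strip_br(str(text))), so the tags are gone before tokenization.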

Hope this helps!

Tags: python, nlp, scikit-learn, spacy