How to sentence tokenize within a dataframe

Question

One column of my dataframe contains text like this:

'This is very good. No it is very bad. Actually it is alright'

I want to sentence tokenize the text in this column, essentially creating a nested list of sentences.

I tried this:

from spacy.lang.en import English

def tokenizeAndList(text):
    raw_text = text
    nlp = English()
    nlp.add_pipe(nlp.create_pipe('sentencizer')) # updated
    doc = nlp(raw_text)
    sentences = [sent.string.strip() for sent in doc.sents]
    return sentences

out = myText['findings'].map(tokenizeAndList)

This gives me the error:

TypeError: object of type 'NAType' has no len()

How can I create the nested list?

Answer 1

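This happens because some values in the findings column are not strings. The NAType in the error message is the type of pandas' missing-value marker, pd.NA, and spaCy cannot take its length.

Before creating a spaCy document from it, check whether text is a str, and otherwise return the value unchanged: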

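from spacy.lang.en import English

# Build the pipeline once, outside the function, so it is not recreated for every row.
nlp = English()
# spaCy v3 syntax; on spaCy v2 use nlp.add_pipe(nlp.create_pipe('sentencizer')).
nlp.add_pipe('sentencizer')

def tokenizeAndList(text):
    # Only strings can be parsed; missing values such as pd.NA are returned unchanged.
    if isinstance(text, str):
        doc = nlp(text)
        # sent.text works on spaCy v2 and v3 (sent.string was removed in v3).
        return [sent.text.strip() for sent in doc.sents]
    else:
        return text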
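
To show how this behaves on a column with a missing value, here is a minimal sketch assuming spaCy v3 and pandas are installed. The small `myText` frame is made up for illustration; only the `findings` column name and the sample sentence come from the question.

```python
import pandas as pd
from spacy.lang.en import English

# Blank English pipeline with only the rule-based sentencizer (spaCy v3 syntax).
nlp = English()
nlp.add_pipe("sentencizer")

def tokenizeAndList(text):
    # Parse strings only; anything else (e.g. pd.NA) is passed through unchanged.
    if isinstance(text, str):
        return [sent.text.strip() for sent in nlp(text).sents]
    return text

# Hypothetical sample data: one real text and one missing value.
myText = pd.DataFrame({
    "findings": [
        "This is very good. No it is very bad. Actually it is alright",
        pd.NA,
    ]
})

# Each string becomes a list of sentences; the missing value stays missing.
out = myText["findings"].map(tokenizeAndList)
print(out.tolist())
```

If a flat list of sentences is preferred over one list per row, `out.dropna().explode()` is one option to consider.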

python

pandas

spacy