Why is my output returned in a stripped format so that it cannot be lemmatized/stemmed in Python?

The first step is to tokenize the text in the dataframe with NLTK. Then I use TextBlob to apply spelling correction; for that, I convert the output from a tuple to a string. After that, I need to lemmatize/stem it (with NLTK). The problem is that my output comes back in a stripped format, so it cannot be lemmatized/stemmed.

#create a dataframe
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]
df["text2"] = df["text"].apply(tokenize)

#spelling correction
def spell_eng(text):
    text = TextBlob(str(text)).correct()
    #flatten the corrected TextBlob into a plain str
    text = functools.reduce(operator.add, text)
    return text
df['text3'] = df['text2'].apply(spell_eng)


#lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]
df['text4'] = df['text3'].apply(stem_eng)

Generated output:

Desired output:

text4
--------------
[spell]
[be]
[work,cook,listen]
[study]
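
For reference, the lemmatizer itself produces exactly these base forms when it is given real words (assuming the WordNet data has been downloaded with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("was", "v"))       # be
print(lemmatizer.lemmatize("working", "v"))   # work
print(lemmatizer.lemmatize("studying", "v"))  # study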

I found where the problem is: the dataframe stores those arrays as strings, so each row of text3 is a single string rather than a list of words, and the lemmatization therefore does not work. Note also that this comes from the spell_eng part.
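
A minimal sketch of the failure mode (standalone, with a made-up string): iterating over a plain str yields individual characters, so stem_eng ends up lemmatizing single letters instead of words:

corrected = "working"            # spell_eng returns one str per row
print([w for w in corrected])    # ['w', 'o', 'r', 'k', 'i', 'n', 'g']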

I wrote a solution with a few small changes to your code.

import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening","studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]
df["text2"] = df["text"].apply(tokenize)


# spelling correction
def spell_eng(text):
    text = [TextBlob(str(w)).correct() for w in text]  # CHANGE: correct each word, not the whole list
    # concatenate each TextBlob's characters back into a plain str
    text = [functools.reduce(operator.add, w) for w in text]  # CHANGE
    return text

df['text3'] = df['text2'].apply(spell_eng)


# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]
df['text4'] = df['text3'].apply(stem_eng)
df['text4']
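
As a side note, the functools.reduce step can be skipped entirely: TextBlob.correct() returns a TextBlob, and wrapping it in str() gives the corrected word directly. A minimal alternative sketch (spell_eng_simple is a hypothetical name, assuming the same token lists as input):

def spell_eng_simple(words):
    # str() on the corrected TextBlob yields the plain corrected word
    return [str(TextBlob(w).correct()) for w in words]

df['text3'] = df['text2'].apply(spell_eng_simple)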

Hope this helps.