Why is my output returned in a strip format so that it cannot be lemmatized/stemmed in Python?
The first step is to tokenize the text in the dataframe with NLTK. Then I use TextBlob to apply spelling correction; for that I convert its output from a tuple to a string. After that I need to lemmatize/stem (with NLTK). The problem is that my output is returned in a strip format, so it cannot be lemmatized/stemmed.
#create a dataframe
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

#spelling correction
def spell_eng(text):
    text = TextBlob(str(text)).correct()
    #convert from tuple to str
    text = functools.reduce(operator.add, (text))
    return text

df['text3'] = df['text2'].apply(spell_eng)

#lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
Generated output:
Desired output:
text4
--------------
[spell]
[be]
[work,cook,listen]
[study]
I see where the problem is: the dataframe stores these arrays as strings, so the lemmatization does not work on them. Also note that this comes from the spell_eng part.
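To make the failure mode concrete, here is a minimal sketch (the example words are illustrative, and it assumes the NLTK WordNet data is available locally) of what happens when the lemmatization step receives one plain string instead of a list of tokens: iterating a Python string yields single characters, so the lemmatizer is applied letter by letter.

import nltk

lemmatizer = nltk.stem.WordNetLemmatizer()

# what spell_eng effectively returned: one string, not a list of words
corrected = "working cooking listening"

# iterating a string yields characters, so the lemmatizer sees single letters
print([lemmatizer.lemmatize(w, 'v') for w in corrected][:7])
# -> ['w', 'o', 'r', 'k', 'i', 'n', 'g']

# iterating a list of tokens yields whole words, which is what stem_eng expects
print([lemmatizer.lemmatize(w, 'v') for w in corrected.split()])
# -> ['work', 'cook', 'listen']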
I wrote a solution with slight modifications to your code.
import pandas as pd
import nltk
from textblob import TextBlob
import functools
import operator

df = pd.DataFrame({'text': ["spellling", "was", "working cooking listening", "studying"]})

#tokenization
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
def tokenize(text):
    return [w for w in w_tokenizer.tokenize(text)]

df["text2"] = df["text"].apply(tokenize)

# spelling correction
def spell_eng(text):
    text = [TextBlob(str(w)).correct() for w in text]           #CHANGE: correct each word separately
    #convert from tuple to str
    text = [functools.reduce(operator.add, (w)) for w in text]  #CHANGE: join each corrected word's characters back into a plain string
    return text

df['text3'] = df['text2'].apply(spell_eng)

# lemmatization/stemming
def stem_eng(text):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, 'v') for w in text]

df['text4'] = df['text3'].apply(stem_eng)
df['text4']
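As a side note (this is my own simplification, not part of the answer above): the functools.reduce(operator.add, ...) step only concatenates the characters of each corrected TextBlob back into a plain string, and the same conversion can be done with str(), assuming TextBlob's string conversion returns the corrected text:

from textblob import TextBlob

def spell_eng(words):
    # str() on the object returned by correct() gives a plain Python string,
    # so no functools/operator reduction is needed
    return [str(TextBlob(w).correct()) for w in words]

print(spell_eng(["spellling", "working"]))  # e.g. ['spelling', 'working']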
Hope this helps.