Lemmatize tokenised column in pandas
I am trying to lemmatize a tokenised column comments_tokenized.

I do:
import nltk
from nltk.stem import WordNetLemmatizer

# Init the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)
but I get:

TypeError: unhashable type: 'list'

What can I do to lemmatize a column that holds bags of words?

And how can I avoid the tokenisation problem of [don't] being split into [do, n't]?
Your function is nearly there! Since you are using apply on the Series, you don't need to refer to the column explicitly inside the function. You are also not using the input text in the function at all. So change
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]
to
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  # Notice the use of text
An example:
df = pd.DataFrame({'A': [["cats", "cacti", "geese", "rocks"]]})

                             A
0  [cats, cacti, geese, rocks]
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

df['A'].apply(lemmatize_text)

0    [cat, cactus, goose, rock]
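The root cause of the TypeError is that apply passes each cell of the Series (here, a list of tokens) to the function one at a time, while the original function ignored that argument and iterated over the whole column instead. A minimal sketch of the pattern, using a tiny stand-in lookup table (hypothetical, just to illustrate the apply mechanics without needing the WordNet corpus downloaded):

```python
import pandas as pd

# Stand-in for WordNetLemmatizer.lemmatize: a hypothetical lookup table,
# used here only so the apply pattern can be shown without NLTK data.
LEMMAS = {"cats": "cat", "cacti": "cactus", "geese": "goose", "rocks": "rock"}

def lemmatize_text(text):
    # `text` is one cell of the Series: a list of tokens
    return [LEMMAS.get(w, w) for w in text]

df = pd.DataFrame({"A": [["cats", "cacti", "geese", "rocks"]]})
df["A_lemmatized"] = df["A"].apply(lemmatize_text)
print(df["A_lemmatized"][0])  # ['cat', 'cactus', 'goose', 'rock']
```

With the real WordNetLemmatizer the function body is the same; only the per-token call changes.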