Lemmatize tokenised column in pandas

I am trying to lemmatize a tokenised column, comments_tokenized.

I did:

import nltk
from nltk.stem import WordNetLemmatizer 

# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

df1['comments_lemmatized'] = df1["comments_tokenized"].apply(lemmatize_text)

but got:

TypeError: unhashable type: 'list'

What can I do to lemmatize a column that contains lists of words?

And how can I avoid the tokenization problem of splitting [don't] into [do, n't]?

Your function is almost there! Since you are using apply on the Series, you don't need to refer to the column explicitly inside the function — apply already passes each cell (a list of tokens) in as text. You also never use the input text inside the function at all, so each row ends up iterating over the whole column, and lemmatize receives a list instead of a string (hence the unhashable type error). So change

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in df1["comments_tokenized"]]

to

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]  # notice the use of text

An example:

df = pd.DataFrame({'A':[["cats","cacti","geese","rocks"]]})
                             A
0  [cats, cacti, geese, rocks]

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w) for w in text]

df['A'].apply(lemmatize_text)

0    [cat, cactus, goose, rock]
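On your second question: nltk.word_tokenize follows the Penn Treebank convention, which deliberately splits contractions ("don't" → "do", "n't"). If you want contractions kept whole, one option is a regex-based tokenizer; a minimal sketch with a hypothetical tokenize helper (nltk's RegexpTokenizer with a similar pattern, or TweetTokenizer, achieve the same effect):

```python
import re

def tokenize(text):
    # Match runs of letters, optionally joined by apostrophes, so
    # contractions like "don't" stay as a single token rather than
    # being split into "do" and "n't".
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text.lower())

tokens = tokenize("Don't split contractions, please")
# tokens == ["don't", "split", "contractions", "please"]
```

You can then feed the resulting lists to lemmatize_text exactly as above. Note that WordNetLemmatizer will leave "don't" unchanged, since it is not a WordNet lemma.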