How to do lemmatization using NLTK or pywsd

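I know my explanation is long, but I felt it was necessary. Hopefully someone is patient and helpful :) I am working on a sentiment analysis project at the moment and I am stuck in the preprocessing part. I imported the csv file, turned it into a dataframe, and converted the variables/columns to the correct data types. Then I did the tokenization like this, selecting the variable I want to tokenize (the tweet content) in the dataframe (df_tweet1):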

# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)



# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)



1) Convert the previous output into a list of strings:

new_test = [' '.join(x) for x in clean_sents]


from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Do POS tagging and lemmatization separately. First, the POS tagging, using clean_sents as input:

# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list
    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of words with their tags attached. Next I want to lemmatize this output, but how? I tried two modules, and both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]


from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]



AttributeError: 'tuple' object has no attribute 'endswith'

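In the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatized strings. So: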

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]
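
To make the list-versus-string point concrete without needing pywsd installed, here is a minimal sketch. Note that toy_process_sentence is a hypothetical stand-in and not pywsd's lemmatize_sentence: it only splits a string into words with a regex, but like many text functions it needs exactly one string, so it fails in a similar way when it is handed the whole list.

```python
import re

def toy_process_sentence(sentence):
    """Hypothetical stand-in for a function that expects ONE string.

    It only splits the text into words; it does not lemmatize anything.
    """
    return re.findall(r"\w+", sentence)

new_test = ["cats are running", "dogs barked"]

# Passing the whole list fails, because the regex engine needs a string.
try:
    toy_process_sentence(new_test)
    failed = False
except TypeError:
    failed = True

# Passing one sentence at a time works and gives one result per tweet.
per_sentence = [toy_process_sentence(s) for s in new_test]

print(failed)
print(per_sentence)
```

The list comprehension has the same shape as the fix above: one call per sentence, collected into a new list.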



import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        #read the stopwords file 
        stopwords = sw.read().split('\n')
        return [word for word in lst if not word in stopwords]

def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):
    """Function to lemmatize a string or a list of strings, i.e. reduce each word to its base form. Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    -- remove_stopwords_: whether to drop words listed in stopwords.txt
    """

    if isinstance(body_text, str):
        body_text = [body_text] #Convert whatever passed to a list to support passing of single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language) #load lemmatizing dictionary

    lemma_list = [] #list to store each lemmatized string 

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #All characters and digits i.e. all possible words

    for string in body_text:
        #remove punctuation and split words
        matches = word_regex.findall(string)

        #split words and lowercase them unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        #remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        #lemmatize each word and choose the shortest word of suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        #remove stopwords again, since lemmatization can turn a word into a stopword
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return list if list was passed, else return string




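Regarding your code, it can be optimized (for example, you can do stopword removal and tokenization at the same time), and I find the steps you perform a bit confusing. For example, you lemmatize several times, using different libraries, which does not make sense. In my opinion nltk works fine; personally I use other libraries to preprocess tweets only to deal with emojis, urls and hashtags, all the things specifically related to tweets.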

# I won't write all the imports, you get them from your code
# define new column to store the processed tweets
# (object dtype, so that each cell can hold a list)
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: 'This is the initial tweet'
    tweet = row['Tweet Content']

    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    # (.at sets a single cell, so the list is stored as one value)
    df_tweet1.at[ind, 'Tweet Content Clean'] = tweet

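Overall, you only need 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, as long as you pay attention to the output of each step (e.g. tokenization returns a list of strings, pos tagging returns a list of tuples, which is why you ran into trouble).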
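
If you do want to lemmatize after POS tagging, the tuples have to be unpacked first, because a lemmatizer wants a word and not a (word, tag) pair. Below is a minimal sketch of one way to do that. The helper names penn_to_wordnet and lemmatize_tagged are my own and not part of NLTK, and I am assuming the tags are Penn Treebank tags as returned by nltk.pos_tag. The lemmatizer is passed in as a function, so the sketch itself runs without NLTK or its data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    'a' = adjective, 'v' = verb, 'r' = adverb, 'n' = noun.
    Anything that is not recognised falls back to noun.
    """
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('R'):
        return 'r'
    return 'n'

def lemmatize_tagged(tagged_sents, lemmatize):
    """Lemmatize a list of sentences, each a list of (word, tag) tuples.

    `lemmatize` is any callable taking (word, pos) and returning a string.
    """
    return [[lemmatize(word, penn_to_wordnet(tag)) for word, tag in sent]
            for sent in tagged_sents]
```

With NLTK and the WordNet data installed, this could be called as lemmatize_tagged(output_POS_clean_sents, WordNetLemmatizer().lemmatize), since WordNetLemmatizer.lemmatize accepts a word and an optional pos argument. Passing the POS usually matters most for verbs, which are otherwise treated as nouns.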


# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
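
As for the tweet-specific things mentioned earlier (urls, hashtags and so on), here is a rough sketch of how they could be stripped before tokenizing, using only the standard library. The regular expressions and the clean_tweet name are assumptions of mine for illustration, not what any particular tweet library does, and they will not cover every edge case (emojis are left untouched).

```python
import re

# Simplified patterns, assumed for illustration only
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")

def clean_tweet(text):
    """Remove urls and @mentions, keep hashtag words without the '#'."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    # collapse the whitespace left behind by the removals
    return " ".join(text.split())

print(clean_tweet("Loving the new #NLTK release! https://example.com via @someone"))
```

For @mentions specifically, NLTK's TweetTokenizer also has a strip_handles=True option, which may be enough on its own.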