Python: How to speed up lemmatisation if I check the POS for each word?

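I am new to NLP and want to lemmatise my text. I understand that WordNetLemmatizer needs to know the type of the word passed in (noun, verb, etc.).

So I tried the code below, but it is very slow. Basically, all my text is stored in a column called "Text" in df. I call the pre_process(text) function by looping over each row (Option 1), but it is slow.

I also tried apply (Option 2), but it is just as slow. Is there any way to speed this up? Thanks!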

from nltk import WordNetLemmatizer, pos_tag
from nltk.corpus import wordnet  # needed for wordnet.synset below
import pandas as pd

def pre_process(text):
 
    words_only = text.lower().split()

    lem = WordNetLemmatizer()
    words_only1=[]
    for j in range(0, len(words_only)):
        
        pos_label = (pos_tag(words_only)[j][1][0]).lower()
        word=words_only[j]
        
        if pos_label == 'j': pos_label = 'a'    # 'j' <--> 'a' reassignment
        
        if pos_label in ['r']:  # For adverbs it's a bit different
            try:
                word=wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name() # Could have errors for words like 'not'
            except:
                word=lem.lemmatize(word)

        elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
            word=lem.lemmatize(word, pos=pos_label)

        else:   # For nouns and everything else as it is the default kwarg
            word=lem.lemmatize(word)
        
        words_only1.append(word)
    
    words_only=words_only1
    return( " ".join(words_only)) 


df=pd.read_excel( 'C:/Users/Desktop/TEST.xlsx', 
                   sheet_name='Text', 
                   engine='openpyxl')

**Option 1**
num_text = df.shape[0]
clean_text= []
for i in range(0, num_text):
    clean_text.append(pre_process(df['Text'].iloc[i]))


**Option 2**
df['Processed Text']=df['Text'].apply(pre_process)
clean_text= df['Processed Text'].tolist()

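From a quick look at your approach, I suggest calling pos_tag outside the for loop. Otherwise you tag the whole text again for every single word, which can be slow. Depending on the complexity of pos_tag, this change alone may already speed things up somewhat.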
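To make the cost difference concrete, here is a minimal sketch that counts how many tokens the tagger has to process in each variant. `fake_pos_tag` is a hypothetical stand-in for `nltk.pos_tag` (it labels everything "NN"), used only so the example runs without NLTK data; the real tagger does far more work per token.

```python
# Hypothetical stand-in for nltk.pos_tag: tags every token as a noun and
# records how many tokens it was asked to process.
tokens_processed = 0

def fake_pos_tag(tokens):
    global tokens_processed
    tokens_processed += len(tokens)
    return [(tok, "NN") for tok in tokens]

def tags_inside_loop(tokens):
    # Re-tags the whole text once per word, as in the question.
    return [fake_pos_tag(tokens)[j][1] for j in range(len(tokens))]

def tags_outside_loop(tokens):
    # Tags the text once and reuses the result.
    tagged = fake_pos_tag(tokens)
    return [tag for _, tag in tagged]

tokens = "the quick brown fox jumps over the lazy dog".split()

tokens_processed = 0
tags_inside_loop(tokens)
cost_inside = tokens_processed    # 9 calls, 9 tokens each

tokens_processed = 0
tags_outside_loop(tokens)
cost_outside = tokens_processed   # 1 call, 9 tokens

print(cost_inside, cost_outside)  # 81 9
```

For a text of n words the in-loop version tags n * n tokens in total, while the hoisted version tags n, so the gap widens as texts get longer.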

Note: I suggest using tqdm. It gives you a nice progress bar and lets you estimate how long the run will take.

from nltk import WordNetLemmatizer, pos_tag
from nltk.corpus import wordnet
from tqdm import tqdm

def pre_process(text):
    words_only = text.lower().split()

    lem = WordNetLemmatizer()
    words_only1=[]
    pos_tags = pos_tag(words_only)
    for word, word_pos_tag in tqdm(zip(words_only, pos_tags), total=len(words_only)):
        pos_label = word_pos_tag[1][0].lower()
        if pos_label == 'j': 
            pos_label = 'a'    # 'j' <--> 'a' reassignment
        
        if pos_label in ['r']:  # For adverbs it's a bit different
            try:
                word=wordnet.synset(word+'.r.1').lemmas()[0].pertainyms()[0].name() # Could have errors for words like 'not'
            except:
                word=lem.lemmatize(word)

        elif pos_label in ['a', 's', 'v']: # For adjectives and verbs
            word=lem.lemmatize(word, pos=pos_label)

        else:   # For nouns and everything else as it is the default kwarg
            word=lem.lemmatize(word)
        
        words_only1.append(word)
    
    return(" ".join(words_only1))
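If the same words occur many times across rows, memoising the lemmatiser may help further. The following is only a sketch under that assumption: `make_cached_lemmatizer` and `toy_lemmatize` are hypothetical names introduced here, and the toy function stands in for the real lemmatiser so the example runs without NLTK data.

```python
from functools import lru_cache

def make_cached_lemmatizer(lemmatize, maxsize=100_000):
    """Return a memoised version of a lemmatize(word, pos) callable."""
    @lru_cache(maxsize=maxsize)
    def cached(word, pos="n"):
        return lemmatize(word, pos)
    return cached

# Hypothetical toy lemmatizer: strips a trailing "s" from nouns and
# records every call that actually reaches it.
real_calls = []

def toy_lemmatize(word, pos):
    real_calls.append((word, pos))
    if pos == "n" and word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word

cached_lemmatize = make_cached_lemmatizer(toy_lemmatize)

tagged = [("dogs", "n"), ("chase", "v"), ("cats", "n"), ("dogs", "n")]
lemmas = [cached_lemmatize(word, pos) for word, pos in tagged]

print(lemmas)           # ['dog', 'chase', 'cat', 'dog']
print(len(real_calls))  # 3, the repeated ("dogs", "n") came from the cache
```

With NLTK, the same wrapper could be given `WordNetLemmatizer().lemmatize` in place of `toy_lemmatize`. Whether this pays off depends on how often words repeat in your data, so it is worth timing on a sample first.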