Python: How to speed up lemmatisation if I check the POS for each word?
I'm new to NLP and want to lemmatise my text. As I understand it, WordNetLemmatizer needs to know the part of speech (noun, verb, etc.) of each word passed to it.
So I tried the code below, but it is very slow. All my text is stored in a column called 'Text' in a DataFrame df. I apply the pre_process(text) function by looping over each row (Option 1), but it is slow.
I also tried apply (Option 2), but it is just as slow.
Is there any way to speed this up? Thanks!
from nltk import WordNetLemmatizer, pos_tag
from nltk.corpus import wordnet
import pandas as pd

def pre_process(text):
    words_only = text.lower().split()
    lem = WordNetLemmatizer()
    words_only1 = []
    for j in range(0, len(words_only)):
        pos_label = (pos_tag(words_only)[j][1][0]).lower()
        word = words_only[j]
        if pos_label == 'j':
            pos_label = 'a'  # 'j' <--> 'a' reassignment
        if pos_label in ['r']:  # For adverbs it's a bit different
            try:
                word = wordnet.synset(word + '.r.1').lemmas()[0].pertainyms()[0].name()  # Could have errors for words like 'not'
            except Exception:
                word = lem.lemmatize(word)
        elif pos_label in ['a', 's', 'v']:  # For adjectives and verbs
            word = lem.lemmatize(word, pos=pos_label)
        else:  # For nouns and everything else as it is the default kwarg
            word = lem.lemmatize(word)
        words_only1.append(word)
    return " ".join(words_only1)

df = pd.read_excel('C:/Users/Desktop/TEST.xlsx',
                   sheet_name='Text',
                   engine='openpyxl')
**Option 1**
num_text = df.shape[0]
clean_text = []
for i in range(0, num_text):
    clean_text.append(pre_process(df['Text'].iloc[i]))
**Option 2**
df['Processed Text'] = df['Text'].apply(pre_process)
clean_text = df['Processed Text'].tolist()
From a quick look at your approach, I'd suggest calling pos_tag outside the for loop. Otherwise you call it once for every word, which is likely slow. Depending on the complexity of pos_tag, this alone may already speed things up noticeably.
Note: I'd also recommend tqdm. It gives you a nice progress bar and lets you estimate how long the run will take.
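As a side note on tqdm: it also hooks directly into pandas, so the per-row apply in Option 2 can show a progress bar as well. A minimal sketch, using a toy DataFrame of my own (str.lower stands in for the real pre_process):

```python
import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on Series/DataFrame

df = pd.DataFrame({"Text": ["The Cats Are Running", "Dogs Bark Loudly"]})
# Identical to .apply, but renders a progress bar with an ETA.
df["Processed Text"] = df["Text"].progress_apply(str.lower)
```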
from nltk import WordNetLemmatizer, pos_tag
from nltk.corpus import wordnet
from tqdm import tqdm

def pre_process(text):
    words_only = text.lower().split()
    lem = WordNetLemmatizer()
    words_only1 = []
    pos_tags = pos_tag(words_only)  # tag the whole text once, not once per word
    for word, word_pos_tag in tqdm(zip(words_only, pos_tags), total=len(words_only)):
        pos_label = word_pos_tag[1][0].lower()
        if pos_label == 'j':
            pos_label = 'a'  # 'j' <--> 'a' reassignment
        if pos_label in ['r']:  # For adverbs it's a bit different
            try:
                word = wordnet.synset(word + '.r.1').lemmas()[0].pertainyms()[0].name()  # Could have errors for words like 'not'
            except Exception:
                word = lem.lemmatize(word)
        elif pos_label in ['a', 's', 'v']:  # For adjectives and verbs
            word = lem.lemmatize(word, pos=pos_label)
        else:  # For nouns and everything else as it is the default kwarg
            word = lem.lemmatize(word)
        words_only1.append(word)
    return " ".join(words_only1)
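Two further cheap wins you could try: create the WordNetLemmatizer once rather than on every pre_process call, and memoise repeated (word, POS) pairs with functools.lru_cache, since natural text repeats words heavily. A minimal sketch of the caching pattern; to keep it self-contained I use a crude stand-in for the lemmatizer, and in real code the cached function body would be lem.lemmatize(word, pos=pos):

```python
from functools import lru_cache

calls = 0  # counts how often the "expensive" call actually runs

@lru_cache(maxsize=None)
def lemmatize_cached(word, pos):
    # Stand-in for WordNetLemmatizer().lemmatize(word, pos=pos);
    # the cache means each distinct (word, pos) pair is computed once.
    global calls
    calls += 1
    return word.rstrip('s') if pos == 'n' else word

words = ["cats", "dogs", "cats", "run", "cats"]
lemmas = [lemmatize_cached(w, 'n') for w in words]
# repeated "cats" entries are served from the cache,
# so only the 3 distinct words trigger a computation
```

The same decorator works directly on a helper that wraps the real lemmatizer call, because (word, pos) tuples are hashable.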