NLTK-based stemming and lemmatization
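I am trying to preprocess a string with a lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I don't get any errors, but the text is not preprocessed properly: only the stop words are removed, the lemmatization has no effect, and the punctuation and digits remain.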
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
The expected output should look like this:
This beautiful day I work exercise text
I think this is what you're looking for, but as the commenter noted, do this before calling the lemmatizer.
>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> print(s)
This is a beautiful day I am working on an exercise text
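A side note on why the punctuation and digits survived in the question's code: `str.translate` returns a new string rather than modifying `corpus` in place, and the question discards that return value. A minimal sketch of the same table-based approach with the result assigned is below. Deleting the characters (instead of mapping them to spaces) and collapsing whitespace afterwards are my own choices here, not something the question specifies.

```python
import string

corpus = "This beautiful day16~ . I ; working exercise45.^^^45 text34 ."

# Build one table that deletes every punctuation mark and digit
# (the third argument of str.maketrans lists characters to remove).
table = str.maketrans('', '', string.punctuation + string.digits)

# str.translate returns a NEW string, so the result has to be assigned.
cleaned = corpus.translate(table)

# Collapse the runs of spaces left behind by the deleted characters.
cleaned = ' '.join(cleaned.split())
print(cleaned)
```

This only handles the punctuation and digits; the lemmatization problem is separate.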
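No, your current approach won't work, because you have to pass the lemmatizer/stemmer one word at a time; otherwise those functions have no way of knowing to interpret your string as a sentence (they expect single words).
import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # Strip punctuation and digits, then lemmatize each remaining word as a verb
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])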
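Alternatively, you can use PorterStemmer, which serves a similar purpose to lemmatization but works without context: it strips suffixes by rule, so the result is not always a dictionary word.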
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
Then call the stemmer like this:
stemmer.stem(i)
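To make the "one word at a time" point concrete for the stemmer, here is a minimal sketch that cleans a string and then stems each word separately. The helper name `stem_text` and the sample sentence are made up for illustration, and it assumes NLTK is installed (PorterStemmer itself needs no corpus download).

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Hypothetical helper: strip non-letters, then stem word by word."""
    # Keep letters and spaces only, as in the regex answer above.
    letters_only = re.sub(r'[^A-Za-z ]', '', text)
    # The stemmer expects a single word, so call it once per token.
    return [stemmer.stem(word) for word in letters_only.split()]

stems = stem_text("I am; working on an exercise45.")
print(stems)
```

Note that the stems are not always real words ("exercise" comes back truncated, for instance), which is the trade-off compared with lemmatization.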
To process tweets properly, you can use the following code:
import re
import string
import nltk

def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd = []
    # Remove URLs before touching the punctuation
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1 = text.lower()
    text1 = re.sub(pattern, "", text1)
    text1 = text1.replace("'s ", " ")
    text1 = text1.replace("'", "")
    text1 = text1.replace("—", " ")
    # Replace every punctuation character with a space
    table = str.maketrans(string.punctuation, len(string.punctuation) * " ")
    text1 = text1.translate(table)
    geek = nltk.word_tokenize(text1)
    abc = nltk.pos_tag(geek)
    output = []
    # Map the tagger's tags to WordNet POS codes; anything else becomes a noun
    for value in abc:
        value = list(value)
        if value[1][0] == "N":
            value[1] = 'n'
        elif value[1][0] == "V":
            value[1] = 'v'
        elif value[1][0] == "J":
            value[1] = 'a'
        elif value[1][0] == "R":
            value[1] = 'r'
        else:
            value[1] = 'n'
        output.append(value)
    abc = output
    # Lemmatize each token with its mapped part of speech
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0], pos=value[1]))
    return bcd
Here I used pos_tag (only the N, V, J, and R tags are mapped; every other tag is treated as a noun). This returns a list of tokenized, lemmatized words.
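The tag mapping described here can also be written as a small lookup. The sketch below is one way to do it, not part of the original answer: the name `penn_to_wordnet` is hypothetical, and the tagged pairs are hand-written stand-ins for what `nltk.pos_tag` returns, so the snippet runs without NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a', 'r').

    Only the first letter is inspected (N, V, J, R); every other tag
    falls back to 'n', matching the rule described above.
    """
    first_letter_to_pos = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
    return first_letter_to_pos.get(tag[:1], 'n')

# Hand-written (word, tag) pairs standing in for pos_tag output.
tagged = [('dogs', 'NNS'), ('were', 'VBD'), ('very', 'RB'),
          ('happy', 'JJ'), ('the', 'DT')]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting pair can then be passed to `lemmatizer.lemmatize(word, pos=pos)`, as in the loop at the end of `process`.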