How to prevent splitting specific words or phrases and numbers in NLTK?
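I have a problem with text matching when I tokenize text, because specific words, dates and numbers get split. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK?
They should not result in: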
['runs','in','my','family','4x','a','day']
For example:
Yes 20-30 minutes a day on my bike, it works great!!
gives:
['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great']
I want "20-30 minutes" to be treated as a single word. How can I get this behavior?
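As far as I know, it is difficult to preserve n-grams of various lengths while tokenizing, but you can find these n-grams first, as the code below does with nltk.ngrams. You can then replace the n-grams in the corpus with a version joined by some character, such as a dash.
This is one example solution, and there are probably many other ways to do it. Important note: I provide a way to find the n-grams that are common in the text. You will probably want more than one, so I added a variable that controls how many n-grams to collect. You may want a different number for each n-gram size, but for now there is only one variable. This approach may miss n-grams that you consider important. To cover those, add them to user_grams, and they will be included in the search.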
import nltk
#an example corpus
corpus='''A big tantrum runs in my family 4x a day, every week.
A big tantrum is lame. A big tantrum causes strife. It runs in my family
because of our complicated history. Every week is a lot though. Every week
I dread the tantrum. Every week...Here is another ngram I like a lot'''.lower()
#tokenize the corpus
corpus_tokens = nltk.word_tokenize(corpus)
#create ngrams from n=2 to 5
bigrams = list(nltk.ngrams(corpus_tokens,2))
trigrams = list(nltk.ngrams(corpus_tokens,3))
fourgrams = list(nltk.ngrams(corpus_tokens,4))
fivegrams = list(nltk.ngrams(corpus_tokens,5))
This part finds the most common n-grams, up to five-grams.
#if you change this to zero you will only get the user chosen ngrams
n_most_common=1 #how many of the most common n-grams do you want.
fdist_bigrams = nltk.FreqDist(bigrams).most_common(n_most_common) #n most common bigrams
fdist_trigrams = nltk.FreqDist(trigrams).most_common(n_most_common) #n most common trigrams
fdist_fourgrams = nltk.FreqDist(fourgrams).most_common(n_most_common) #n most common four grams
fdist_fivegrams = nltk.FreqDist(fivegrams).most_common(n_most_common) #n most common five grams
#concat the ngrams together
fdist_bigrams=[' '.join(gram) for gram, _ in fdist_bigrams]
fdist_trigrams=[' '.join(gram) for gram, _ in fdist_trigrams]
fdist_fourgrams=[' '.join(gram) for gram, _ in fdist_fourgrams]
fdist_fivegrams=[' '.join(gram) for gram, _ in fdist_fivegrams]
#next 4 lines create a single list with important ngrams
n_grams=fdist_bigrams
n_grams.extend(fdist_trigrams)
n_grams.extend(fdist_fourgrams)
n_grams.extend(fdist_fivegrams)
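As noted above, you may want a different number of n-grams for each size. The following is only a sketch of that idea using the standard library instead of NLTK. The names top_ngrams and per_size are hypothetical, and a plain str.split stands in for the tokenizer.

```python
from collections import Counter

def top_ngrams(tokens, per_size):
    """Return the most common n-grams as space-joined strings.

    per_size maps n (the n-gram length) to how many n-grams of
    that length to keep. Hypothetical helper, not part of NLTK.
    """
    found = []
    for n, how_many in sorted(per_size.items()):
        # zipping n shifted copies of the token list yields every n-gram
        grams = zip(*(tokens[i:] for i in range(n)))
        counts = Counter(grams)
        found.extend(' '.join(gram) for gram, _ in counts.most_common(how_many))
    return found

# str.split is only a stand-in for a real tokenizer here
tokens = 'a big tantrum is lame and a big tantrum hurts but a big deal'.split()
common = top_ngrams(tokens, {2: 1, 3: 1})
print(common)
```

The returned list has the same shape as n_grams above (space-joined strings), so it could be used in the same replacement step.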
This part lets you add your own n-grams to the list.
#Another option here would be to make your own list of the ones you want
#in this example I add some user ngrams to the ones found above
user_grams=['ngram1 I like', 'ngram 2', 'another ngram I like a lot']
user_grams=[x.lower() for x in user_grams]
n_grams.extend(user_grams)
This last part does the processing so that you can tokenize again and get the n-grams as tokens.
#initialize the corpus that will have combined ngrams
corpus_ngrams=corpus
#here we go through the ngrams we found and replace them in the corpus with
#version connected with dashes. That way we can find them when we tokenize.
for gram in n_grams:
    gram_r=gram.replace(' ','-')
    corpus_ngrams=corpus_ngrams.replace(gram, gram_r)
#retokenize the new corpus so we can find the ngrams
corpus_ngrams_tokens= nltk.word_tokenize(corpus_ngrams)
print(corpus_ngrams_tokens)
Out: ['a-big-tantrum', 'runs-in-my-family', '4x', 'a', 'day', ',', 'every-week', '.', 'a-big-tantrum', 'is', 'lame', '.', 'a-big-tantrum', 'causes', 'strife', '.', 'it', 'runs-in-my-family', 'because', 'of', 'our', 'complicated', 'history', '.', 'every-week', 'is', 'a', 'lot', 'though', '.', 'every-week', 'i', 'dread', 'the', 'tantrum', '.', 'every-week', '...']
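One caveat with plain str.replace is that a short phrase can match inside a longer word or break up a longer phrase. A minimal sketch of a more careful replacement step follows, assuming a regex with word-boundary guards and longest-first ordering. The function name protect_phrases and the joiner parameter are hypothetical, and str.split again stands in for the tokenizer.

```python
import re

def protect_phrases(text, phrases, joiner='-'):
    """Join the words of each phrase so a tokenizer keeps them together."""
    # Longest phrases first, so a short phrase cannot break up a longer one
    for phrase in sorted(phrases, key=len, reverse=True):
        # Lookarounds ensure we match whole words only,
        # e.g. 'a day' must not match inside 'a daydream'
        pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
        text = re.sub(pattern, lambda m: m.group(0).replace(' ', joiner), text)
    return text

text = 'it runs in my family 4x a day, not a daydream.'
protected = protect_phrases(text, ['runs in my family', '4x a day', 'a day'])
print(protected)
print(protected.split())
```

With plain str.replace, 'a day' would also have turned 'a daydream' into 'a-daydream'.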
I think this is actually a good question.
You can use the MWETokenizer:
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([('20-30', 'minutes', 'a', 'day')])
tokenizer.tokenize(word_tokenize('Yes 20-30 minutes a day on my bike, it works great!!'))
[out]:
['Yes', '20-30_minutes_a_day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!', '!']
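To make it clearer what merging multi-word expressions on a token list involves, here is a pure-Python sketch of a greedy longest-match merge. It only illustrates the idea and is not NLTK's implementation. The function merge_phrases is hypothetical, and the token list is written out by hand rather than produced by word_tokenize.

```python
def merge_phrases(tokens, phrases, separator='_'):
    """Greedily merge the longest matching phrase at each position."""
    # Longer phrases first so they win over shorter ones; skip empty phrases
    ordered = sorted((tuple(p) for p in phrases if p), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in ordered:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append(separator.join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts here: keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged

# Hand-written tokens; note that '20-30' is a single token
tokens = ['Yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike']
result = merge_phrases(tokens, [('20-30', 'minutes', 'a', 'day'), ('a', 'day')])
print(result)
```

The phrase tuples have to match the tokens exactly as the tokenizer produces them, which is why it matters how the tokenizer splits the phrase you want to keep.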
A more principled approach, since you don't know in advance how `word_tokenize` will split the words you want to keep:
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
def multiword_tokenize(text, mwe):
    # Initialize the MWETokenizer
    protected_tuples = [word_tokenize(word) for word in mwe]
    protected_tuples_underscore = ['_'.join(word) for word in protected_tuples]
    tokenizer = MWETokenizer(protected_tuples)
    # Tokenize the text.
    tokenized_text = tokenizer.tokenize(word_tokenize(text))
    # Replace the underscored protected words with the original MWE
    for i, token in enumerate(tokenized_text):
        if token in protected_tuples_underscore:
            tokenized_text[i] = mwe[protected_tuples_underscore.index(token)]
    return tokenized_text
mwe = ['20-30 minutes a day', '!!']
print(multiword_tokenize('Yes 20-30 minutes a day on my bike, it works great!!', mwe))
[out]:
['Yes', '20-30 minutes a day', 'on', 'my', 'bike', ',', 'it', 'works', 'great', '!!']