How to replace bigrams in place using NLTK?
Suppose I have a list of tuples, top_n, holding the n most common bigrams in a text corpus:
import nltk
from nltk import bigrams
from nltk import FreqDist
bi_grams = bigrams(text) # text is a list of strings (tokens)
fdistBigram = FreqDist(bi_grams)
n = 300
top_n = [bigram for bigram, _count in fdistBigram.most_common(n)]
top_n
>>> [('let', 'us'),
 ('us', 'know'),
 ('as', 'possible')
 ....
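For readers without a corpus at hand, a list of the same shape can be built with the standard library alone. This is only a sketch: the toy token list and the use of collections.Counter in place of FreqDist are assumptions made for illustration, not part of the original setup.

```python
from collections import Counter

# Toy token list standing in for the real corpus (an assumption for illustration).
text = ['let', 'us', 'know', 'as', 'soon', 'as', 'possible',
        'please', 'let', 'us', 'know']

# Adjacent token pairs, the same pairs nltk.bigrams(text) would yield.
bi_grams = list(zip(text, text[1:]))

# Count the pairs and keep the n most frequent ones, dropping the counts.
n = 2
top_n = [bigram for bigram, _count in Counter(bi_grams).most_common(n)]
print(top_n)
```

Each element of top_n is a plain tuple of two strings, which is what makes the set lookup used later in this thread possible.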
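Now I would like to replace, in place, every run of words that forms a bigram in top_n with its concatenation. For example, say we have a new variable query, which is a list of strings: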
query = ['please','let','us','know','as','soon','as','possible']
would become
['please','letus', 'usknow', 'as', 'soon', 'aspossible']
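after the desired operation. More explicitly, I want to walk through every element of query and check whether the i-th and (i+1)-th elements form a pair in top_n; if they do, replace query[i] and query[i+1] with a single concatenated bigram, i.e. (query[i], query[i+1]) -> query[i] + query[i+1].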
Is there a way to do this with NLTK? If looping over each word in query is unavoidable, what is the best way to do it?
Given your code and query, the following will do the trick. Words are greedily replaced with their bigram whenever the pair is in top_n:
lookup = set(top_n)  # {('let', 'us'), ('as', 'soon')}
query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']
answer = []
q_iter = iter(range(len(query)))
for idx in q_iter:
    answer.append(query[idx])
    if idx < (len(query) - 1) and (query[idx], query[idx+1]) in lookup:
        answer[-1] += query[idx+1]
        next(q_iter)
        # if you don't want to skip over consumed
        # second bi-gram elements and keep
        # len(query) == len(answer), don't advance
        # the iterator here, which also means you
        # don't have to create the iterator in outer scope
print(answer)
Result (for example):
>> ['please', 'letus', 'know', 'assoon', 'as', 'possible']
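The loop above can also be wrapped in a reusable function. What follows is a minimal sketch rather than a definitive implementation: the name merge_bigrams and the overlap flag are hypothetical and not part of NLTK. With overlap=False it mirrors the greedy behaviour of the answer; with overlap=True a token may take part in two consecutive bigrams, which is one possible reading of the output requested in the question.

```python
def merge_bigrams(tokens, bigrams, overlap=False):
    """Join adjacent tokens whose pair appears in `bigrams`.

    merge_bigrams and overlap are illustrative names, not an NLTK API.
    """
    lookup = set(bigrams)  # set membership tests are O(1) on average
    merged = []
    consumed = False  # True if the current token was the tail of a joined pair
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in lookup:
            merged.append(pair[0] + pair[1])
            if overlap:
                consumed = True  # the tail may still start the next bigram
                i += 1
            else:
                i += 2  # greedy: skip the tail entirely
            continue
        if not consumed:
            merged.append(tokens[i])
        consumed = False
        i += 1
    return merged


query = ['please', 'let', 'us', 'know', 'as', 'soon', 'as', 'possible']

greedy = merge_bigrams(query, [('let', 'us'), ('as', 'soon')])
print(greedy)
# ['please', 'letus', 'know', 'assoon', 'as', 'possible']

overlapping = merge_bigrams(
    query, [('let', 'us'), ('us', 'know'), ('as', 'possible')], overlap=True
)
print(overlapping)
# ['please', 'letus', 'usknow', 'as', 'soon', 'aspossible']
```

The function returns a new list. If the replacement really has to happen in place, assigning query[:] = merge_bigrams(query, top_n) overwrites the contents of the existing list object.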
Alternative answer:
from gensim.models.phrases import Phrases, Phraser
# Phrases expects an iterable of tokenized sentences (each a list of tokens)
phrases = Phrases(text, min_count=1500, threshold=0.01)
bigram = Phraser(phrases)
bigram[query]
>>> ['please', 'let_us', 'know', 'as', 'soon', 'as', 'possible']
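This is not exactly the output asked for in the question, but it works as an alternative. The min_count and threshold arguments strongly influence the output. Thanks to this question here.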