How to read n-grams from a file and then match them with tokens
I want to read n-grams saved in a file, then match each word of those n-grams against the individual tokens in my corpus and, where they match, replace those tokens with the n-gram. Say I have these bigrams:
painful punishment
worldly life
straight path
Last Day
great reward
severe punishment
clear evidence
What I want to do is read the first bigram, split it, and compare its first word, "painful", with the tokens in my corpus. If it matches a token, move to the next token and compare it with the bigram's next word; if that is "punishment", replace the two tokens with a single token, "painful punishment". I don't know how to do this, and I want to turn this logic into code. I would be very grateful if anyone could help.
First of all, this is not really a Stack Overflow question (it sounds like a homework problem), and you can easily find various ways to do this via Google. But I'll give you a solution, since I need a warm-up:
# -*- coding: utf-8 -*-
import traceback, sys, re

'''
Open the bigrams file and load it into a list.
Assuming bigrams are cleaned (else, you can do this using the method below).
'''
try:
    with open('bigrams.txt') as bigrams_file:
        bigrams = bigrams_file.read().splitlines()
except Exception:
    print('BIGRAMS LOAD ERROR: ' + str(traceback.format_exc()))
    sys.exit(1)

test_input = 'There is clear good evidence a great reward is in store.'

'''
Clean-input method: lowercase, strip punctuation, collapse whitespace.
'''
def clean_input(text_input):
    text_input = text_input.lower()
    text_input = text_input.strip(' \t\n\r')
    alpha_num_underscore_only = re.compile(r'([^\s\w_])+', re.UNICODE)
    text_input = alpha_num_underscore_only.sub(' ', text_input)
    text_input = re.sub(' +', ' ', text_input)
    return text_input.strip()

test_input_words = test_input.split()
test_input_clean = clean_input(test_input)
test_input_clean_words = test_input_clean.split()

'''
Loop through the test_input bigram by bigram.
If we match one, increment the index to move past it onto the next bigram.
This is a quick implementation --- you can modify it for efficiency and for higher-order n-grams.
'''
output_text = []
skip_index = 0
for i in range(len(test_input_clean_words) - 1):
    if i >= skip_index:
        if ' '.join([test_input_clean_words[i], test_input_clean_words[i+1]]) in bigrams:
            print(test_input_clean_words[i], test_input_clean_words[i+1])
            skip_index = i + 2
            output_text.append('TOKEN_' + '_'.join([test_input_words[i], test_input_words[i+1]]).upper())
        else:
            skip_index = i + 1
            output_text.append(test_input_words[i])

# Append the final word only if it was not already consumed by a matching bigram.
if skip_index < len(test_input_clean_words):
    output_text.append(test_input_words[-1])

print(' '.join(output_text))
Input:
There is clear good evidence a great reward is in store.
Output:
There is clear good evidence a TOKEN_GREAT_REWARD is in store.
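As noted in the comment above, the loop can be extended to higher-order n-grams. One way is a sketch like the following, which is my own generalization rather than part of the answer's code: it loads the n-grams into a set for O(1) lookups and, at each position, tries the longest candidate first (the `max_n` parameter and the for/else structure are illustrative choices, not something the original requires):

```python
import re

def clean(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r'[^\s\w]+', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

def merge_ngrams(words, ngrams, max_n=3):
    """Replace the longest matching n-gram starting at each position
    with a single TOKEN_... item."""
    ngram_set = set(ngrams)                  # O(1) membership tests
    out, i = [], 0
    while i < len(words):
        for n in range(max_n, 1, -1):        # prefer longer matches first
            candidate = ' '.join(words[i:i+n])
            if candidate in ngram_set:
                out.append('TOKEN_' + candidate.replace(' ', '_').upper())
                i += n
                break
        else:                                # no n-gram matched at position i
            out.append(words[i])
            i += 1
    return out

words = clean('There is clear evidence a great reward is in store.').split()
print(' '.join(merge_ngrams(words, ['great reward', 'clear evidence'])))
# → there is TOKEN_CLEAR_EVIDENCE a TOKEN_GREAT_REWARD is in store
```

Unlike the bigram-only loop, this version works unchanged for trigrams and beyond; just add the longer phrases to the n-gram list and raise `max_n`.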