How to read n-grams from a file and then match them with tokens
I want to read n-grams saved in a file, then match each word of those n-grams against the individual tokens in my corpus and, where they match, replace those tokens with the n-gram. Say I have these bigrams:
painful punishment
worldly life
straight path
Last Day
great reward
severe punishment
clear evidence
What I want to do is read the first bigram, split it, and compare its first word, "painful", with the tokens in my corpus. If it matches a token, move to the next token and compare it with the bigram's next word; if that is "punishment", replace the two tokens with a single token, "painful punishment". I don't know how to do this, and I want to turn this logic into code. I would be very grateful if anyone could help.
First of all, this is not really a Stack Overflow question (it sounds like a homework problem), and you can easily find various ways to do this via Google. But I'll give you a solution, since I need a warm-up:
# -*- coding: utf-8 -*-
import traceback, sys, re

'''
Open the bigrams file and load it into a list.
Assuming bigrams are cleaned (else, you can do this using the method below).
'''
try:
    with open('bigrams.txt') as bigrams_file:
        bigrams = bigrams_file.read().splitlines()
except Exception:
    print('BIGRAMS LOAD ERROR: ' + str(traceback.format_exc()))
    sys.exit(1)

test_input = 'There is clear good evidence a great reward is in store.'

'''
Clean-input method: lowercase, strip punctuation, collapse whitespace.
'''
def clean_input(text_input):
    text_input = text_input.lower()
    text_input = text_input.strip(' \t\n\r')
    alpha_num_underscore_only = re.compile(r'([^\s\w_])+', re.UNICODE)
    text_input = alpha_num_underscore_only.sub(' ', text_input)
    text_input = re.sub(' +', ' ', text_input)
    return text_input.strip()

test_input_words = test_input.split()
test_input_clean = clean_input(test_input)
test_input_clean_words = test_input_clean.split()

'''
Loop through the test_input bigram by bigram.
If we match one, increment the index to move past it onto the next bigram.
This is a quick implementation --- you can modify it for efficiency and for higher-order n-grams.
'''
output_text = []
skip_index = 0
for i in range(len(test_input_clean_words) - 1):
    if i >= skip_index:
        if ' '.join([test_input_clean_words[i], test_input_clean_words[i+1]]) in bigrams:
            print(test_input_clean_words[i], test_input_clean_words[i+1])
            skip_index = i + 2
            output_text.append('TOKEN_' + '_'.join([test_input_words[i], test_input_words[i+1]]).upper())
        else:
            skip_index = i + 1
            output_text.append(test_input_words[i])

# Append the final word only if it was not already consumed by a matching bigram.
if skip_index < len(test_input_clean_words):
    output_text.append(test_input_words[-1])

print(' '.join(output_text))
Input:
There is clear good evidence a great reward is in store.
Output:
There is clear good evidence a TOKEN_GREAT_REWARD is in store.
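As noted in the comment above, the loop can be extended to higher-order n-grams. One way is a sketch like the following, which is my own generalization rather than part of the answer's code: it loads the n-grams into a set for O(1) lookups and, at each position, tries the longest candidate first (the `max_n` parameter and the for/else structure are illustrative choices, not something the original requires):

```python
import re

def clean(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r'[^\s\w]+', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

def merge_ngrams(words, ngrams, max_n=3):
    """Replace the longest matching n-gram starting at each position
    with a single TOKEN_... item."""
    ngram_set = set(ngrams)                  # O(1) membership tests
    out, i = [], 0
    while i < len(words):
        for n in range(max_n, 1, -1):        # prefer longer matches first
            candidate = ' '.join(words[i:i+n])
            if candidate in ngram_set:
                out.append('TOKEN_' + candidate.replace(' ', '_').upper())
                i += n
                break
        else:                                # no n-gram matched at position i
            out.append(words[i])
            i += 1
    return out

words = clean('There is clear evidence a great reward is in store.').split()
print(' '.join(merge_ngrams(words, ['great reward', 'clear evidence'])))
# → there is TOKEN_CLEAR_EVIDENCE a TOKEN_GREAT_REWARD is in store
```

Unlike the bigram-only loop, this version works unchanged for trigrams and beyond; just add the longer phrases to the n-gram list and raise `max_n`.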