使用词典在句子中标记单词

Tagging words in sentences using dictionares

我有超过 10 万个句子的语料库,而且我有字典。我想匹配语料库中的单词并在句子中标记它们

语料库文件"sentences.txt"

Hello how are you doing. Headache is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
she is doing well
he has psychological problems

词典文件"dict.csv"

abc, anxiety, disorder
def, Headache, symptom
hij, Malaria, virus
klm, headache, symptom

我的python程序

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("sentences.txt", "rt")
    my3file = open("tagged_sentences.txt", "w")
    hay = myfile.read()
    myfile.close()

for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            for line in hay.splitlines():
                if max_sim_string in line:
                    tag_sent = line.replace(max_sim_string, str1.__str__())
                    my3file.writelines(tag_sent + '\n')
                    print(tag_sent)
            break

csvFile.close()

我现在的输出是

 he has ('anxiety', ' disorder') thats why he is behaving like that.
 ('Malaria', ' virus') can be cure
 Hello how are you doing. ('Headache', ' symptom') is dangerous

我希望我的输出为。我希望它在同一文件中标记句子中的单词 "sentences.txt" 或将其写入新文件“myfile3.txt。不打乱句子的顺序或完全忽略(不添加)它

 Hello how are you doing. ('Headache', 'symptom') is dangerous
 ('Malaria', ' virus') can be cure.
 he has ('anxiety', ' disorder') thats why he is behaving like that
 she is doing well
 he has psychological problems

如果您希望按照 句子 输入的顺序输出,则需要根据该顺序构建输出。相反,您将程序设计为按 字典 的顺序报告结果。您需要切换内循环和外循环。

将dict文件读入一个内部数据结构,这样你就不用一直重置和重新读取文件了。

然后阅读句子文件,一次一行。寻找要标记的词(你已经做得很好了)。边做边做替换,写出修改后的句子。

无需对您的代码做太多更改,这应该可以正常工作:

...
phrases = []
for row in reader:
    needle = row[1]
    needle_length = len(needle.split())
    max_sim_val = 0.9
    max_sim_string = u""
    for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
        hay_ngram = u" ".join(ngram)

        similarity = SM(None, hay_ngram, needle).ratio()
        if similarity > max_sim_val:
            max_sim_val = similarity
            max_sim_string = hay_ngram
            str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
            str1 = max_sim_string , row[2]
            phrases.append((max_sim_string, row[2]))

for line in hay.splitlines():
    if any(max_sim_string in line for max_sim_string, _ in phrases):
        for phrase in phrases:
            max_sim_string, _ = phrase
            if max_sim_string in line:
                tag_sent = line.replace(max_sim_string, phrase.__str__())
                my3file.writelines(tag_sent + '\n')
                print(tag_sent)
                break        
    else:
        my3file.writelines(line + '\n')

csvFile.close()