Tagging words in sentences using a user-defined dictionary

Question

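I have a corpus of more than 100k sentences, and I have a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.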

Corpus file "testing.txt"

Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.

Dictionary file "dict.csv"

abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

# Read the whole corpus into memory once.
with open("testing.txt", "rt") as myfile:
    hay = myfile.read()

# Keep both files open while looping; csv.reader reads lazily from csvFile.
with open('dictionary.csv', 'r') as csvFile, open("match.txt", "w") as my2file:
    reader = csv.reader(csvFile)
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9  # minimum similarity a candidate must exceed
        max_sim_string = u""
        # Slide a window of roughly needle_length words over the corpus.
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)

            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                my2file.writelines(str)
                print(str)

My current output is

 disorder 0.9333333333333333 anxiety
 virus 0.9333333333333333 Malaria

The output I want is

 Hello how are you doing. HIV [virus] is dangerous
 Malaria [virus] can be cure.
 he has anxiety [disorder] thats why he is behaving like that

Answer 1

You can iterate over the lines of testing.txt and replace the matched values. Something like this should work:

...
if similarity > max_sim_val:
    max_sim_val = similarity
    max_sim_string = hay_ngram
    str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
    my2file.writelines(str)
    print(str)

    # Tag the first corpus line that contains the match with the category (third CSV column).
    for line in hay.splitlines():
        if max_sim_string in line:
            print(line.replace(max_sim_string, f"{max_sim_string} [{row[2].strip()}]"))
            break
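
For a corpus of this size, it may be simpler to go sentence by sentence instead of scanning n-grams of the whole file for every dictionary row. The following is only a sketch, not the asker's method: it assumes single-word terms in the second CSV column and the tag in the third, and it strips the spaces that follow the commas in dict.csv. The helper names `load_terms` and `tag_sentence` and the `cutoff` parameter are made up for this illustration. It tries an exact case-insensitive lookup first and falls back to `difflib.get_close_matches` for near misses.

```python
import csv
import difflib
import io
import string


def load_terms(csv_text):
    """Map each lowercased term (2nd column) to its tag (3rd column)."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return {row[1].strip().lower(): row[2].strip()
            for row in reader if len(row) >= 3}


def tag_sentence(sentence, terms, cutoff=0.9):
    """Append ' [tag]' after every token that matches a dictionary term."""
    candidates = list(terms)
    tagged = []
    for token in sentence.split():
        # Compare without surrounding punctuation and without case.
        word = token.strip(string.punctuation).lower()
        match = None
        if word in terms:
            match = word  # exact, case-insensitive hit
        elif word:
            # Fuzzy fallback: best candidate with similarity >= cutoff, if any.
            close = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
            match = close[0] if close else None
        tagged.append(f"{token} [{terms[match]}]" if match else token)
    return " ".join(tagged)


if __name__ == "__main__":
    # Sample data copied from the question; real code would read the files.
    dictionary = (
        "abc, anxiety, disorder\n"
        "def, HIV, virus\n"
        "hij, Malaria, virus\n"
        "klm, headache, symptom\n"
    )
    terms = load_terms(dictionary)
    corpus = [
        "Hello how are you doing. HiV is dangerous",
        "Malaria can be cure",
        "he has anxiety thats why he is behaving like that.",
    ]
    for sentence in corpus:
        print(tag_sentence(sentence, terms))
```

Note that this keeps the spelling found in the corpus (HiV stays HiV) rather than replacing it with the dictionary form, and it rejoins tokens with single spaces. Multi-word terms would need an n-gram pass like the one in the question.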

python

dictionary

named-entity-recognition