Tagging words in sentences using a user-defined dictionary
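I have a corpus of more than 100,000 sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.
Corpus file "testing.txt":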
Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
Dictionary file "dict.csv":
abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom
My Python program:
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs

with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("testing.txt", "rt")
    my2file = open("match.txt" ,"w")
    hay = myfile.read()
    myfile.close()

    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)
            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                my2file.writelines(str)
                print(str)

csvFile.close()
My current output is:
disorder 0.9333333333333333 anxiety
virus 0.9333333333333333 Malaria
I want the output to be:
Hello how are you doing. HIV [virus] is dangerous
Malaria [virus] can be cure.
he has anxiety [disorder] thats why he is behaving like that
You can iterate over the lines of testing.txt and replace the matched values. Something like this should work:
...
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                my2file.writelines(str)
                print(str)
                for line in hay.splitlines():
                    if max_sim_string in line:
                        print(line.replace(max_sim_string, f"{max_sim_string} [{row[1]}]"))
                        break
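If you want every sentence printed with its tags, another option is to go line by line instead of term by term. Below is a minimal sketch, not a drop-in replacement. It assumes the dictionary rows are laid out as id, term, label (as in the sample shown), that terms are single words, and that matching should ignore case so "HiV" can hit "HIV". The names load_terms, tag_line and threshold are made up for this example. One detail worth knowing: csv.reader keeps the space after each comma by default, so row[1] is " anxiety" rather than "anxiety", which is where the 0.9333 ratio in your output comes from. Passing skipinitialspace=True avoids that.

```python
import csv
import io
import re
from difflib import SequenceMatcher


def load_terms(csv_file):
    """Return {term: label} from rows shaped like 'id, term, label' (assumed layout)."""
    # skipinitialspace=True drops the blank that follows each comma.
    reader = csv.reader(csv_file, skipinitialspace=True)
    return {row[1].strip(): row[2].strip() for row in reader if len(row) >= 3}


def tag_line(line, terms, threshold=0.9):
    """Append ' [label]' after each word that matches a dictionary term."""
    # Lowercased lookup table so exact matches skip the fuzzy comparison.
    exact = {term.lower(): label for term, label in terms.items()}

    def tag(match):
        word = match.group(0)
        key = word.lower()
        if key in exact:
            return f"{word} [{exact[key]}]"
        # Fuzzy fallback for near misses such as small typos.
        for term, label in terms.items():
            if SequenceMatcher(None, key, term.lower()).ratio() >= threshold:
                return f"{word} [{label}]"
        return word

    # \w+ matches one word at a time; punctuation and spacing stay untouched.
    return re.sub(r"\w+", tag, line)


if __name__ == "__main__":
    sample_dict = io.StringIO(
        "abc, anxiety, disorder\n"
        "def, HIV, virus\n"
        "hij, Malaria, virus\n"
        "klm, headache, symptom\n"
    )
    terms = load_terms(sample_dict)
    for sentence in [
        "Hello how are you doing. HiV is dangerous",
        "Malaria can be cure",
        "he has anxiety thats why he is behaving like that.",
    ]:
        print(tag_line(sentence, terms))
```

This keeps the word as it was written in the corpus ("HiV [virus]"), so swap in the dictionary spelling if you need "HIV [virus]". For the real files, pass an opened file object to load_terms and loop over the lines of testing.txt. Multi-word terms would need the n-gram approach from the question, and the fuzzy fallback may be slow on 100,000 sentences with a large dictionary.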