How to get the best merger from symspellpy word segmentation of many languages in Python?

The following code uses SymSpell in Python; see the symspellpy guide on word_segmentation.

It uses the "de-100k.txt" and "en-80k.txt" frequency dictionaries from the github repo, which you need to save in your working directory. As long as you don't want to apply any SymSpell logic yourself, you do not need to install and run this script to answer the question; the segmentation output for the two languages is all you need.
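Note that pkg_resources.resource_filename in the code below resolves to the installed symspellpy package folder, so that is where it looks for the files. If you keep them in your working directory instead, a plain path works as well (a minimal variant, not part of the original snippet):

from symspellpy.symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
# assumes de-100k.txt was saved to the current working directory
sym_spell.load_dictionary("de-100k.txt", term_index=0, count_index=1)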

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# German:
# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# English:
# Reset the sym_spell object
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Output:

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884

The aim is to find the most relevant words by some logic: most frequent ngram neighbours and/or word frequency, longest words, and so on. The logic is yours to choose.

In this two-language example, the two outputs need to be compared so that only the best segments are kept and the rest is discarded, without cutting into partial words. In the result, every letter is used exactly once.

If there are spaces between words in input_term, those words must not be joined into a new segment. For example, if 'cr eme' contains a wrong space, it is still not allowed to become 'creme': a space is far more likely to be correct than an error involving the adjacent letters. The expected result would look something like this:

array('sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme')
array(['DE'], ['DE'], ['EN'], ['EN'], ['DE'], ['DE', 'EN'], ['EN'], ['DE', 'EN'])

The 'DE'/'EN' tag is just an optional idea to indicate that a word exists in both German and English; in this example you could also pick 'EN' instead of 'DE'. Language tags are a bonus, and an answer without them is fine as well.

There may be a fast solution using numpy arrays and/or dictionaries instead of lists or DataFrames, but feel free to choose.

How can word segmentation in symspell use multiple languages and combine them into one chosen merger? The goal is a sentence of words built from all letters, each letter used exactly once, keeping all original spaces.

The SymSpell way

This is the recommended approach; I found it only after doing it manually. You can easily apply the same frequency logic used for one language to two languages: just load two or more languages into the same sym_spell object!

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# Set max_dictionary_edit_distance to 0 to avoid spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "de-100k.txt"
)

# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

# DO NOT reset the sym_spell object at this line so that
# English is added to the German frequency dictionary
# NOT: #reset the sym_spell object
# NOT: #sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "en-80k.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
result = sym_spell.word_segmentation(input_term)
print(f"{result.corrected_string}, {result.distance_sum}, {result.log_prob_sum}")

Output (first line: German only, second line: German and English combined):

sonnen empfindlichkeit s uno i l farb palette sun creme, 8, -61.741842760725255
sonnen empfindlichkeit sun oil farb palette sun creme, 6, -45.923471400632884
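The combined dictionary answers the segmentation question, but it no longer tells you which language a segment came from, and it does not by itself guarantee that an original space survives (the question forbids merging 'cr eme' into 'creme'). Here is a sketch that covers both points, not part of the original answer: it reuses the combined sym_spell object from the snippet above, segments each pre-existing chunk separately, and tags every token by membership in two single-language dictionaries, exposed through symspellpy's words property.

import pkg_resources
from symspellpy.symspellpy import SymSpell

input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme"

# one single-language object per language, loaded as before
de_only = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
de_only.load_dictionary(
    pkg_resources.resource_filename("symspellpy", "de-100k.txt"),
    term_index=0, count_index=1)
en_only = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
en_only.load_dictionary(
    pkg_resources.resource_filename("symspellpy", "en-80k.txt"),
    term_index=0, count_index=1)

# segment each space-separated chunk on its own, so an original
# space can never be merged away
tokens = []
for chunk in input_term.split():
    tokens.extend(sym_spell.word_segmentation(chunk).corrected_string.split())

# tag each token by membership in the per-language dictionaries
tags = [tuple(tag for tag, words in (("DE", de_only.words), ("EN", en_only.words))
              if token in words)
        for token in tokens]
print(tokens)
print(tags)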

The manual way

In this manual approach the logic is: the longer of the two languages' words wins, and the winner's language tag is recorded. If both words have the same length, both languages are recorded.

In the question, input_term = "sonnenempfindlichkeitsunoil farbpalettesuncreme" was segmented per language in SymSpell with a reset object, which leads to German s1 and English s2:

import numpy as np

s1 = 'sonnen empfindlichkeit s uno i l farb palette sun creme'
s2 = 'son ne ne mp find li ch k e it sun oil far b palette sun creme'

num_letters = len(s1.replace(' ',''))
list_w1 = s1.split()
list_w2 = s2.split()
list_w1_len = [len(x) for x in list_w1]
list_w2_len = [len(x) for x in list_w2]

# one tuple per word:
# (word, length, word_idx, lang, letter_idx_with_spaces, letter_idx_no_spaces)
lst_de = [
    (word, length, idx, 'de', pos_sp, pos)
    for word, length, idx, pos_sp, pos in zip(
        list_w1,
        list_w1_len,
        range(len(list_w1)),
        np.cumsum([0] + [len(x) + 1 for x in list_w1][:-1]),  # start incl. spaces
        np.cumsum([0] + [len(x) for x in list_w1][:-1]),      # start excl. spaces
    )
]
lst_en = [
    (word, length, idx, 'en', pos_sp, pos)
    for word, length, idx, pos_sp, pos in zip(
        list_w2,
        list_w2_len,
        range(len(list_w2)),
        np.cumsum([0] + [len(x) + 1 for x in list_w2][:-1]),  # start incl. spaces
        np.cumsum([0] + [len(x) for x in list_w2][:-1]),      # start excl. spaces
    )
]

idx_word_de = 0
idx_word_en = 0
lst_words = []
idx_letter = 0

# stop at num_letters-1, otherwise the last word is checked
# again on the last idx_letter and collected twice
while idx_letter <= num_letters-1:
    # advance each pointer to the word that starts at idx_letter
    # (index 5 is the start position counted without spaces)
    while lst_de[idx_word_de][5] < idx_letter:
        idx_word_de += 1
    while lst_en[idx_word_en][5] < idx_letter:
        idx_word_en += 1

    if lst_de[idx_word_de][1] > lst_en[idx_word_en][1]:
        # the German word is longer: it wins
        lst_word_stats = lst_de[idx_word_de]
        str_word = lst_word_stats[0]
        idx_letter += len(str_word)
    elif lst_de[idx_word_de][1] == lst_en[idx_word_en][1]:
        # same length: record both languages in one tuple
        lst_word_stats = (lst_de[idx_word_de][0],
                          lst_de[idx_word_de][1],
                          (lst_de[idx_word_de][2], lst_en[idx_word_en][2]),
                          (lst_de[idx_word_de][3], lst_en[idx_word_en][3]),
                          (lst_de[idx_word_de][4], lst_en[idx_word_en][4]),
                          lst_de[idx_word_de][5])
        str_word = lst_word_stats[0]
        idx_letter += len(str_word)
    else:
        # the English word is longer: it wins
        lst_word_stats = lst_en[idx_word_en]
        str_word = lst_word_stats[0]
        idx_letter += len(str_word)
    lst_words.append(lst_word_stats)

Output of lst_words:

[('sonnen', 6, 0, 'de', 0, 0),
 ('empfindlichkeit', 15, 1, 'de', 7, 6),
 ('sun', 3, 10, 'en', 31, 21),
 ('oil', 3, 11, 'en', 35, 24),
 ('farb', 4, 6, 'de', 33, 27),
 ('palette', 7, (7, 14), ('de', 'en'), (38, 45), 31),
 ('sun', 3, (8, 15), ('de', 'en'), (46, 53), 38),
 ('creme', 5, (9, 16), ('de', 'en'), (50, 57), 41)]

Output legend:

chosen word | len | word_idx_of_lang | lang | letter_idx_lang_with_spaces | letter_idx_no_spaces
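
To turn lst_words into the two arrays asked for in the question, a short conversion is enough (a sketch; the tuple positions follow the legend above):

words = [entry[0] for entry in lst_words]
# entry[3] is either a single tag ('de') or a tuple ('de', 'en')
langs = [entry[3] if isinstance(entry[3], tuple) else (entry[3],)
         for entry in lst_words]
print(words)
# ['sonnen', 'empfindlichkeit', 'sun', 'oil', 'farb', 'palette', 'sun', 'creme']
print(langs)
# [('de',), ('de',), ('en',), ('en',), ('de',), ('de', 'en'), ('de', 'en'), ('de', 'en')]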