Separate the words in a sentence for a text classification problem
I am working on a text classification problem, and while annotating my data I found very long words that are in fact whole sentences, just not separated by spaces.

One example I came across while annotating a data point is:

Throughnumerousacquisitionsandtransitions, Anacompstillexiststodaywithagreaterfocusondocumentmanagement

Desired output:

Through numerous acquisitions and transitions, Anacomp still exists today with a greater focus on document management.

I have looked at various frameworks such as Keras and PyTorch to see whether they offer any functionality to solve this, but I could not find anything.
The problem you are trying to solve is text/word segmentation. It can be solved with ML using sequence models (e.g., an LSTM) and word embeddings (e.g., BERT). This link covers that approach for Chinese in detail; Chinese is written without spaces, so segmentation of this kind is a necessary preprocessing component for Chinese NLP tasks.
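To make the ML framing concrete, here is a minimal sketch (my own illustration, not taken from any particular framework) of how segmentation is usually cast as per-character tagging: a sequence model learns to predict, for each character, whether it begins a new word, and spaces are inserted before each predicted word start.

# Illustrative sketch only: word segmentation framed as per-character
# "begins-a-word" tagging. A sequence model such as an LSTM would be
# trained to predict these labels from the raw, unsegmented characters.

def make_labels(spaced_sentence):
    """Return (characters, labels) where label 1 marks a word start."""
    chars, labels = [], []
    for word in spaced_sentence.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if i == 0 else 0)
    return chars, labels

chars, labels = make_labels('Through numerous acquisitions')
print(list(zip(chars, labels))[:9])
# [('T', 1), ('h', 0), ('r', 0), ('o', 0), ('u', 0), ('g', 0), ('h', 0), ('n', 1), ('u', 0)]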
Below I describe an automaton-based approach using the Aho-Corasick algorithm. First do a pip install pyahocorasick.
For demonstration purposes I have only used the words from your input string. In the real world, you could just use a word dictionary such as Wordnet.
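If you do want a broader dictionary, one possibility (a sketch on my part, assuming nltk is installed and the wordnet corpus has been downloaded) is to pull single-word lemma names from NLTK's WordNet:

# Sketch: building a word dictionary from NLTK's WordNet
# (assumes: pip install nltk, then nltk.download('wordnet'))
from nltk.corpus import wordnet

# all_lemma_names() yields every lemma in WordNet; multi-word lemmas
# use '_' as a separator, so keep single words only
word_dictionary = [w for w in wordnet.all_lemma_names() if '_' not in w]

The demo below sticks to the hand-built word list.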
import ahocorasick

automaton = ahocorasick.Automaton()

input_text = 'Throughnumerousacquisitionsandtransitions, Anacompstillexiststodaywithagreaterfocusondocumentmanagement'

# Replace this with a large dictionary of words
word_dictionary = ['Through', 'numerous', 'acquisition', 'acquisitions', 'and', 'transitions', 'Anacomp', 'still',
                   'exists', 'today', 'with', 'a', 'greater', 'focus', 'on', 'document', 'management']

# Add dictionary words to the automaton
for idx, key in enumerate(word_dictionary):
    automaton.add_word(key, (idx, key))

# Build the Aho-Corasick automaton for searching
automaton.make_automaton()

# To resolve ambiguity: if there is a longer match, prefer it
previous_rng = range(0, 0)
previous_rs = set(previous_rng)

# Holds the end result dictionary
result = {}

# Search the input using the automaton
for end_index, (insert_order, original_value) in automaton.iter(input_text):
    start_index = end_index - len(original_value) + 1
    current_rng = range(start_index, end_index)
    current_rs = set(current_rng)

    # Ignore the previous match, as a longer match is available
    if previous_rs.issubset(current_rs):
        # Remove the ambiguous short entry in favour of the longer entry
        if previous_rng in result:
            del result[previous_rng]
        result[current_rng] = (insert_order, original_value)
        previous_rng = current_rng
        previous_rs = current_rs
    # If there is no overlap of indices, it is a new token; add it to the result
    elif previous_rs.isdisjoint(current_rs):
        previous_rng = current_rng
        previous_rs = current_rs
        result[current_rng] = (insert_order, original_value)
    # Ignore the current match, as it is a subset of the previous one
    else:
        continue

    assert input_text[start_index:start_index + len(original_value)] == original_value

for x in result:
    print(x, result[x])
This produces the results:
range(0, 6) (0, 'Through')
range(7, 14) (1, 'numerous')
range(15, 26) (3, 'acquisitions')
range(27, 29) (4, 'and')
range(30, 40) (5, 'transitions')
range(43, 49) (6, 'Anacomp')
range(50, 54) (7, 'still')
range(55, 60) (8, 'exists')
range(61, 65) (9, 'today')
range(66, 69) (10, 'with')
range(71, 77) (12, 'greater')
range(78, 82) (13, 'focus')
range(83, 84) (14, 'on')
range(85, 92) (15, 'document')
range(93, 102) (16, 'management')
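As a small follow-up (my own addition, not part of the answer's code above), you can reassemble the spaced sentence by sorting the matched ranges by their start offset. Note that punctuation is not recovered, and the single-letter word 'a' is lost: a one-character match yields an empty range, which the subset check treats as superseded by the next longer match.

# Follow-up sketch: rebuild the spaced sentence from the result ranges,
# ordered by their start offset in the input
words = [result[rng][1] for rng in sorted(result, key=lambda r: r.start)]
print(' '.join(words))
# Through numerous acquisitions and transitions Anacomp still exists today with greater focus on document management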