How could spaCy tokenize a hashtag as a whole?

In a sentence containing hashtags, such as a tweet, spaCy's tokenizer splits hashtags into two tokens:

import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]

Output:

[This, is, a, #, sentence, .]

I would like the hashtag to be tokenized as follows. Is that possible?

[This, is, a, #sentence, .]
1. You can do some pre- and post-processing of the string, which lets you work around the '#'-based tokenization and is easy to implement. For example:

>>> import re
>>> import spacy
>>> nlp = spacy.load('en')
>>> sentence = u'This is my twitter update #MyTopic'
>>> parsed = nlp(sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'#', u'MyTopic']
>>> new_sentence = re.sub(r'#(\w+)', r'ZZZPLACEHOLDERZZZ\1', sentence)
>>> new_sentence
u'This is my twitter update ZZZPLACEHOLDERZZZMyTopic'
>>> parsed = nlp(new_sentence)
>>> [token.text for token in parsed]
[u'This', u'is', u'my', u'twitter', u'update', u'ZZZPLACEHOLDERZZZMyTopic']
>>> [x.replace(u'ZZZPLACEHOLDERZZZ', '#') for x in [token.text for token in parsed]]
[u'This', u'is', u'my', u'twitter', u'update', u'#MyTopic']
2. You could try setting custom separators in spaCy's tokenizer. I am not aware of a built-in way to do that (one possible angle is sketched below).
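
Not part of the original answer, but one concrete way to read the "custom separators" idea: the tokenizer splits '#' off because '#' is one of its default prefix characters, so rebuilding the prefix regex without it keeps hashtags intact. A minimal sketch, assuming '#' does appear in nlp.Defaults.prefixes and that the en_core_web_sm model is installed:

import spacy

nlp = spacy.load('en_core_web_sm')  # model name is an assumption

# drop every prefix rule containing '#' and recompile the prefix regex
prefixes = [p for p in nlp.Defaults.prefixes if '#' not in p]
nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search

doc = nlp(u'This is a #sentence.')
print([t.text for t in doc])  # expected: ['This', 'is', 'a', '#sentence', '.']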

UPDATE: You can use a regex to find the spans of the tokens you want to keep as single tokens, and retokenize them with the span.merge method documented here: https://spacy.io/docs/api/span#merge

Merge example:

>>> import spacy
>>> import re
>>> nlp = spacy.load('en')
>>> my_str = u'Tweet hashtags #MyHashOne #MyHashTwo'
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#', u'NOUN'), (u'MyHashOne', u'NOUN'), (u'#', u'NOUN'), (u'MyHashTwo', u'PROPN')]
>>> indexes = [m.span() for m in re.finditer('#\w+',my_str,flags=re.IGNORECASE)]
>>> indexes
[(15, 25), (26, 36)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
... 
#MyHashOne
#MyHashTwo
>>> [(x.text,x.pos_) for x in parsed]
[(u'Tweet', u'PROPN'), (u'hashtags', u'NOUN'), (u'#MyHashOne', u'NOUN'), (u'#MyHashTwo', u'PROPN')]
>>> 
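
Side note (my addition, not part of the answer above): doc.merge and span.merge are deprecated since spaCy v2.1 and removed in v3. A minimal sketch of the same regex-based merge using the Doc.retokenize API instead, assuming a v2.1+/v3 install with the en_core_web_sm model:

import re
import spacy

nlp = spacy.load('en_core_web_sm')  # model name is an assumption
doc = nlp(u'Tweet hashtags #MyHashOne #MyHashTwo')

# find hashtag character offsets and merge the corresponding spans;
# the merges are applied when the retokenize() block exits
with doc.retokenize() as retokenizer:
    for m in re.finditer(r'#\w+', doc.text):
        span = doc.char_span(m.start(), m.end())
        if span is not None:  # skip matches that don't align with token boundaries
            retokenizer.merge(span)

print([t.text for t in doc])  # expected: ['Tweet', 'hashtags', '#MyHashOne', '#MyHashTwo']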

This is more of an add-on to @DhruvPathak's great answer and a shameless copy from the github thread linked below (where @csvance's answer is even better). spaCy has (since v2.0) an add_pipe method, which means you can wrap @DhruvPathak's answer in a function and add that step (conveniently) to your nlp processing pipeline, as shown below.

Citation starts here:

def hashtag_pipe(doc):
    merged_hashtag = False
    while True:
        for token_index,token in enumerate(doc):
            if token.text == '#':
                if token.head is not None:
                    start_index = token.idx
                    end_index = start_index + len(token.head.text) + 1
                    if doc.merge(start_index, end_index) is not None:
                        merged_hashtag = True
                        break
        if not merged_hashtag:
            break
        merged_hashtag = False
    return doc

nlp = spacy.load('en')
nlp.add_pipe(hashtag_pipe)

doc = nlp("twitter #hashtag")
assert len(doc) == 2
assert doc[0].text == 'twitter'
assert doc[1].text == '#hashtag'

Citation ends here; Check out how to add hashtags to the part of speech tagger #503 for the full thread.

PS: It becomes clear when you read the code, but for the copy & pasters: don't disable the parser :)
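
Another side note I'm adding (not from the cited thread): in spaCy v3, add_pipe expects the string name of a registered component rather than a function, and doc.merge is gone. A minimal sketch of the same idea under that API; the component name "hashtag_merger" is an illustrative choice:

import spacy
from spacy.language import Language

@Language.component("hashtag_merger")  # component name is an illustrative choice
def hashtag_merger(doc):
    # merge every '#' token with the token that immediately follows it
    with doc.retokenize() as retokenizer:
        for i, token in enumerate(doc[:-1]):
            if token.text == '#' and not token.whitespace_:
                retokenizer.merge(doc[i:i + 2])
    return doc

nlp = spacy.load("en_core_web_sm")  # model name is an assumption
nlp.add_pipe("hashtag_merger", first=True)  # run before tagger and parser

doc = nlp("twitter #hashtag")
assert [t.text for t in doc] == ['twitter', '#hashtag']

Running it with first=True means the tagger and parser already see the merged '#hashtag' token; the cited v2 code instead relies on token.head, which is why it has to run after the parser.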

I found this on github; it uses spaCy's Matcher:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', None, [{'ORTH': '#'}, {'IS_ASCII': True}])

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
matches = matcher(doc)
hashtags = []
for match_id, start, end in matches:
    hashtags.append(doc[start:end])

for span in hashtags:
    span.merge()

print([t.text for t in doc])

outputs:

['This', 'is', 'a', '#sentence', '.', 'Here', 'is', 'another', '#hashtag', '.', '#The', '#End', '.']

The matched hashtags are also available in the hashtags list:

print(hashtags)

Output:

[#sentence, #hashtag, #The, #End]
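
For newer spaCy versions (v2.2+/v3) this snippet needs two small adjustments (my addition, not part of the original answer): Matcher.add takes a list of patterns instead of an on_match callback as its second argument, and Span.merge is replaced by Doc.retokenize. A sketch under those assumptions, with the model name assumed:

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')  # model name is an assumption

matcher = Matcher(nlp.vocab)
matcher.add('HASHTAG', [[{'ORTH': '#'}, {'IS_ASCII': True}]])  # list of patterns, no callback

doc = nlp('This is a #sentence. Here is another #hashtag. #The #End.')
spans = [doc[start:end] for match_id, start, end in matcher(doc)]

# drop any overlapping matches, then merge the remaining spans in one pass
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):
        retokenizer.merge(span)

print([t.text for t in doc])  # same output as above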

I spent quite some time on this, so I figured I'd share what I came up with: subclassing the Tokenizer and adding the regex for hashtags to the default URL_PATTERN was the easiest solution for me. On top of that, I added a custom extension that matches hashtags so they can be identified:

import re
import spacy
from spacy.language import Language
from spacy.tokenizer import Tokenizer
from spacy.tokens import Token

nlp = spacy.load('en_core_web_sm')

def create_tokenizer(nlp):
    # contains the regex to match all sorts of urls:
    from spacy.lang.tokenizer_exceptions import URL_PATTERN

    # spacy defaults: when the standard behaviour is required, they
    # need to be included when subclassing the tokenizer
    prefix_re = spacy.util.compile_prefix_regex(Language.Defaults.prefixes)
    infix_re = spacy.util.compile_infix_regex(Language.Defaults.infixes)
    suffix_re = spacy.util.compile_suffix_regex(Language.Defaults.suffixes)

    # extending the default url regex with regex for hashtags with "or" = |
    hashtag_pattern = r'''|^(#[\w_-]+)$'''
    url_and_hashtag = URL_PATTERN + hashtag_pattern
    url_and_hashtag_re = re.compile(url_and_hashtag)

    # set a custom extension to match if token is a hashtag
    hashtag_getter = lambda token: token.text.startswith('#')
    Token.set_extension('is_hashtag', getter=hashtag_getter)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=url_and_hashtag_re.match
                     )

nlp.tokenizer = create_tokenizer(nlp)
doc = nlp("#spreadhappiness #smilemore so_great@good.com https://www.somedomain.com/foo")

for token in doc:
    print(token.text)
    if token._.is_hashtag:
        print("-> matches hashtag")

# returns: "#spreadhappiness -> matches hashtag #smilemore -> matches hashtag so_great@good.com https://www.somedomain.com/foo"

I also tried several ways to prevent spaCy from splitting hashtags or words with hyphens like "cutting-edge". My experience is that merging tokens afterwards can be problematic, because the pos tagger and the dependency parser have already used the wrong tokens for their decisions. Touching the infix, prefix, and suffix regexes is somewhat error-prone/complex, because you don't want your changes to produce side effects.

The simplest way is indeed, as pointed out before, to modify the token_match function of the tokenizer. This is a regular expression (applied via re.match) identifying tokens that will not be split. Instead of importing a specific URL pattern, I would rather extend whatever spaCy's default pattern is:

import re
import spacy
from spacy.tokenizer import _get_regex_pattern

nlp = spacy.load('en')

# get default pattern for tokens that don't get split
re_token_match = _get_regex_pattern(nlp.Defaults.token_match)
# add your patterns (here: hashtags and in-word hyphens)
re_token_match = rf"({re_token_match}|#\w+|\w+-\w+)"

# overwrite token_match function of the tokenizer
nlp.tokenizer.token_match = re.compile(re_token_match).match

text = "@Pete: choose low-carb #food #eatsmart ;-) 😋👍"
doc = nlp(text)

This yields:

['@Pete', ':', 'choose', 'low-carb', '#food', '#eatsmart', ';-)', '😋', '👍']
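
One caveat I should add (an assumption about newer versions, not part of the answer above): in spaCy v2.3+/v3 the default token_match can be None, because URL handling moved to the separate url_match, so _get_regex_pattern(nlp.Defaults.token_match) may fail there. A hedged sketch that builds the pattern either way:

import re
import spacy

nlp = spacy.load("en_core_web_sm")  # model name is an assumption

default_match = nlp.tokenizer.token_match
# if token_match is a bound match method of a compiled regex, recover its pattern string
base = default_match.__self__.pattern if default_match else None

extra = r"#\w+|\w+-\w+"  # hashtags and in-word hyphens
nlp.tokenizer.token_match = re.compile(f"({base}|{extra})" if base else f"({extra})").match

doc = nlp("@Pete: choose low-carb #food #eatsmart")
print([t.text for t in doc])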