How to get the offset of a matched n-gram in text

I want to match a string (an n-gram) in a text, and I need a way to get its offset:

string_to_match = "many workers are very underpaid" text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

So I want to get a tuple like ("matched", 44, 75), where 44 is the start offset and 75 is the end.

Here is the code I wrote, but it only works for unigrams.

def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        # find the word starting from the end of the previous match
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        # keep the word together with its start and end offsets
        append((word, word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == string_to_match: # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets) # [(word, start, end)]

Any trick to make this work for sequences of words (n-grams)?
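
One possible way to extend the per-word approach above to n-grams is to slide a window of n consecutive word spans and compare the covered slice of the text with the query. This is only a sketch: it assumes the spacing inside string_to_match matches the text (single spaces, as in the example), and extract_ngram_offsets is just an illustrative name.

def extract_ngram_offsets(line, string_to_match):
    # per-word spans, computed the same way as in extract_offsets above
    spans = []
    running_offset = 0
    for word in line.split():
        start = line.index(word, running_offset)
        running_offset = start + len(word)
        spans.append((start, running_offset))

    n = len(string_to_match.split())  # number of words in the n-gram
    matches = []
    for i in range(len(spans) - n + 1):
        start, end = spans[i][0], spans[i + n - 1][1]
        if line[start:end] == string_to_match:
            matches.append(("matched", start, end))
    return matches

# extract_ngram_offsets(text, string_to_match) -> [('matched', 44, 75)]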

You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:

import re

def find_matches():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape guards against regex metacharacters in string_to_match
    for m in re.finditer(re.escape(string_to_match), text):
        # m.span() returns the start and end indices of the matched substring as a tuple
        print(m.group(0), m.span())
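
If you want tuples in exactly the shape the question asks for, the match object also exposes start() and end() directly. A minimal sketch (find_entities is just an illustrative name):

import re

def find_entities(text, string_to_match):
    # m.start() and m.end() are the same numbers m.span() returns as a pair
    return [("matched", m.start(), m.end())
            for m in re.finditer(re.escape(string_to_match), text)]

# find_entities(text, string_to_match) -> [('matched', 44, 75)]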