How to get the offset of a matched n-gram in text
I want to match a string (an n-gram) in a text and have a way to get its offsets:
string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
So I want to get a tuple like ("matched", 44, 75), where 44 is the start and 75 is the end.
Here is the code I built, but it only works for unigrams.
def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match":  # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets)  # [("matched", start, end)]
Any tips for making it work for a sequence of words, i.e. an n-gram?
You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:
import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() keeps any regex metacharacters in the n-gram from being treated as pattern syntax
    matches = re.finditer(re.escape(string_to_match), text)
    for x in matches:
        # x.span() returns the (start, end) indices of the matched substring as a tuple; end is exclusive
        print(x.group(0), x.span())
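To get tuples in the ("matched", start, end) shape the question asks for, the same idea can be wrapped in a small helper. This is only a minimal sketch: the function name find_ngram_offsets and its label parameter are hypothetical, not part of the original answer, and the end index is taken straight from the match object, so it is exclusive.

```python
import re

def find_ngram_offsets(ngram, text, label="matched"):
    """Return a (label, start, end) tuple for every occurrence of ngram in text."""
    # Escape the n-gram so it is matched literally, not as a regex pattern.
    pattern = re.escape(ngram)
    # finditer yields non-overlapping matches from left to right.
    return [(label, m.start(), m.end()) for m in re.finditer(pattern, text)]

text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
print(find_ngram_offsets("many workers are very underpaid", text))
# [('matched', 44, 75)]
```

Note that this matches the n-gram as a plain substring, so "very under" would also match inside "very underpaid". If whole-word matching is needed, one option is to wrap the escaped pattern in \b word boundaries.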