Phrase matching in lists

Suppose I have a list representing a sentence, e.g.:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']

and a list of place names:

places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

How can I end up with:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('PLACE', 'Ta'), ('PLACE', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf')]

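A quick note:

Ta, for example, must be tagged only when it is directly next to Xellule. If it is found in another context in the sentence, it should not be tagged as PLACE. E.g. in Ta Buni mar Ta Xellule... only the second Ta should be tagged.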

Here is a sample of my places list:

 'Ras il Huichile',
 'Ras il Hued',
 'Ta Richardu',
 'Roma',
 'Russilion',
 'La Rukiha',
 'Irrukiha ta il Bayada',
 'Casalis Milleri',
 'Ta Sabat',
 'Casalis Zebug',
 'Ta Zagra',
 'Sagra in  Ras il Hued',
 'Ta Isalme'

Here is an example sentence:

terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus

Although "in" appears in 'Sagra in  Ras il Hued', it should not be tagged as a place here.

Just iterate and test:

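result = []
for word in sent:
    # Substring test: True when `word` occurs anywhere inside a place name,
    # so word order and adjacency are not checked.
    isPlace = False
    for place in places:
        if word in place:
            isPlace = True
    if isPlace:
        result.append(('PLACE', word))
    else:
        result.append(('O', word))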

Try something like this:

res = []

for x in sent:
    for place in places:
        if x in place:
            # add 'PLACE' once, on the first place name that contains x
            res.append(('PLACE', x))
            break
    else:
        # no place name contained x, so add 'O'
        res.append(('O', x))

print(res)

OK, I updated my answer based on your edit:

from functools import reduce

sent = "terras ipsius Azar vocatas Ta Ta Zagra Ta Zagra Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus"
places = [ 'Ras il Huichile', 'Ras il Hued', 'Ta Richardu', 'Roma', 'Russilion', 'La Rukiha', 'Irrukiha ta il Bayada',
'Casalis Milleri', 'Ta Sabat', 'Casalis Zebug', 'Ta Zagra', 'Sagra in  Ras il Hued', 'Ta Isalme', 'Ta Xellule', 'Ginen Chagem',
'Deyr Issafisaf']

places_map = {p:[('PLACE', l) for l in p.split()] for p in places}

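def find_places(sent, places):
    # Base case: no place names left, so every remaining word is tagged 'O'.
    if len(places) == 0:
        return [('O', l) for l in sent.split()]

    place = places[0]
    remaining_places = places[1:]

    # Split the sentence on this place name, tag each piece recursively with
    # the remaining names, and rejoin the pieces with the tagged place between them.
    sent_splits = sent.split(place)
    return reduce(lambda a,b:a+places_map[place]+b, [find_places(s, remaining_places) for s in sent_splits])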

print(find_places(sent, places))

The output is:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('O', 'Ta'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('O', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf'), ('O', 'cum'), ('O', 'iuribus'), ('O', 'suis'), ('O', 'omnibus')]

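So I recursively find one place in the sentence, convert it to the format you want, then do the same on the rest of the sentence with the remaining places, and finally concatenate the pieces.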
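
If it helps, the same goal can also be reached on the token list rather than on the raw string. This is only a sketch, not the code from the answer above: `tag_places` is a name I made up, and it assumes place names are compared token by token with exact, case-sensitive equality, trying the longest name first.

```python
def tag_places(tokens, places):
    # Hypothetical helper, not part of the answer above.
    # Each place name becomes a tuple of tokens; str.split() with no
    # argument also collapses repeated spaces ('Sagra in  Ras il Hued').
    phrases = {tuple(p.split()) for p in places}
    max_len = max((len(p) for p in phrases), default=0)

    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so longer names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in phrases:
                tagged.extend(('PLACE', t) for t in tokens[i:i + n])
                i += n
                break
        else:
            # No place name starts at this position.
            tagged.append(('O', tokens[i]))
            i += 1
    return tagged


tokens = 'Ta Buni mar Ta Xellule in contrata Deyr Issafisaf'.split()
names = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf', 'Sagra in  Ras il Hued']
print(tag_places(tokens, names))
```

Here the first `Ta` and the lone `in` stay `'O'`, because neither starts a complete place name at its position.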

A suggestion based only on list comprehensions, for comprehension lovers:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

p      = [i for place in places for i in place.split()]
result = [('PLACE',word) if word in p else ('O',word) for word in sent]

print(result)
# [('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('PLACE', 'Ta'),
#  ('PLACE', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), 
#  ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf')]
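
Staying with comprehensions, a variant that only tags a word when the whole place name appears in order might look like the following. It's a sketch under my own assumptions: `place_positions` is a hypothetical helper, matching is exact and case-sensitive, and the example sentence is built from the `Ta Buni mar Ta Xellule` case in the question.

```python
def place_positions(tokens, places):
    # Hypothetical helper: indices of every token that lies inside a
    # complete, in-order occurrence of some place name.
    phrases = [p.split() for p in places]
    return {i + k
            for ph in phrases
            for i in range(len(tokens) - len(ph) + 1)
            if tokens[i:i + len(ph)] == ph
            for k in range(len(ph))}


sample = ['Ta', 'Buni', 'mar', 'Ta', 'Xellule', 'in', 'contrata', 'Deyr', 'Issafisaf']
sample_places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

hits = place_positions(sample, sample_places)
tagged = [('PLACE' if i in hits else 'O', word) for i, word in enumerate(sample)]
print(tagged)
```

Tagging by position rather than by word is what lets the two occurrences of `Ta` get different labels.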

Another approach is to use join on places to build a single string, then check whether the word is in that string:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

newList = [('Places',elem) if elem in " ".join(places) else ('O',elem) for elem in sent]
print(newList)

Output:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('Places', 'Ta'), ('Places', 'Xellule'), ('O', 'et'), ('Places', 'Ginen'), ('Places', 'Chagem'), ('Places', 'in'), ('O', 'contrata'), ('Places', 'Deyr'), ('Places', 'Issafisaf')]
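
Because the check above is a substring test on the joined string, `in` gets tagged (it occurs inside `Ginen`). One possible way around that is a regular expression with word boundaries, run on the untokenized sentence. A rough sketch: `tag_with_regex` is a hypothetical name, and it assumes every place name starts and ends with a word character so that `\b` behaves as expected.

```python
import re


def tag_with_regex(text, places):
    # Hypothetical helper, not taken from any answer above.
    if not places:
        return [('O', w) for w in text.split()]

    # Longest names first so the alternation prefers them; whitespace
    # inside a name is matched loosely with \s+.
    ordered = sorted(places, key=len, reverse=True)
    alternatives = [r'\s+'.join(re.escape(w) for w in p.split()) for p in ordered]
    pattern = re.compile(r'\b(?:' + '|'.join(alternatives) + r')\b')

    tagged, pos = [], 0
    for m in pattern.finditer(text):
        # Words between the previous match and this one are not places.
        tagged.extend(('O', w) for w in text[pos:m.start()].split())
        tagged.extend(('PLACE', w) for w in m.group().split())
        pos = m.end()
    tagged.extend(('O', w) for w in text[pos:].split())
    return tagged


text = 'terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf'
place_names = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']
regex_result = tag_with_regex(text, place_names)
print(regex_result)
```

With this pattern `in` is left as `'O'`, since it only matches as part of a full name.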