Word Sense Disambiguation for Arabic text with NLTK
NLTK lets me disambiguate text with nltk.wsd.lesk, e.g.
>>> from nltk.corpus import wordnet as wn
>>> from nltk.wsd import lesk
>>> sent = "I went to the bank to deposit money"
>>> ambiguous = "deposit"
>>> lesk(sent.split(), ambiguous, pos='v')
Synset('deposit.v.02')
PyWSD does the same, but it only works on English text.
NLTK supports the Arabic wordnet from the Open Multilingual WordNet, e.g.
>>> wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')
[u'\u0623\u064e\u0648\u0652\u062f\u064e\u0639\u064e']
>>> print wn.synsets('deposit', pos='v')[1].lemma_names(lang='arb')[0]
أَوْدَعَ
Moreover, the synsets are indexed by their Arabic lemmas:
>>> wn.synsets(u'أَوْدَعَ', lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
But how can I use NLTK to disambiguate Arabic text and extract concepts from a query? Is it possible to run the Lesk algorithm on Arabic text through NLTK?
It's a little hacky, but it might just work:
- Translate the sentence and the ambiguous word
- Run Lesk on the English version of the sentence
Try this:
alvas@ubi:~$ wget -O translate.sh http://pastebin.com/raw.php?i=aHgFzmMU
--2015-08-05 23:32:46-- http://pastebin.com/raw.php?i=aHgFzmMU
Resolving pastebin.com (pastebin.com)... 190.93.241.15, 190.93.240.15, 141.101.112.16, ...
Connecting to pastebin.com (pastebin.com)|190.93.241.15|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘translate.sh’
[ <=> ] 212 --.-K/s in 0s
2015-08-05 23:32:47 (9.99 MB/s) - ‘translate.sh’ saved [212]
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> text = 'لديه يودع المال في البنك'
>>> cmd = 'echo "{}" | bash translate.sh'.format(text)
>>> translation = os.popen(cmd).read()
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 193 0 40 100 153 21 83 0:00:01 0:00:01 --:--:-- 83
>>> translation
'He has deposited the money in the bank. '
>>> ambiguous = u'أَوْدَعَ'
>>> wn.synsets(ambiguous, lang='arb')
[Synset('entrust.v.02'), Synset('deposit.v.02'), Synset('commit.v.03'), Synset('entrust.v.01'), Synset('consign.v.02')]
>>> nltk.wsd.lesk(translation.split(), '', synsets=wn.synsets(ambiguous, lang='arb'))
Synset('entrust.v.02')
But as you can see, there are many limitations:
- Access to an MT system is not always easy (the bash script above, which calls the IBM API, won't be around forever; it comes from https://github.com/Rich-Edwards/fsharpwatson/blob/master/Command%20Line%20CURL%20Scripts)
- Machine translation will never be 100% accurate
- Finding the right lemma in the Open Multilingual WordNet is not as easy as the example makes it look; surface forms carry inflection and other morphological variation
- WordNet is never complete, especially for languages other than English
- WSD is never the 100% humans expect (people disagree about "senses" even among themselves; in the example above, some would say the WSD output is correct, while others would argue Synset('deposit.v.02') is the better choice)
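On that last point, it helps to remember that Lesk is nothing more than word overlap between the context and each sense's gloss, so the winner depends entirely on how the glosses happen to be worded. A self-contained sketch (the glosses below are hypothetical, not the real WordNet definitions):

```python
def simple_lesk_overlap(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context.

    sense_glosses: dict mapping a sense label to its gloss string.
    """
    context = set(w.lower() for w in context_words)
    best_sense, best_score = None, -1
    for sense, gloss in sense_glosses.items():
        score = len(context & set(gloss.lower().split()))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical glosses for two competing senses of "deposit":
glosses = {
    'deposit.v.02': 'put money or valuables in a bank account',
    'entrust.v.02': 'put something in the care or protection of someone',
}
context = 'He has deposited the money in the bank'.split()
print(simple_lesk_overlap(context, glosses))  # deposit.v.02 (3 overlaps vs 2)
```

A slightly different wording of either gloss could flip the result, which is exactly why Lesk disagrees with human judgments so often.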