The similar method from the nltk module produces different results on different machines. Why?
I have been teaching some introductory classes on text mining with Python, and the classes tried the similar method with the practice texts provided. Some students were getting different results for text1.similar() than others.
All versions etc. were the same.
Does anyone know why these differences would occur? Thanks.
The code used at the command line:
$ python
>>> import nltk
>>> nltk.download() #here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet
The lists of terms returned by the similar method differ from user to user. They have many words in common, but they are not identical lists. All users were on the same OS and used the same versions of python and nltk.
I hope that makes the question clearer. Thanks.
In short:
It has to do with how python3 hashes the keys when the similar() function uses a Counter dictionary. See http://pastebin.com/ysAF6p6h
See also: How and why is the dictionary hashes different in python2 and python3?
In long:
Let's start with:
from nltk.book import *
The import here comes from https://github.com/nltk/nltk/blob/develop/nltk/book.py, which imports the nltk.text.Text object and reads several corpora into Text objects.
For example, this is how the text1 variable is read in nltk.book:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
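As a quick sanity check (my own addition, not part of the original walkthrough), the Text object simply wraps the corpus token list, so inspecting the first few tokens should show something like the Gutenberg header:
>>> moby.tokens[:8]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']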
Now, if we go to the code of the similar() function at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L377, we see this initialization the first time self._word_context_index is accessed:
def similar(self, word, num=20):
    """
    Distributional similarity: find other words which appear in the
    same contexts as the specified word; list most similar words first.

    :param word: The word used to seed the similarity search
    :type word: str
    :param num: The number of words to generate (default=20)
    :type num: int
    :seealso: ContextIndex.similar_words()
    """
    if '_word_context_index' not in self.__dict__:
        #print('Building word-context index...')
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = Counter(w for w in wci.conditions() for c in wci[w]
                     if c in contexts and not w == word)
        words = [w for w, _ in fd.most_common(num)]
        print(tokenwrap(words))
    else:
        print("No matches")
So this points us to the nltk.text.ContextIndex object, which is supposed to collect all the words that share similar context windows and store them. The docstring says:
A bidirectional index between words and their 'contexts' in a text.
The context of a word is usually defined to be the words that occur in
a fixed window around the word; but other definitions may also be used
by providing a custom context function.
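ContextIndex can also be exercised directly on a toy token list; here is a minimal sketch of my own showing how sharing a single context makes two words "similar":

from nltk.text import ContextIndex

# 'monstrous' and 'curious' both occur in the context ('the', 'whale'),
# so they end up as distributional neighbours
tokens = "the monstrous whale and the curious whale swam".split()
idx = ContextIndex(tokens,
                   filter=lambda x: x.isalpha(),
                   key=lambda s: s.lower())
print(idx.similar_words('monstrous'))   # expected: ['curious']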
By default, the similar() function initializes the _word_context_index with the default context settings, i.e. a window of one token to the left and one token to the right; see https://github.com/nltk/nltk/blob/develop/nltk/text.py#L40
@staticmethod
def _default_context(tokens, i):
    """One left token and one right token, normalized to lowercase"""
    left = (tokens[i-1].lower() if i != 0 else '*START*')
    right = (tokens[i+1].lower() if i != len(tokens) - 1 else '*END*')
    return (left, right)
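To make the context definition concrete, here is a standalone toy run of the same logic (my own illustration):

def default_context(tokens, i):
    """One left token and one right token, normalized to lowercase."""
    left = tokens[i - 1].lower() if i != 0 else '*START*'
    right = tokens[i + 1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['The', 'monstrous', 'whale', '.']
print(default_context(tokens, 1))   # ('the', 'whale')
print(default_context(tokens, 0))   # ('*START*', 'monstrous')
print(default_context(tokens, 3))   # ('whale', '*END*')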
From the similar() function, we see that it iterates through the words in the contexts stored in the word-context index, i.e. wci = self._word_context_index._word_to_contexts.
Essentially, _word_to_contexts is a dictionary whose keys are the words in the corpus and whose values are the left and right words, from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L55:
self._word_to_contexts = CFD((self._key(w), self._context_func(tokens, i))
                             for i, w in enumerate(tokens))
Here we see that it is a CFD, i.e. an nltk.probability.ConditionalFreqDist object, which does not include any smoothing of token probabilities; see the full code at https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1646.
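A hand-rolled miniature of that construction (my own sketch, reusing the toy default_context from above) shows what _word_to_contexts ends up holding:

from nltk.probability import ConditionalFreqDist as CFD

def default_context(tokens, i):
    left = tokens[i - 1].lower() if i != 0 else '*START*'
    right = tokens[i + 1].lower() if i != len(tokens) - 1 else '*END*'
    return (left, right)

tokens = ['the', 'monstrous', 'whale', 'and', 'the', 'monstrous', 'squid']
# condition = lowercased word, sample = its (left, right) context
word_to_contexts = CFD((w.lower(), default_context(tokens, i))
                       for i, w in enumerate(tokens))

print(list(word_to_contexts['monstrous']))
# [('the', 'whale'), ('the', 'squid')] -- each context counted once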
The only place where different results can arise is when the similar() function loops through the most_common words at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L402.
Given two keys in the Counter object with the same count, the word whose hash sorts lower will be printed first, and the hash of a key depends on the CPU's bit size; see http://www.laurentluce.com/posts/python-dictionary-implementation/
The whole process of finding the similar words is in itself deterministic, since:
- the corpus/input is fixed: Text(gutenberg.words('melville-moby_dick.txt'))
- the default context for every word is also fixed, i.e. self._word_context_index
- the computation of the conditional frequency distribution for _word_context_index._word_to_contexts is discrete
except when the function outputs the most_common list: when there are ties among the Counter values, it outputs the list of keys in order of their hashes.
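You can watch the string hashes change from process to process yourself; this little check is my own addition (str hash randomization has been enabled by default since Python 3.3):
$ python3 -c "print(hash('monstrous'))"
$ python3 -c "print(hash('monstrous'))"
The two printed integers will almost certainly differ, because each python3 process draws a fresh random seed for hashing str keys.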
In python2, there is no reason to get different outputs from separate runs of the following code on the same machine:
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
$ python
>>> from nltk.book import *
>>> text1.similar('monstrous')
>>> exit()
But in Python3, every run of text1.similar('monstrous') gives a different output; see http://pastebin.com/ysAF6p6h
Here is a simple experiment demonstrating the quirky hashing difference between python2 and python3:
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foobar', 1), ('foo', 1), ('bar', 1), ('barfoo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('barfoo', 1), ('foobar', 1), ('bar', 1), ('foo', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('foo', 1), ('barfoo', 1), ('bar', 1), ('foobar', 1)]
alvas@ubi:~$ python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
[('bar', 1), ('barfoo', 1), ('foobar', 1), ('foo', 1)]
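As a follow-up experiment of my own (not part of the original answer): pinning the seed through the PYTHONHASHSEED environment variable makes the python3 runs repeat themselves, since the arbitrary ordering above comes entirely from the per-process hash seed:
$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
$ PYTHONHASHSEED=0 python3 -c "from collections import Counter; x = Counter({'foo': 1, 'bar': 1, 'foobar': 1, 'barfoo': 1}); print(x.most_common())"
Both invocations should now print the same ordering, and the same trick makes text1.similar('monstrous') reproducible across runs. (On CPython 3.7+, dicts preserve insertion order, so most_common is stable there anyway; this mainly matters for the Python 3 versions discussed here.)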
In your example there are 40 other words that have exactly one context in common with the word 'monstrous'.
In the similar function, a Counter object is used to count the words with similar contexts, and then the most common ones (20 by default) are printed. Since all 40 have the same frequency, the order can differ.
From the doc of Counter.most_common:
Elements with equal counts are ordered arbitrarily
I checked the frequencies of the similar words with this code (essentially a copy of the relevant parts of the function's code):
from nltk.book import *
from nltk.util import tokenwrap
from collections import Counter  # was: from nltk.compat import Counter (no longer exported there)

word = 'monstrous'
num = 20

text1.similar(word)

wci = text1._word_context_index._word_to_contexts
if word in wci.conditions():
    contexts = set(wci[word])
    fd = Counter(w for w in wci.conditions() for c in wci[w]
                 if c in contexts and not w == word)
    words = [w for w, _ in fd.most_common(num)]
    # print(tokenwrap(words))
    print(fd)
    print(len(fd))
    print(fd.most_common(num))
Output (different runs give me different output):
Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})
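If a stable list is what you want for a class exercise, a workaround of my own (not an NLTK API) is to break the count ties alphabetically before truncating:

from collections import Counter

def stable_most_common(fd, num=20):
    """Sort by descending count, then alphabetically, so ties are deterministic."""
    return [w for w, _ in sorted(fd.items(), key=lambda kv: (-kv[1], kv[0]))[:num]]

fd = Counter({'doleful': 1, 'curious': 1, 'careful': 1, 'mean': 1})
print(stable_most_common(fd))   # ['careful', 'curious', 'doleful', 'mean'] on every run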