Tuple has no attribute 'isdigit'
I need to do some text processing with the NLTK module, but I get this error:
AttributeError: 'tuple' object has no attribute 'isdigit'
Does anyone know how to deal with this error?
Traceback (most recent call last):
  File "preprocessing-edit.py", line 36, in <module>
    postoks = nltk.tag.pos_tag(tok)
NameError: name 'tok' is not defined
PS C:\Users\moham\Desktop\Presentation> python preprocessing-edit.py
Traceback (most recent call last):
  File "preprocessing-edit.py", line 37, in <module>
    postoks = nltk.tag.pos_tag(tok)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\__init__.py", line 111, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\__init__.py", line 82, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 153, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 153, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "c:\python34\lib\site-packages\nltk-3.1-py3.4.egg\nltk\tag\perceptron.py", line 228, in normalize
    elif word.isdigit() and len(word) == 4:
AttributeError: 'tuple' object has no attribute 'isdigit'
import nltk

with open("SHORT-LIST.txt", "r", encoding='utf8') as myfile:
    text = (myfile.read().replace('\n', ''))
#text = "program managment is complicated issue for human workers"

# Used when tokenizing words
sentence_re = r'''(?x)      # set flag to allow verbose regexps
      ([A-Z])(\.[A-Z])+\.?  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

tok = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(tok)
#print (postoks)
tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted

def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(w) for w,t in leaf if acceptable_word(w) ]
        yield term

terms = get_terms(tree)

with open("results.txt", "w+") as logfile:
    for term in terms:
        for word in term:
            result = word
            logfile.write("%s\n" % str(word))
            # print (word),
        # (print)
logfile.close()
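In NLTK 3.1 the default tagger is the Perceptron tagger, and 3.1 is currently the latest release. After upgrading, all of my nltk.regexp_tokenize calls stopped working correctly and all of my nltk.pos_tag calls started raising the error above.
My current workaround is to go back to the earlier NLTK 3.0.1 release, which works. I am not sure whether this is a bug in the current version of NLTK.
Installation instructions for NLTK 3.0.4 on Ubuntu. Run the following steps from your home directory or any other directory.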
$ wget https://github.com/nltk/nltk/archive/3.0.4.tar.gz
$ tar -xvzf 3.0.4.tar.gz
$ cd nltk-3.0.4
$ sudo python3.4 setup.py install
Another simple approach is to change this part:
tok = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(tok)
and replace it with NLTK's standard word tokenizer:
toks = nltk.word_tokenize(text)
postoks = nltk.tag.pos_tag(toks)
In theory, there should not be much difference in performance or results.
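Whichever tokenizer you end up with, it can help to check that the tokens really are plain strings before passing them to pos_tag, since the traceback above shows the tagger calling .isdigit() on each token. A minimal sketch using only the standard library; the helper name ensure_string_tokens is hypothetical and not part of NLTK:

```python
def ensure_string_tokens(tokens):
    """Return tokens as a list, raising TypeError if any item is not a str.

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    tokens = list(tokens)
    bad = [t for t in tokens if not isinstance(t, str)]
    if bad:
        # Show the first offending item so the cause is easy to spot.
        raise TypeError("expected str tokens, got e.g. %r" % (bad[0],))
    return tokens


# Plain string tokens pass through unchanged.
print(ensure_string_tokens(["program", "management"]))
```

Calling it as ensure_string_tokens(tok) right before the pos_tag line would fail early with a clearer message if the tokenizer returned tuples.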
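For later versions of NLTK, a change to the regular expression fixes the problem. I found the solution at https://gist.github.com/alexbowe/879414#gistcomment-1704727
The original expression uses parentheses for grouping, so I changed all of them to non-capturing groups: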
sentence_re = r'(?:(?:[A-Z])(?:\.[A-Z])+\.?)|(?:\w+(?:-\w+)*)|(?:\$?\d+(?:\.\d+)?%?)|(?:\.\.\.)|(?:[][.,;"\'?():-_`])'
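To see why switching to non-capturing groups matters, here is a small sketch with the standard re module only. It assumes the tokenizer passes the result of re.findall on to the tagger, which is what the tuple in the traceback suggests. The two patterns are shortened versions of sentence_re and the sample text is made up:

```python
import re

text = "program management costs $12.40"

# Shortened pattern with capturing groups, as in the question.
capturing = r"\w+(-\w+)*|\$?\d+(\.\d+)?%?"
# The same pattern with every group made non-capturing.
non_capturing = r"\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?"

# With two or more capturing groups, findall returns one tuple of groups
# per match instead of the matched text.
tuples = re.findall(capturing, text)
# With no capturing groups, findall returns the matched strings.
strings = re.findall(non_capturing, text)

print(tuples)
print(strings)
```

The first result is a list of tuples, and a tuple has no isdigit method, which matches the AttributeError above. The second result is a list of plain strings that a tagger can handle.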