"Must use *unicode* string as text to tag" error while tagging with TreeTagger?
Following TreeTagger's website, I created a directory and downloaded the specified files. Then I installed treetaggerwrapper and, following its documentation, tried to tag some text as follows:
In [40]:
import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
tags = tagger.TagText("This is a very short text to tag.")
print tags
Then I got the following warnings and error:
WARNING:TreeTagger:Abbreviation file not found: english-abbreviations
WARNING:TreeTagger:Processing without abbreviations file.
ERROR:TreeTagger:Must use *unicode* string as text to tag, not <type 'str'>.
---------------------------------------------------------------------------
TreeTaggerError Traceback (most recent call last)
<ipython-input-40-37b912126580> in <module>()
1 import treetaggerwrapper
2 tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
----> 3 tags = tagger.TagText("This is a very short text to tag.")
4 print tags
/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in TagText(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, encoding, errors)
1236 return self.tag_text(text, numlines=numlines, tagonly=tagonly,
1237 prepronly=prepronly, tagblanks=tagblanks, notagurl=notagurl,
-> 1238 notagemail=notagemail, notagip=notagip, notagdns=notagdns)
1239
1240 # --------------------------------------------------------------------------
/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in tag_text(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, nosgmlsplit)
1302 # Raise exception now, with an explicit message.
1303 logger.error("Must use *unicode* string as text to tag, not %s.", type(text))
-> 1304 raise TreeTaggerError("Must use *unicode* string as text to tag.")
1305
1306 if isinstance(text, six.text_type):
TreeTaggerError: Must use *unicode* string as text to tag.
Where can I download the abbreviation files for English and Spanish? And how do I install treetaggerwrapper correctly?
The method only accepts unicode strings. Prefix your string literal with u to make it a unicode string:
tags = tagger.TagText(u"This is a very short text to tag.")
"This is a very short text to tag." is a str; once you add the u prefix, it is unicode:
In [12]: type("This is a very short text to tag.")
Out[12]: str
In [13]: type(u"This is a very short text to tag.")
Out[13]: unicode
If you get a str from another source (a file, a socket, and so on), you need to decode it first:
In [15]: s = "This is a very short text to tag."
In [16]: type(s)
Out[16]: str
In [17]: type(s.decode("utf-8"))
Out[17]: unicode
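If the text can arrive either as bytes or as unicode, a small helper keeps the decoding in one place. This is only a sketch: to_text is a hypothetical name (not part of treetaggerwrapper), and UTF-8 is an assumption about how your input is encoded.

```python
# The text type is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as a unicode text string, decoding byte strings.

    Hypothetical helper; the utf-8 default is an assumption about the input.
    """
    if isinstance(value, text_type):
        return value  # already unicode, nothing to do
    if isinstance(value, bytes):  # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding, errors)
    raise TypeError("expected text or bytes, got %r" % type(value))


raw = b"This is a very short text to tag."  # e.g. read from a file in binary mode
text = to_text(raw)
print(isinstance(text, text_type))
```

You would then call tagger.TagText(to_text(raw)) instead of passing raw directly. If the input is not actually UTF-8, the decode step raises UnicodeDecodeError, so pass the real encoding instead.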
The tagging scripts can be downloaded here.