How to output NLTK chunks to file?

I have this Python script in which I use the nltk library to parse, tokenize, tag and chunk some, let's say, random text from the web.

I need to format the output of chunked1, chunked2 and chunked3 and write it to a file. These are of type class 'nltk.tree.Tree'.

More specifically, I only need to write the lines that match the regular expressions chunkGram1, chunkGram2 and chunkGram3.

How can I do this?

#! /usr/bin/python2.7

import nltk
import re
import codecs

xstring = ["An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."]


def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        #print tokenized
        #print tagged

        chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
        chunkGram2 = r"""Chunk: {<JJ\w?>*<NNS>}"""
        chunkGram3 = r"""Chunk: {<NNP\w?>*<NNS>}"""

        chunkParser1 = nltk.RegexpParser(chunkGram1)
        chunked1 = chunkParser1.parse(tagged)

        chunkParser2 = nltk.RegexpParser(chunkGram2)
        chunked2 = chunkParser2.parse(tagged)

        chunkParser3 = nltk.RegexpParser(chunkGram3)
        chunked3 = chunkParser3.parse(tagged)

        #print chunked1
        #print chunked2
        #print chunked3

        # with codecs.open('path\to\file\output.txt', 'w', encoding='utf8') as outfile:

            # for i,line in enumerate(chunked1):
                # if "JJ" in line:
                    # outfile.write(line)
                # elif "NNP" in line:
                    # outfile.write(line)



processLanguage()

Currently, when I try to run it, I get this error:

Traceback (most recent call last):
  File "sentdex.py", line 47, in <module>
    processLanguage()
  File "sentdex.py", line 40, in processLanguage
    outfile.write(line)
  File "C:\Python27\lib\codecs.py", line 688, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, tuple found

EDIT: After @Alvas's answer I managed to do what I wanted. However, now I'd like to know how to strip all non-ascii characters from the text corpus. Example:

#store cleaned file into variable
with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()
infile.close

    def remove_non_ascii(line):
        return ''.join([i if ord(i) < 128 else ' ' for i in line])

    for i, line in enumerate(xstring):
        line = remove_non_ascii(line)

#tokenize and tag text
def processLanguage():
    for item in xstring:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        print tokenized
        print tagged
processLanguage()

The above was taken from another answer here on S/O, but it doesn't seem to work. What could be wrong? The error I get is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
not in range(128)

First of all, watch this video: https://www.youtube.com/watch?v=0Ef9GudbxXY

Now for the proper answer:

import re
import io

from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser


xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."


chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""
chunkParser1 = RegexpParser(chunkGram1)

chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
            for sent in sent_tokenize(xstring)]

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(str(chunk)+'\n\n')

[Output]:

alvas@ubi:~$ python test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(str(chunk)+'\n\n')
TypeError: must be unicode, not str
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
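If you only need the chunks themselves (the lines matching the grammar), rather than the whole sentence trees, you can filter the subtrees by label before writing. A minimal sketch, with a hand-built tree standing in for the parser output so it runs without any tagger data:

```python
from nltk.tree import Tree

# Hand-built stand-in for chunkParser1.parse(pos_tag(word_tokenize(sent))).
# Iterating over such a tree yields Tree objects and (word, tag) tuples,
# which is why writing its elements directly raises the TypeError above.
chunked = Tree('S', [
    ('An', 'DT'),
    Tree('Chunk', [('electronic', 'JJ'), ('library', 'NN')]),
    ('is', 'VBZ'),
    ('a', 'DT'),
    Tree('Chunk', [('focused', 'JJ'), ('collection', 'NN')]),
])

# Keep only the subtrees labelled 'Chunk' (the name used in the grammar)
# and stringify each one so it can be written to a file.
chunks = [str(sub) for sub in chunked.subtrees()
          if sub.label() == 'Chunk']
```

Writing `'\n'.join(chunks)` to the file then gives one chunk per line, e.g. `(Chunk electronic/JJ library/NN)`.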

If you really have to stick with python2.7:

with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(unicode(chunk)+'\n\n')

[Output]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
Traceback (most recent call last):
  File "test2.py", line 18, in <module>
    fout.write(unicode(chunk)+'\n\n')
NameError: name 'unicode' is not defined

And if you have to stick with py2.7, this is highly recommended:

from six import text_type
with io.open('outfile', 'w', encoding='utf8') as fout:
    for chunk in chunked:
        fout.write(text_type(chunk)+'\n\n')

[Output]:

alvas@ubi:~$ python test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC
alvas@ubi:~$ python3 test2.py
alvas@ubi:~$ head outfile 
(S
  An/DT
  (Chunk electronic/JJ library/NN)
  (/:
  also/RB
  referred/VBD
  to/TO
  as/IN
  (Chunk digital/JJ library/NN)
  or/CC

Your code has several problems, but the main culprit is that your for loop does not modify xstring:

I will address all the issues with your code here:

You cannot write paths like that with a single \, because \t will be interpreted as a tab character and \f as a form feed. You have to double them. I know it was just an example here, but this kind of confusion comes up often:

with open('path\to\file.txt', 'r') as infile:
    xstring = infile.readlines()

The infile.close line below is wrong. It does not call the close method, it actually does nothing at all. Furthermore, your file was already closed by the with clause; if you see this line in any answer anywhere, please downvote the answer and comment that file.close is wrong, it should be file.close().
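A quick check of that point, using a throwaway temp file (the path is made up just for the demo):

```python
import os
import tempfile

# Throwaway file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:
    f.write('some text')

# The with block has already closed the file...
closed_by_with = f.closed          # True

# ...whereas `f.close` without parentheses merely references the bound
# method object; nothing is called here.
method_object = f.close
```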

The following should work, but you need to be aware that it replaces every non-ascii character with ' ', which will mangle words such as naïve and café:
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])
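If mangling accented words is a concern, one alternative sketch is to decompose each character first with unicodedata.normalize and drop only the non-ascii combining marks, so naïve becomes naive rather than na ve:

```python
# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(line):
    # NFKD splits a character like 'ï' into 'i' plus a combining
    # diaeresis; dropping the non-ascii code points afterwards leaves
    # the plain letter behind instead of a blank.
    decomposed = unicodedata.normalize('NFKD', line)
    return ''.join(c for c in decomposed if ord(c) < 128)

clean = strip_accents(u'na\xefve caf\xe9')
```

Note that this only helps for decomposable characters; symbols with no ascii base (em dashes, CJK text, etc.) are still removed outright.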

But here is the reason why your code fails with the unicode exception: you are not modifying the elements of xstring at all. That is, yes, you are computing the line with the non-ascii characters removed, but that is a new value which is never stored back into the list:

for i, line in enumerate(xstring):
   line = remove_non_ascii(line)

It should be:

for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)

Or, my favourite, the very pythonic:

xstring = [remove_non_ascii(line) for line in xstring]
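The difference is easy to verify with a tiny hypothetical two-line corpus:

```python
def remove_non_ascii(line):
    return ''.join([i if ord(i) < 128 else ' ' for i in line])

xstring = [u'caf\xe9 menu', u'plain line']

# Rebinding the loop variable leaves the list untouched...
for line in xstring:
    line = remove_non_ascii(line)
first_after_rebinding = xstring[0]       # still u'caf\xe9 menu'

# ...while assigning back through the index actually updates it.
for i, line in enumerate(xstring):
    xstring[i] = remove_non_ascii(line)
first_after_assignment = xstring[0]      # the accent is now a space
```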

Even though these unicode errors happen mostly because you are handling pure unicode text with Python 2.7, and recent Python 3 releases are way ahead of it in that regard, I would recommend that you upgrade to Python 3.4+ soon if the task you have just started is something you will keep working on.