Overriding of CorpusView.read_block() not taken into account

Question

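I would like to use NLTK to process a bunch of text files, splitting them on specific keywords. I am therefore trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.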

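from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView

class CustomCorpusView(StreamBackedCorpusView):

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

class CustomCorpusReader(PlaintextCorpusReader):
    # Tell the reader which view class to instantiate
    CorpusView = CustomCorpusView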

However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account.

The output of

corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is the same as the output of

corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I guess I am missing something obvious, but what?

Answer 1

The documentation actually suggests two ways of defining a custom corpus view:

Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.

Subclass StreamBackedCorpusView, and override the read_block() method.

It also says that the first way is easier, and indeed I managed to get it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
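
As for why the subclassing attempt in the question had no visible effect: as far as I can tell from the NLTK source, PlaintextCorpusReader.words() passes its own block reader (self._read_word_block) to the view constructor, and StreamBackedCorpusView.__init__ stores any block_reader it receives as an instance attribute named read_block. An instance attribute shadows a method defined on the class, so the subclass override is never reached. The sketch below reproduces that mechanism in plain Python, without NLTK. BaseView, CustomView and default_reader are hypothetical stand-ins for illustration, not NLTK names.

```python
# Simplified model of the constructor logic described above.
# This is an illustrative stand-in, not NLTK's actual source.
class BaseView:
    def __init__(self, fileid, block_reader=None):
        self.fileid = fileid
        if block_reader:
            # Stored on the instance, so it takes precedence over
            # any read_block() method defined on the class.
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("Abstract method")


class CustomView(BaseView):
    def read_block(self, stream):
        return ["custom"]


def default_reader(stream):
    # Plays the role of the reader's own block reader.
    return ["default"]


# Built the way the reader's words() builds its view: with a block reader.
shadowed = CustomView("a.txt", default_reader)
# Built without a block reader: the subclass override is used.
respected = CustomView("a.txt")

print(shadowed.read_block(None))   # ['default']
print(respected.read_block(None))  # ['custom']
```

If that reading is right, the subclassing route only takes effect when the view is constructed without a block_reader argument, for example from a reader method of your own such as custom() above.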
