Override of CorpusView.read_block() not taken into account

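I want to use NLTK to process a bunch of text files, splitting them on specific keywords. To do that, I am trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.
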
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView


class CustomCorpusView(StreamBackedCorpusView):

    def read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return []  # obviously this is only for debugging


class CustomCorpusReader(PlaintextCorpusReader):
    CorpusView = CustomCorpusView

However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account.

The output of
corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.words())

is the same as the output of
corpus = PlaintextCorpusReader("/path/to/files", ".*")

print(corpus.words())

I guess I am missing something obvious, but what?

The documentation actually suggests two ways of defining a custom corpus view:

  1. Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.
  2. Subclass StreamBackedCorpusView, and override the read_block() method.
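
In both cases the block reader has the same shape: it receives an open stream, consumes part of it, and returns a list of tokens. Here is a minimal sketch of such a reader that splits on keywords, using only the standard library, with io.StringIO standing in for the stream NLTK would supply. KEYWORDS and read_keyword_block are hypothetical names chosen for illustration, not part of NLTK.

```python
import io
import re

# Hypothetical keywords; replace them with the ones relevant to your files.
KEYWORDS = ("BEGIN", "END")
_SPLIT = re.compile("|".join(re.escape(k) for k in KEYWORDS))


def read_keyword_block(stream):
    """Read one line from the stream and split it on the keywords."""
    line = stream.readline()
    pieces = (piece.strip() for piece in _SPLIT.split(line))
    # Drop the empty strings left over around the keywords.
    return [piece for piece in pieces if piece]


# io.StringIO is only a stand-in for the file stream a corpus view would pass.
stream = io.StringIO("BEGIN first part END second part\nthird part\n")
first = read_keyword_block(stream)
second = read_keyword_block(stream)
third = read_keyword_block(stream)  # end of stream: nothing left to read

print(first)   # ['first part', 'second part']
print(second)  # ['third part']
print(third)   # []
```

A function like this could be passed as the block_reader argument (way 1) or serve as the body of read_block() (way 2).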

The documentation also notes that the first way is simpler, and indeed I managed to get it working as follows:

from nltk.corpus import PlaintextCorpusReader
from nltk.corpus.reader.util import concat

class CustomCorpusReader(PlaintextCorpusReader):

    def _custom_read_block(self, stream):
        block = stream.readline().split()
        print("wtf")
        return [] # obviously this is only for debugging

    def custom(self, fileids=None):
        return concat(
            [
                self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                for (fileid, enc) in self.abspaths(fileids, True)
            ]
        )


corpus = CustomCorpusReader("/path/to/files/", ".*")

print(corpus.custom())
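
As for why the subclass approach appeared to do nothing, my reading of the NLTK source (an assumption worth checking against your installed version) is that PlaintextCorpusReader.words() passes its own block reader to self.CorpusView(...), and that the StreamBackedCorpusView constructor stores a supplied block_reader on the instance under the name read_block. An instance attribute takes precedence over a method defined on the class, so the override would never be reached. The sketch below reproduces that mechanism in plain Python. View, CustomView, IgnoringView and default_reader are hypothetical stand-ins, not NLTK classes.

```python
class View:
    """Simplified stand-in for StreamBackedCorpusView (not the real class)."""

    def __init__(self, fileid, block_reader=None):
        self.fileid = fileid
        # Assumed behaviour: a supplied block_reader is stored on the
        # instance, where it shadows any read_block() defined on the class.
        if block_reader:
            self.read_block = block_reader

    def read_block(self, stream):
        raise NotImplementedError("Abstract method")


class CustomView(View):
    """Subclass that overrides read_block(), as in the question."""

    def read_block(self, stream):
        return ["custom"]


class IgnoringView(CustomView):
    """Hypothetical workaround: discard the reader passed by the caller."""

    def __init__(self, fileid, block_reader=None):
        super().__init__(fileid)


def default_reader(stream):
    """Stand-in for the reader that words() would pass in."""
    return ["default"]


shadowed = CustomView("file.txt", default_reader)
kept = IgnoringView("file.txt", default_reader)

print(shadowed.read_block(None))  # ['default']
print(kept.read_block(None))      # ['custom']
```

If that reading is right, a CustomCorpusView whose __init__ drops the block_reader argument before calling the parent constructor should let the overridden read_block() run when words() is called. I have not verified this against every NLTK version.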