Overriding of CorpusView.read_block() not taken into account
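I want to use NLTK to process a bunch of text files, splitting them on specific keywords. To do so, I am trying to "subclass StreamBackedCorpusView, and override the read_block() method", as suggested by the documentation.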

    from nltk.corpus import PlaintextCorpusReader
    from nltk.corpus.reader.util import StreamBackedCorpusView

    class CustomCorpusView(StreamBackedCorpusView):
        def read_block(self, stream):
            block = stream.readline().split()
            print("wtf")
            return []  # obviously this is only for debugging

    class CustomCorpusReader(PlaintextCorpusReader):
        CorpusView = CustomCorpusView

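However, my knowledge of inheritance is rusty, and my override does not seem to be taken into account. The output of

    corpus = CustomCorpusReader("/path/to/files/", ".*")
    print(corpus.words())

is the same as the output of

    corpus = PlaintextCorpusReader("/path/to/files", ".*")
    print(corpus.words())

I guess I'm missing something obvious, but what?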
The documentation actually suggests two ways to define a custom corpus view:
- Call the StreamBackedCorpusView constructor, and provide your block reader function via the block_reader argument.
- Subclass StreamBackedCorpusView, and override the read_block() method.
It also indicates that the first way is simpler, and indeed I managed to get it working as follows:

    from nltk.corpus import PlaintextCorpusReader
    from nltk.corpus.reader.util import concat

    class CustomCorpusReader(PlaintextCorpusReader):
        def _custom_read_block(self, stream):
            # Block reader: consumes one line of the stream per call.
            block = stream.readline().split()
            print("wtf")
            return []  # obviously this is only for debugging

        def custom(self, fileids=None):
            # Build one view per file, passing the custom block reader explicitly.
            return concat(
                [
                    self.CorpusView(fileid, self._custom_read_block, encoding=enc)
                    for (fileid, enc) in self.abspaths(fileids, True)
                ]
            )

    corpus = CustomCorpusReader("/path/to/files/", ".*")
    print(corpus.custom())

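As for why the subclass approach had no visible effect, my best reading of the NLTK source is this: PlaintextCorpusReader.words() passes its own block reader to self.CorpusView, and the view's constructor stores a supplied block_reader on the instance, where it shadows any read_block() defined on the subclass. The sketch below reproduces that mechanism with small stand-in classes so it runs without NLTK or a corpus on disk. Everything named FakeStreamBackedView, FakePlaintextReader, KeywordView, NaiveReader and FixedReader is hypothetical and not part of NLTK, and the stand-ins work on lists of lines rather than file streams.

```python
# Stand-ins that mimic the behaviour I believe NLTK has; none of these
# classes exist in NLTK, they only illustrate the shadowing mechanism.

class FakeStreamBackedView:
    def __init__(self, lines, block_reader=None):
        self.lines = lines
        if block_reader:
            # An instance attribute shadows a read_block() defined on the class.
            self.read_block = block_reader

    def read_block(self, line):
        raise NotImplementedError("Abstract method")

    def tokens(self):
        # Collect the tokens produced by read_block() for every line.
        out = []
        for line in self.lines:
            out.extend(self.read_block(line))
        return out


class FakePlaintextReader:
    CorpusView = FakeStreamBackedView

    def __init__(self, lines):
        self.lines = lines

    def _read_word_block(self, line):
        return line.split()

    def words(self):
        # The reader always hands its own block reader to the view.
        return self.CorpusView(self.lines, self._read_word_block).tokens()


class KeywordView(FakeStreamBackedView):
    def read_block(self, line):
        # Custom behaviour: upper-case every token.
        return [w.upper() for w in line.split()]


class NaiveReader(FakePlaintextReader):
    # Swapping the view class alone is not enough: words() still passes
    # _read_word_block, which replaces KeywordView.read_block on the instance.
    CorpusView = KeywordView


class FixedReader(FakePlaintextReader):
    CorpusView = KeywordView

    def words(self):
        # No block reader is passed, so KeywordView.read_block() is used.
        return self.CorpusView(self.lines).tokens()


lines = ["foo bar", "baz"]
naive_words = NaiveReader(lines).words()
fixed_words = FixedReader(lines).words()
print(naive_words)  # ['foo', 'bar', 'baz']
print(fixed_words)  # ['FOO', 'BAR', 'BAZ']
```

If this reading is right, the subclass approach should also work with the real classes, provided the reader's words() is overridden to build self.CorpusView(path, encoding=enc) without a block reader. I have not checked that against every NLTK version, so treat it as an assumption.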