Read the complete Penn Treebank dataset from a local directory

Question

I have the complete Penn Treebank dataset and I want to read it using ptb from nltk.corpus. But the instructions I found say:

If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:

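But I want to keep the dataset in a local directory and load it from there rather than from nltk_data/corpora/ptb. ptb always searches that directory; how can I give ptb a path so that it searches a directory of my choosing? Is there any way to do this? I have searched the web thoroughly and tried several approaches, but none of them worked for me.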

Answer 1

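You can keep the corpus files in a local directory and simply add symlinks from the nltk_data/corpora folder to the corpus location, as the passage you quoted suggests. But if you cannot modify nltk_data, or just dislike the idea of a needless round trip through the nltk_data directory, read on.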
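If you do go the symlink route, something like the following sketch would set up the links. The function name link_treebank and both paths in the usage comment are made up for illustration, and it assumes a platform where you are allowed to create symlinks (on Windows that may need extra privileges).

```python
import os
from pathlib import Path


def link_treebank(treebank_root, nltk_data_dir):
    """Symlink the WSJ and BROWN directories into <nltk_data_dir>/corpora/ptb.

    Returns the list of links that were newly created.
    """
    ptb_dir = Path(nltk_data_dir) / "corpora" / "ptb"
    ptb_dir.mkdir(parents=True, exist_ok=True)

    created = []
    for section in ("WSJ", "BROWN"):
        source = Path(treebank_root).resolve() / section
        target = ptb_dir / section
        # lexists() also detects broken links, so nothing is overwritten.
        if source.is_dir() and not os.path.lexists(target):
            target.symlink_to(source, target_is_directory=True)
            created.append(target)
    return created


# Hypothetical usage; replace both paths with your own:
# link_treebank("/data/treebank_3/parsed/mrg", Path.home() / "nltk_data")
```

The ptb package you download still has to be in place, since the links only add the WSJ and BROWN directories next to whatever that package provides.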

The object ptb is just a shortcut to a corpus reader object that has been initialized with the appropriate settings for the Penn Treebank corpus. It is defined like this (in nltk/corpus/__init__.py):

ptb = LazyCorpusLoader( # Penn Treebank v3: WSJ and Brown portions
    'ptb', CategorizedBracketParseCorpusReader, r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG',
    cat_file='allcats.txt', tagset='wsj')
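The regular expression in that definition decides which files count as part of the corpus, so it is worth checking it against your own layout. Here is a small standard-library sketch; the helper name is_ptb_fileid and the sample paths are invented for illustration, and re.fullmatch is only my approximation of how nltk applies the pattern to paths relative to the corpus root.

```python
import re

# Pattern copied from the ptb definition above.
PTB_FILEIDS = r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG'


def is_ptb_fileid(relative_path):
    """Return True if a path relative to the corpus root fits the pattern."""
    return re.fullmatch(PTB_FILEIDS, relative_path) is not None


# Made-up sample paths, relative to the corpus root.
for path in ("WSJ/00/WSJ_0001.MRG",
             "BROWN/CF/CF01.MRG",
             "wsj/00/wsj_0001.mrg",
             "allcats.txt"):
    print(path, is_ptb_fileid(path))
```

Only the first two sample paths are accepted. The pattern is case-sensitive, so if your copy of the Treebank uses lowercase directory and file names, you would need to rename the files or adjust the pattern.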

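You can ignore the LazyCorpusLoader part; it is used because nltk defines a large number of corpus endpoints, most of which are never loaded in any given Python program. Instead, create a corpus reader by instantiating CategorizedBracketParseCorpusReader directly. If your corpus is laid out exactly like the ptb corpus, you would call it like this: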

from nltk.corpus.reader import CategorizedBracketParseCorpusReader
myreader = CategorizedBracketParseCorpusReader(r"<path to your corpus>", 
    r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG', 
    cat_file='allcats.txt', tagset='wsj')

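As you can see, you provide the path to the real location of your files and leave the remaining arguments unchanged: they are a regexp of the file names to include in the corpus, a file that maps corpus files to categories, and the tagset to use. The object you create will work just like the ptb or treebank corpus reader (except that it is not created lazily).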
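To check that a reader built this way really picks up files outside nltk_data, here is a sketch that runs against a tiny stand-in corpus. The one-sentence file, the "news" category, and the helper name make_reader are invented for illustration; a real Treebank installation has far more files and its own allcats.txt.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedBracketParseCorpusReader

PTB_FILEIDS = r'(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG'


def make_reader(root):
    """Build a ptb-style reader rooted at an arbitrary directory."""
    return CategorizedBracketParseCorpusReader(
        str(root), PTB_FILEIDS, cat_file='allcats.txt', tagset='wsj')


# Toy stand-in for a real installation (contents are made up).
toy_root = Path(tempfile.mkdtemp())
(toy_root / 'WSJ' / '00').mkdir(parents=True)
(toy_root / 'WSJ' / '00' / 'WSJ_0001.MRG').write_text(
    '( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )\n', encoding='utf8')
# Category file: one "<fileid> <category>" entry per line.
(toy_root / 'allcats.txt').write_text(
    'WSJ/00/WSJ_0001.MRG news\n', encoding='utf8')

reader = make_reader(toy_root)
fileids = reader.fileids()      # files under toy_root that match the pattern
words = list(reader.words())    # leaf tokens of the bracketed parse
print(fileids, words)
```

With your own data you would skip the toy setup and pass the directory that contains WSJ, BROWN and allcats.txt to make_reader.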

Tags: python, nltk, penn-treebank