Change nltk.download() path directory from default ~/nltk_data

I am trying to download/update the Python nltk packages on a computing server, and it returns this [Errno 122] Disk quota exceeded: error.

Specifically:

[nltk_data] Downloading package stopwords to /home/sh2264/nltk_data...
[nltk_data] Error downloading u'stopwords' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/corpora/stopwords.zip>: [Errno 122]
[nltk_data] Disk quota exceeded:
[nltk_data] u'/home/sh2264/nltk_data/corpora/stopwords.zip
False

How do I change the entire path for the nltk packages, and what other changes should I make to ensure nltk loads without errors?

According to the documentation:

By default, packages are installed in either a system-wide directory (if Python has sufficient access to write to it); or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.

Specify the download directory, for example:

nltk.download('treebank', download_dir='/mnt/data/treebank')

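This can be configured from the command line or in code (nltk.download(..., download_dir=...)), or through the GUI. Bizarrely, nltk seems to ignore its own NLTK_DATA environment variable entirely and defaults its download directory to a standard set of five paths, regardless of whether NLTK_DATA is defined or where it points, and regardless of whether nltk's five default directories even exist on the machine or architecture(!). Some of this is documented in Installing NLTK Data, although it is incomplete and somewhat buried; it is reproduced below with clearer formatting: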

Command line installation

The downloader will search for an existing nltk_data directory to install NLTK data. If one does not exist it will attempt to create one in a central location (when using an administrator account) or otherwise in the user’s filespace. If necessary, run the download command from an administrator account, or using sudo. The recommended system location is:

  • C:\nltk_data (Windows) ;
  • /usr/local/share/nltk_data (Mac) and
  • /usr/share/nltk_data (Unix).

You can use the -d flag to specify a different location (but if you do this, be sure to set the NLTK_DATA environment variable accordingly).

  • Run the command python -m nltk.downloader all

  • To ensure central installation, run the command: sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

  • But really they should say: sudo python -m nltk.downloader -d $NLTK_DATA all

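As for which path to recommend for NLTK_DATA, nltk doesn't really give any proper guidance, but it should be a generic standalone path that is not under any install tree (so not under <python-install-directory>/lib/site-packages) and not under any user directory. Hence /usr/local/share, /opt/share or similar. On MacOS 10.7+, /usr and /usr/local/ are hidden by default these days, so /opt/share may well be a better choice. Or do chflags nohidden /usr/local/share.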
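To make the "download to wherever NLTK_DATA points" idea concrete, here is a minimal sketch that picks the download directory from the environment. The helper name resolve_nltk_data_dir and the fallback /opt/share/nltk_data are assumptions made for illustration (the fallback just follows the /opt/share suggestion above); neither is part of nltk.

```python
import os


def resolve_nltk_data_dir(environ=None, fallback="/opt/share/nltk_data"):
    """Return the first directory listed in NLTK_DATA, or `fallback` if unset.

    Hypothetical helper, not an nltk API. NLTK_DATA may hold several
    directories separated by os.pathsep, so only the first one is used.
    """
    environ = os.environ if environ is None else environ
    entries = [p for p in environ.get("NLTK_DATA", "").split(os.pathsep) if p]
    return entries[0] if entries else fallback
```

The result can then be passed explicitly, e.g. nltk.download('stopwords', download_dir=resolve_nltk_data_dir()), so the download location no longer depends on whatever default nltk picks.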

The NLTK GUI can also be started from the PyCharm Community Edition Python console. Just issue 2 commands:

1) import nltk

2) nltk.download_gui()

But if you are behind a proxy server, the nltk GUI will not work from the console until you set the proxy settings there first:

SET HTTP_PROXY=proxy.mycompany.com:8080

Then it works.
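If changing the shell environment is awkward, roughly the same thing can be done from inside the Python session before calling the downloader. This is only a sketch: configure_proxy is a hypothetical helper, the http:// scheme in the example URL is an assumption (the SET line above has none), and setting HTTPS_PROXY as well is an assumption based on the download URL in the question being https. It relies on Python's urllib reading these variables when it opens a connection.

```python
import os


def configure_proxy(proxy_url):
    """Export proxy settings for the current process (hypothetical helper).

    Sets both HTTP_PROXY and HTTPS_PROXY so that urllib-based downloads
    started later in this process can pick them up.
    """
    os.environ["HTTP_PROXY"] = proxy_url
    os.environ["HTTPS_PROXY"] = proxy_url
    return proxy_url


# Example value taken from the SET command above, with an assumed scheme.
configure_proxy("http://proxy.mycompany.com:8080")
```

After this, nltk.download_gui() or nltk.download(...) can be run from the same session.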

You can also use nltk.download_shell() and follow the interactive prompts.

Also use nltk.data.path.append('/your/new/data/directory/path') to tell nltk to load data from the new data path.
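Putting the two halves together (download to a new directory, then tell nltk to look there), a minimal sketch could look like the following. The helper names add_data_dir and download_and_register are made up for this example; only nltk.download(..., download_dir=...) and nltk.data.path come from nltk itself.

```python
import os


def add_data_dir(search_path, new_dir):
    """Append new_dir to a search-path list unless it is already there.

    Hypothetical helper; in real use, pass nltk.data.path as search_path.
    """
    new_dir = os.path.abspath(os.path.expanduser(new_dir))
    if new_dir not in search_path:
        search_path.append(new_dir)
    return new_dir


def download_and_register(package, new_dir):
    """Download `package` into new_dir and make nltk search that directory."""
    import nltk  # imported lazily so add_data_dir works without nltk installed

    os.makedirs(new_dir, exist_ok=True)
    target = add_data_dir(nltk.data.path, new_dir)
    # nltk.download returns True on success and False on failure.
    return nltk.download(package, download_dir=target)
```

For the case in the question, something like download_and_register('stopwords', '/your/new/data/directory/path') would keep the data out of the quota-limited home directory. Note that nltk.data.path.append only lasts for the current process, so it has to be repeated (or NLTK_DATA set) in every session that loads the data.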