Lambda doesn't find the NLTK data downloaded via AWS CodeBuild
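I have a script that uses NLTK on the Lambda service. I use a pipeline to automate all the development steps. When a new commit lands on the GitHub repository, AWS CodeBuild builds the project and deploys it to my Lambda function.
Script
- Environment: Python 3.6.5
- Uses nltk with the stopwords and wordnet packages
I used this solution for my code: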
version: 0.2
phases:
  install:
    commands:
      - echo "install step"
      - apt-get update
      - apt-get install zip -y
      - apt-get install python3-pip -y
      - pip install --upgrade pip
      - pip install --upgrade awscli
      # Define directories
      - export HOME_DIR=`pwd`
      - export NLTK_DATA=$HOME_DIR/nltk_data
  pre_build:
    commands:
      - echo "pre_build step"
      - cd $HOME_DIR
      - virtualenv venv
      - . venv/bin/activate
      # Install modules
      - pip install -U requests
      # NLTK download
      - pip install -U nltk
      - python -m nltk.downloader -d $NLTK_DATA wordnet stopwords
      - pip freeze > requirements.txt
  build:
    commands:
      - echo 'build step'
      - cd $HOME_DIR
      - mv $VIRTUAL_ENV/lib/python3.6/site-packages/* .
      - sudo zip -r9 algo.zip .
      - aws s3 cp --recursive --acl public-read ./ s3://hilightalgo/
      # Put the zip on the lambda function
      - aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip
  post_build:
    commands:
      - echo "Build: end"
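The different steps work fine. There are no errors, but when I try to use my Lambda function, it seems the NLTK data is missing.
See the result of the Lambda execution below: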
{"errorMessage":"\n**********************************************************************\n Resource \u001b[93mstopwords\u001b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \u001b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \u001b[0m\n Attempted to load \u001b[93mcorpora/stopwords\u001b[0m\n\n Searched in:\n - '/home/sbx_user1060/nltk_data'\n - '/var/lang/nltk_data'\n - '/var/lang/share/nltk_data'\n - '/var/lang/lib/nltk_data'\n - '/usr/share/nltk_data'\n - '/usr/local/share/nltk_data'\n - '/usr/lib/nltk_data'\n - '/usr/local/lib/nltk_data'\n**********************************************************************\n","errorType":"LookupError","stackTrace":[" File \"/var/task/lambda_function.py\", line 13, in lambda_handler\n return preprocessing.find_sentences('twitter.txt', 'english')\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 100, in find_sentences\n (data, data_stopwords) = sentence_tokenize(file, language)\n"," File \"./hilight_aglo_v2/preprocessing.py\", line 52, in sentence_tokenize\n stop_words = set(stopwords.words(language))\n"," File \"/var/task/nltk/corpus/util.py\", line 123, in __getattr__\n self.__load()\n"," File \"/var/task/nltk/corpus/util.py\", line 88, in __load\n raise e\n"," File \"/var/task/nltk/corpus/util.py\", line 83, in __load\n root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))\n"," File \"/var/task/nltk/data.py\", line 699, in find\n raise LookupError(resource_not_found)\n"]}
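I don't know why Lambda doesn't find the NLTK data. Does anyone have a solution to my problem?
According to the error message, NLTK searched for the corpus in these directories: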
Searched in:
- '/home/sbx_user1060/nltk_data'
- '/var/lang/nltk_data'
- '/var/lang/share/nltk_data'
- '/var/lang/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
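However, in the Lambda execution environment, access to the file system is somewhat restricted; these directories might not even exist, let alone be readable by your code. Moreover, your code (the .zip archive you created) is extracted to /var/task, which is basically the home directory.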
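To check this rather than guess, a temporary diagnostic handler along these lines could report which of the searched directories actually exist. This is only a sketch: the `existing_dirs` helper and the shape of the returned dictionary are my own choices for illustration, and the directory list is simply copied from your error message (the sbx_user name may differ between invocations).

```python
import os

# Directories copied from the "Searched in:" part of the error message.
NLTK_SEARCH_DIRS = [
    "/home/sbx_user1060/nltk_data",
    "/var/lang/nltk_data",
    "/var/lang/share/nltk_data",
    "/var/lang/lib/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]


def existing_dirs(candidates):
    """Return the subset of candidate paths that exist as directories."""
    return [path for path in candidates if os.path.isdir(path)]


def lambda_handler(event, context):
    # Temporary handler: report what the function can actually see.
    return {
        "searched_and_present": existing_dirs(NLTK_SEARCH_DIRS),
        "bundled_nltk_data_present": os.path.isdir("/var/task/nltk_data"),
    }
```

If "bundled_nltk_data_present" comes back true while "searched_and_present" is empty, the data was deployed but NLTK is not looking in the right place.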
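Fortunately, it seems you can tell nltk where to look for corpora by setting an environment variable. If I understand your build process correctly, you bundle the NLTK corpora into a subdirectory nltk_data, right next to your Python code and the required libraries. So in the Lambda execution environment, they will be found at /var/task/nltk_data.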
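Since I am only inferring the layout from your buildspec, it may be worth confirming that the archive really contains the corpora before deploying it. A minimal sketch using the standard zipfile module follows; the function name `has_bundled_resource` and its defaults are hypothetical, and algo.zip is the archive name from your build step.

```python
import zipfile


def has_bundled_resource(zip_path, resource="corpora/stopwords", data_dir="nltk_data"):
    """Return True if the archive has at least one entry under data_dir/resource."""
    prefix = "{}/{}".format(data_dir, resource)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Tolerate entries stored with a leading "./".
            if name.startswith("./"):
                name = name[2:]
            if name.startswith(prefix):
                return True
    return False


if __name__ == "__main__":
    import sys

    # Usage: python check_zip.py algo.zip
    if len(sys.argv) > 1:
        for res in ("corpora/stopwords", "corpora/wordnet"):
            print(res, has_bundled_resource(sys.argv[1], res))
```

Running this against algo.zip in the build phase would show whether both stopwords and wordnet made it into the package.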
So try setting the NLTK_DATA environment variable for your function at the end of your CodeBuild process:
aws lambda update-function-configuration \
--function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight \
--environment 'Variables={NLTK_DATA=/var/task/nltk_data}'
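If you would rather not manage the function configuration from the build, the same variable can be set from the code itself, as long as that happens before nltk is imported, because nltk reads NLTK_DATA when it builds its search path at import time. The sketch below assumes the corpora sit in an nltk_data folder at the root of the package; `bundled_nltk_data_dir` and `configure_nltk_data` are names I made up for illustration. LAMBDA_TASK_ROOT is set by the Lambda runtime and normally points to /var/task.

```python
import os


def bundled_nltk_data_dir(dirname="nltk_data"):
    """Path of the bundled corpora, relative to the deployed package root."""
    # Falls back to the current directory when running outside Lambda.
    task_root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    return os.path.join(task_root, dirname)


def configure_nltk_data():
    """Put the bundled directory first in NLTK_DATA, keeping existing entries."""
    data_dir = bundled_nltk_data_dir()
    current = os.environ.get("NLTK_DATA", "")
    parts = [p for p in current.split(os.pathsep) if p]
    if data_dir not in parts:
        parts.insert(0, data_dir)
    os.environ["NLTK_DATA"] = os.pathsep.join(parts)
    return data_dir


# Must run before the first `import nltk` anywhere in the process.
configure_nltk_data()


def lambda_handler(event, context):
    # Imported here so that NLTK_DATA is already set when nltk loads.
    from nltk.corpus import stopwords

    return len(stopwords.words("english"))
```

In your project the call to `configure_nltk_data()` would have to come before preprocessing.py is imported, since that module imports nltk.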