Install the Python Hugging Face datasets package without an internet connection from the Python environment
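I don't have internet access from my Python environment. I would like to install this library.
I also noticed this page, which has the files required for the package. I installed the package by copying that file into my Python environment and then running the command below: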
pip install 'datasets_package/datasets-1.18.3.tar.gz'
Successfully installed datasets-1.18.3 dill-0.3.4 fsspec-2022.1.0 multiprocess-0.70.12.2 pyarrow-6.0.1 xxhash-2.0.2
But when I try the code below
import datasets
datasets.load_dataset('imdb', split =['train', 'test'])
it throws this error:
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.18.3/datasets/imdb/imdb.py (error 403)
I can access the file https://raw.githubusercontent.com/huggingface/datasets/1.18.3/datasets/imdb/imdb.py from outside my Python environment.
Which files should I copy, and what other code changes do I need, so that the line datasets.load_dataset('imdb', split=['train', 'test']) works?
#Update 1=====================
I followed the suggestions below and copied the following files into my Python environment. So
os.listdir('huggingface_imdb_data/')
['dummy_data.zip',
'dataset_infos.json',
'imdb.py',
'README.md',
'aclImdb_v1.tar.gz']
The last file comes from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz, and the other files come from github.com/huggingface/datasets/tree/master/datasets/imdb
Then I tried
import datasets
#datasets.load_dataset('imdb', split =['train', 'test'])
datasets.load_dataset('huggingface_imdb_data/aclImdb_v1.tar.gz')
But I get the following error :(
HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/datasets/huggingface_imdb_data/aclImdb_v1.tar.gz?full=true
I also tried
datasets.load_from_disk('huggingface_imdb_data/aclImdb_v1.tar.gz')
but got this error:
FileNotFoundError: Directory huggingface_imdb_data/aclImdb_v1.tar.gz is neither a dataset directory nor a dataset dict directory.
Unfortunately, Method 1 does not work because it is not yet supported: https://github.com/huggingface/datasets/issues/761
Method 1.: You should use the data_files parameter of the datasets.load_dataset function, and provide the path to your local data file. See the documentation:
https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset
datasets.load_dataset
Parameters
...
data_dir (str, optional) – Defining the data_dir of the dataset configuration.
data_files (str or Sequence or Mapping, optional) – Path(s) to source data file(s).
...
Update 1.: You should use something like this:
datasets.load_dataset('imdb', split =['train', 'test'], data_files='huggingface_imdb_data/aclImdb_v1.tar.gz')
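Since the note above reports that data_files is not supported for the imdb script, one possible workaround is to skip load_dataset and read the copied archive directly. The following is only a sketch, not something from the linked documentation. It assumes the Stanford archive stores reviews as aclImdb/<split>/<pos|neg>/<name>.txt and that neg maps to 0 and pos to 1. The names read_imdb_split and load_imdb_offline are hypothetical helpers introduced here for illustration.

```python
import tarfile

# Assumed label ids: neg -> 0, pos -> 1.
LABELS = {"neg": 0, "pos": 1}


def read_imdb_split(archive_path, split):
    """Read one split ('train' or 'test') of aclImdb_v1.tar.gz into columns.

    Hypothetical helper. Assumes member names look like
    aclImdb/<split>/<pos|neg>/<name>.txt. Anything else, such as the
    unlabeled reviews under train/unsup, is skipped.
    """
    texts, labels = [], []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            parts = member.name.split("/")
            if (
                member.isfile()
                and len(parts) == 4
                and parts[0] == "aclImdb"
                and parts[1] == split
                and parts[2] in LABELS
                and parts[3].endswith(".txt")
            ):
                raw = tar.extractfile(member).read()
                texts.append(raw.decode("utf-8"))
                labels.append(LABELS[parts[2]])
    return {"text": texts, "label": labels}


def load_imdb_offline(archive_path):
    """Build in-memory Dataset objects for train and test, with no download."""
    import datasets  # imported lazily so read_imdb_split works without it

    return [
        datasets.Dataset.from_dict(read_imdb_split(archive_path, split))
        for split in ("train", "test")
    ]
```

With the files from the question, load_imdb_offline('huggingface_imdb_data/aclImdb_v1.tar.gz') would return a list of two Dataset objects, similar in shape to what split=['train', 'test'] returns. Rows keep the order of the archive, so shuffle before training if that matters.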
Method 2.:
Or check out this discussion: https://github.com/huggingface/datasets/issues/824#issuecomment-758358089
>here is my way to load a dataset offline, but it requires an online machine
(online machine)
import datasets
data = datasets.load_dataset(...)
data.save_to_disk('./saved_imdb')
>copy the './saved_imdb' dir to the offline machine
(offline machine)
import datasets
data = datasets.load_from_disk('./saved_imdb')
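After copying, it may help to check that the directory arrived intact before calling load_from_disk, because a wrong or incomplete path gives the "neither a dataset directory nor a dataset dict directory" error quoted in the question. The sketch below uses only the standard library. It assumes that save_to_disk writes a dataset_info.json file at the top of a single-dataset directory and a dataset_dict.json file at the top of a multi-split directory; please verify those file names against the installed version. saved_dataset_kind is a hypothetical helper name.

```python
import os


def saved_dataset_kind(path):
    """Guess what datasets.load_from_disk would find at `path`.

    Assumption: save_to_disk writes 'dataset_info.json' at the top of a
    single Dataset directory and 'dataset_dict.json' at the top of a
    DatasetDict directory. Check these names for your installed version.
    """
    if not os.path.isdir(path):
        return "not a directory"
    if os.path.isfile(os.path.join(path, "dataset_info.json")):
        return "dataset"
    if os.path.isfile(os.path.join(path, "dataset_dict.json")):
        return "dataset_dict"
    return "neither"


if __name__ == "__main__":
    # './saved_imdb' is the directory name used in the snippet above.
    print(saved_dataset_kind("./saved_imdb"))
```

A path that points at a file such as aclImdb_v1.tar.gz reports "not a directory", which matches why load_from_disk rejected it in the question: that function only reads directories produced by save_to_disk.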