How can I see summary statistics (e.g. number of samples; type of data) of HuggingFace datasets?

I am looking for suitable datasets to test some new machine learning ideas. Is there a way to see summary statistics (e.g. number of samples; type of data) of HuggingFace datasets?

They provide descriptions here https://huggingface.co/datasets , but filtering them is somewhat difficult.

Not sure whether I am missing something obvious, but I think you have to code it yourself. When you use list_datasets, you only get general information about each dataset:

from datasets import list_datasets
list_datasets(with_details=True)[1].__dict__

Output:

{'id': 'ag_news',
 'key': 'datasets/datasets/ag_news/ag_news.py',
 'lastModified': '2020-09-15T08:26:31.000Z',
 'description': "AG is a collection of more than 1 million news articles. News articles have been\ngathered from more than 2000 news sources by ComeToMyHead in more than 1 year of\nactivity. ComeToMyHead is an academic news search engine which has been running\nsince July, 2004. The dataset is provided by the academic comunity for research\npurposes in data mining (clustering, classification, etc), information retrieval\n(ranking, search, etc), xml, data compression, data streaming, and any other\nnon-commercial activity. For more information, please refer to the link\nhttp://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .\n\nThe AG's news topic classification dataset is constructed by Xiang Zhang\n(xiang.zhang@nyu.edu) from the dataset above. It is used as a text\nclassification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann\nLeCun. Character-level Convolutional Networks for Text Classification. Advances\nin Neural Information Processing Systems 28 (NIPS 2015).",
 'citation': '@inproceedings{Zhang2015CharacterlevelCN,\n  title={Character-level Convolutional Networks for Text Classification},\n  author={Xiang Zhang and Junbo Jake Zhao and Yann LeCun},\n  booktitle={NIPS},\n  year={2015}\n}',
 'size': 3991,
 'etag': '"560ac59ac8cb6f76ac4180562a7f9342"',
 'siblings': [datasets.S3Object('ag_news.py'),
  datasets.S3Object('dataset_infos.json'),
  datasets.S3Object('dummy/0.0.0/dummy_data.zip')],
 'author': None,
 'numModels': 1}
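
Since filtering on the website is difficult, one option is to filter these details objects client-side. Here is a minimal sketch that searches descriptions for a keyword; it assumes each details object exposes the description attribute shown above, and find_datasets is just a hypothetical helper name:

from datasets import list_datasets

# Hypothetical helper: keep datasets whose description mentions a keyword.
# Assumes the details objects carry the 'description' attribute shown above.
def find_datasets(keyword):
    matches = []
    for d in list_datasets(with_details=True):
        description = getattr(d, 'description', None) or ''
        if keyword.lower() in description.lower():
            matches.append(d.id)
    return matches

print(find_datasets('classification'))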

What you are really looking for is the information provided by load_dataset:

from datasets import load_dataset
squad = load_dataset('squad')
squad

Output:

DatasetDict({
    'train': Dataset(features: {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}, num_rows: 87599),
    'validation': Dataset(features: {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}, num_rows: 10570)
})

Here you get the number of samples per split (num_rows) and the data type of each feature. However, load_dataset downloads and loads the whole dataset, which is probably undesirable and should be avoided for performance reasons.
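
If your installed version of datasets is recent enough, load_dataset_builder can fetch just the metadata without downloading the data files; a minimal sketch following the official docs (treat the attribute layout as an assumption for older versions):

from datasets import load_dataset_builder

# Downloads only the dataset script/metadata, not the data itself
builder = load_dataset_builder('squad')
print(builder.info.features)  # feature names and dtypes
print({name: split.num_examples  # number of samples per split
       for name, split in builder.info.splits.items()})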

Unless I have overlooked a parameter that does exactly this, here is an alternative that loads only the dataset_infos.json of each dataset:

import datasets
import requests
from datasets import list_datasets
from datasets.utils.file_utils import REPO_DATASETS_URL

sets = list_datasets()
version = datasets.__version__
name = 'dataset_infos.json'
summary = []

for d in sets:
    print('loading {}'.format(d))
    try:
        # Fetch only the dataset_infos.json of this dataset, not the data itself
        r = requests.get(REPO_DATASETS_URL.format(version=version, path=d, name=name))
        summary.append(r.json())
    except Exception:
        print('Could not load {}'.format(d))

# The 'features' and 'splits' values are probably what you are interested in
print(summary[0]['default']['features'])
print(summary[0]['default']['splits'])

Output:

{'email_body': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'subject_line': {'dtype': 'string', 'id': None, '_type': 'Value'}}
{'test': {'name': 'test', 'num_bytes': 1384177, 'num_examples': 1906, 'dataset_name': 'aeslc'},
 'train': {'name': 'train', 'num_bytes': 11902668, 'num_examples': 14436, 'dataset_name': 'aeslc'},
 'validation': {'name': 'validation', 'num_bytes': 1660730, 'num_examples': 1960, 'dataset_name': 'aeslc'}}
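
To actually filter on these numbers, you could flatten the collected summaries into a table; a minimal sketch assuming pandas is installed and that every entry follows the config/splits JSON structure shown above:

import pandas as pd

rows = []
for info in summary:
    for config_name, config in info.items():
        for split_name, split in config.get('splits', {}).items():
            rows.append({'dataset': split.get('dataset_name'),
                         'config': config_name,
                         'split': split_name,
                         'num_examples': split.get('num_examples'),
                         'features': list(config.get('features', {}).keys())})

df = pd.DataFrame(rows)
# e.g. datasets with more than 100k training samples
print(df[(df['split'] == 'train') & (df['num_examples'] > 100000)])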

P.S.: I did not check the dataset_infos.json of the datasets that could not be loaded. They might have a more complex structure or internal errors.