Download the complete Common Crawl index files

The project below uses this Common Crawl index file:

mmap = BotoMap(s3_anon, src_bucket, '/common-crawl/projects/url-index/url-index.1356128792')

I want to use the complete index files (the April 2015 crawl data) in my own project, which is based on the project above.

Where can I download the whole index?

Here, Tom Morris says:

The index files which are used by the index service are also available for download.

The Common Crawl index files are publicly available at s3://commoncrawl/cc-index/collections/

You can list all available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/

The index files for April 2015 are located at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/

If you want to download the index *.gz files over HTTP, you can fetch them like this:

https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz

The cdx files are numbered from cdx-00000.gz to cdx-00299.gz, so the complete index is contained in 300 files.
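Since the file names follow a fixed numbering pattern, the 300 URLs can be generated rather than typed out. Here is a minimal Python sketch, assuming the HTTP URL pattern shown above still holds. The function names cdx_urls and download_all and the cc-index destination directory are just illustrative choices, not part of any Common Crawl tooling.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the HTTP example above.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/cc-index/collections"


def cdx_urls(crawl_id="CC-MAIN-2015-18", count=300):
    """Build the URLs for cdx-00000.gz through cdx-00299.gz (for count=300)."""
    return [
        f"{BASE_URL}/{crawl_id}/indexes/cdx-{i:05d}.gz"
        for i in range(count)
    ]


def download_all(urls, dest_dir="cc-index"):
    """Download each URL into dest_dir, skipping files that already exist.

    dest_dir is a hypothetical local directory name.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = dest / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))


if __name__ == "__main__":
    urls = cdx_urls()
    print(len(urls), "files, from", urls[0], "to", urls[-1])
    # download_all(urls)  # uncomment to actually fetch the files
```

The download call is left commented out so that running the script only prints the URL range. Skipping files that already exist means an interrupted run can be restarted, though a partially written file would have to be deleted by hand first.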