Java API to query CommonCrawl to populate a Digital Object Identifier (DOI) database

Question

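I am attempting to build a database of the Digital Object Identifiers (DOIs) found on the Internet.

Searching the CommonCrawl index server manually has given me some promising results.

However, I would like to develop a programmatic solution.

Ideally my process would only need to read the index files and not the underlying WARC data files.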

The manual steps I wish to automate are:

1). For each currently available CommonCrawl index collection:

2). I search using "Search a url in this collection: (Wildcards -- Prefix: http://example.com/* Domain: *.example.com)", e.g. link.springer.com/*

3). This returns almost 6MB of JSON data containing approximately 22K unique DOIs.
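The manual steps above could be sketched in Java roughly as follows. This is only a sketch under assumptions: the base URL `https://index.commoncrawl.org/`, the `<collection>-index?url=...&output=json` query form, and the class and method names (`IndexQueryBuilder`, `buildQuery`, `fetchLines`) are illustrative and do not come from the question, so check them against the current index server documentation before relying on them.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class IndexQueryBuilder {
    // Assumed base URL of the public index server (not stated in the question).
    static final String INDEX_SERVER = "https://index.commoncrawl.org/";

    // Builds the query URI for one collection (e.g. "CC-MAIN-2016-26")
    // and one URL pattern (e.g. "link.springer.com/*").
    static URI buildQuery(String collectionId, String urlPattern) {
        String encoded = URLEncoder.encode(urlPattern, StandardCharsets.UTF_8);
        return URI.create(INDEX_SERVER + collectionId + "-index?url=" + encoded + "&output=json");
    }

    // Step 1: one query per available collection.
    static List<URI> buildQueries(List<String> collectionIds, String urlPattern) {
        List<URI> queries = new ArrayList<>();
        for (String id : collectionIds) {
            queries.add(buildQuery(id, urlPattern));
        }
        return queries;
    }

    // Steps 2 and 3: run one query and return the response, one JSON record per line.
    static List<String> fetchLines(URI query) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(query).GET().build();
        HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        try (Stream<String> lines = response.body()) {
            return lines.collect(Collectors.toList());
        }
    }
}
```

Calling `IndexQueryBuilder.fetchLines(IndexQueryBuilder.buildQuery("CC-MAIN-2016-26", "link.springer.com/*"))` would then correspond to one manual search. Large result sets may be paginated by the server, which this sketch does not handle.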

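How can I browse all available CommonCrawl indexes instead of searching for a specific URL?

From reading the CommonCrawl API documentation, I cannot see how to browse all the indexes to extract all DOIs for all domains.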
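Whichever way the index records are obtained, pulling DOIs out of them can be done with a regular expression over each line. The sketch below is a heuristic, not an official DOI grammar: the pattern, the class name `DoiExtractor`, and the decision to lowercase matches (DOIs are case-insensitive) are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiExtractor {
    // Heuristic pattern: "10." + 4 to 9 digit registrant code + "/" + suffix,
    // where the suffix ends at whitespace, a quote, or a URL query/fragment delimiter.
    private static final Pattern DOI =
            Pattern.compile("10\\.\\d{4,9}/[^\\s\"'<>?#&]+");

    // Collects the unique DOIs found in the given index lines, in first-seen order.
    static Set<String> extractDois(Iterable<String> indexLines) {
        Set<String> dois = new LinkedHashSet<>();
        for (String line : indexLines) {
            Matcher m = DOI.matcher(line);
            while (m.find()) {
                // Lowercase so the same DOI in different casing is counted once.
                dois.add(m.group().toLowerCase(Locale.ROOT));
            }
        }
        return dois;
    }
}
```

URLs that append further path segments after the DOI (for example a trailing `/fulltext.html`) would be captured together with the DOI, so real data is likely to need an extra trimming step.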

Update

I found this example Java code: https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java

It shows how to access a Common Crawl dataset.

However, when I run it I receive this exception:

"main" org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>common-crawl/crawl-data/CC-MAIN-2016-26/segments/1466783399106.96/warc/CC-MAIN-20160624154959-00160-ip-10-164-35-72.ec2.internal.warc.gz</Key><RequestId>1FEFC14E80D871DE</RequestId><HostId>yfmhUAwkdNeGpYPWZHakSyb5rdtrlSMjuT5tVW/Pfu440jvufLuuTBPC25vIPDr4Cd5x4ruSCHQ=</HostId></Error>

In fact, every file I try to read results in the same error. Why is that?

What are the correct Common Crawl URIs for their datasets?

Answer 1

The dataset location changed more than a year ago, see the announcement. However, many examples and libraries still contain the old pointers. You can access the index files for all crawls back to 2013 at s3://commoncrawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/cdx-00xxx.gz: replace YYYY-WW with the year and week of the crawl, and expand xxx to 000-299 to get all 300 index parts. New crawl data is announced on the Common Crawl group, or read more about how to access the data.
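As a small illustration of the path pattern described above, the following sketch expands one crawl identifier into its 300 index part keys. The class and method names (`CdxIndexPaths`, `indexKeys`) are hypothetical; only the bucket name and key layout are taken from this answer.

```java
import java.util.ArrayList;
import java.util.List;

class CdxIndexPaths {
    // Bucket name as given in the answer.
    static final String BUCKET = "commoncrawl";
    // The answer states that each crawl has 300 index parts (000-299).
    static final int PARTS_PER_CRAWL = 300;

    // Returns the S3 keys of all index parts for a crawl such as "2016-26".
    static List<String> indexKeys(String yearWeek) {
        List<String> keys = new ArrayList<>(PARTS_PER_CRAWL);
        for (int part = 0; part < PARTS_PER_CRAWL; part++) {
            keys.add(String.format(
                    "cc-index/collections/CC-MAIN-%s/indexes/cdx-%05d.gz", yearWeek, part));
        }
        return keys;
    }
}
```

Each key can then be fetched from the `commoncrawl` bucket with whatever S3 client is already in use, and the gzipped contents read line by line.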

Answer 2

To get the example code working, replace lines 24 and 25 with:

String fn = "crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
S3Object f = s3s.getObject("commoncrawl", fn, null, null, null, null, null, null);

Also note that the commoncrawl group has an updated example.


web-scraping

common-crawl