newsplease commoncrawl.py 文件中的异常

Question

我正在使用从 https://github.com/fhamborg/news-please 克隆的 newsplease 库。我想使用 newsplease 从 commoncrawl 新闻数据集中获取新闻文章。我是运行 commoncrawl.py 文件作为指示 here。我使用了下面的命令 -

python -m newsplease.examples.commoncrawl

在执行以下命令时出现以下错误 -

my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print  }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
    rec_headers = self.arc_parser.parse(stream, statusline)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
    raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
    main()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
    continue_process=True)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
    for record in ArchiveIterator(stream):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

这是什么错误，我该如何解决。

https://github.com/fhamborg/news-please表示采用中的config部分 newsplease/examples/commoncrawl.py。这是什么意思？
我已经从这个文件中复制配置并粘贴到 config.cfg 存在于 newsplease/config 目录中。这是他们的指示吗？或者我在这里犯了一个错误。

我正在使用 python 3.6。我的机器上只安装了一个 python。

Answer 1

这个错误是因为 newsplease 使用的库。当我们手动安装每个库时会犯错误，同时安装重点放在包的版本上。 setup.py 文件中给出了每个库的版本信息。安装 setup.py 文件中给出的确切版本。现在执行 setup.py.

时可能会出现问题

所以使用这个命令-

python3 setup.py install

如果您需要卸载所有以前版本的已安装软件包，那么运行 -

pip3 freeze --user | xargs pip3 uninstall -y

了解更多执行此操作的方法click here

newsplease commoncrawl.py 文件中的异常

exception in newsplease commoncrawl.py file

python

web-crawler

common-crawl

python-newspaper

newspaper3k