How to collect contact information from websites?

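Does anyone know of a web crawler tool for collecting contact details from websites? Say I have a page like www.website/contact.. and I want to extract the address, phone number, and so on. I have been looking at two tools: crawler4j, an open-source Java library (jar), and Scrapy, an open-source Python framework. But I find them a bit difficult to use for my scenario.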

Any suggestions would be great. Thanks.

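You could google "simple web crawler" to find the solution that suits you best. There are plenty of web crawlers on the net written in "pure Python". Take one as skeleton code and add a DB wrapper on top of it. I think the biggest issue is setting up the database and saving the data.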
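To make the "skeleton plus DB wrapper" idea concrete, here is a minimal sketch using only the Python standard library. The names `extract_contacts` and `save_contacts`, the `contacts` table layout, and both regular expressions are my own assumptions for illustration; the patterns are deliberately loose and will need tuning for the phone formats you actually encounter. Postal addresses are much harder to match with a regex, so they are left out here.

```python
import re
import sqlite3
from html.parser import HTMLParser

# Hypothetical patterns: loose on purpose, tune them for your data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def extract_contacts(html):
    """Return the e-mail addresses and phone-like strings found in html."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted({match.strip() for match in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


def save_contacts(conn, url, contacts):
    """Store one row per contact value in a simple three-column table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (url TEXT, kind TEXT, value TEXT)"
    )
    rows = [(url, "email", e) for e in contacts["emails"]]
    rows += [(url, "phone", p) for p in contacts["phones"]]
    conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", rows)
    conn.commit()
```

You would call `extract_contacts` on each downloaded page and pass the result to `save_contacts` with a connection from `sqlite3.connect("contacts.db")` (the file name is just an example).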

What if there are millions of websites to crawl? Is there a way to crawl all the websites in my area?

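Writing a script for that is no problem. Just put the millions of addresses into one (or several) files and open them for reading in Python or another scripting language. Then take the links one by one and crawl/scrape to your heart's content. You may also want to save the results to a file (CSV, JSON).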
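A rough sketch of that loop, again with only the standard library. The function names, the one-URL-per-line file format, and the two-column CSV layout are assumptions on my part, and `scrape` is a placeholder for whatever extraction function you write yourself.

```python
import csv
import http.client
import urllib.request


def read_urls(path):
    """Yield one URL per non-empty line, so a huge file is never fully loaded."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if url and not url.startswith("#"):
                yield url


def fetch(url, timeout=10):
    """Download a page as text; return None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        return None


def crawl_to_csv(url_file, out_file, scrape, fetch_page=fetch):
    """Fetch every URL in url_file and write (url, value) rows to out_file.

    scrape is a caller-supplied function: html text -> iterable of strings.
    """
    with open(out_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "value"])
        for url in read_urls(url_file):
            html = fetch_page(url)
            if html is None:
                continue  # skip pages that could not be downloaded
            for value in scrape(html):
                writer.writerow([url, value])
```

Because `read_urls` is a generator, the address file is processed line by line, which matters once you really have millions of entries. The `fetch_page` parameter is there so you can swap in a different downloader without touching the loop.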

web-crawler

google-crawlers

scrapy

web-scraping

crawler4j