Write a script to scrape a site and output its data as JSON?
I've run into a problem. I want to get JSON for all of the company listing data on this site.
Each link endpoint contains company-specific data, such as the company name, description, zip code, state, and address.
My initial plan is:
- Scrape the site's company listing into a list
- Then probably use requests.get again to scrape each individual endpoint
I've tried a few approaches so far; here's my most recent attempt:
import requests
from bs4 import BeautifulSoup

base_url = "http://data-interview.enigmalabs.org/companies/"
r = requests.get(base_url)
soup = BeautifulSoup(r.content, 'html.parser')

# Collect the href of every anchor on the listing page.
# Note: list.append() returns None, so printing its return value shows nothing useful.
link_list = []
for link in soup.find_all("a"):
    href = link.get("href")
    if href:
        link_list.append(href)
print(link_list)
I can't figure out how to extract all the data I need from the individual pages.
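One way to handle that last step: each company page appears to render its fields as a two-column table, so you can fetch the page and read the label/value cells into a dict. Here is a minimal sketch for a single page (the two-column-table layout is an assumption, and the path companies/acme-corp is a made-up placeholder; substitute an href collected from the listing):

import requests
from bs4 import BeautifulSoup

# Hypothetical example path -- replace with a real href gathered from the listing page.
url = "http://data-interview.enigmalabs.org/companies/acme-corp"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Assuming the details sit in a two-column <table>: the first <td> in each
# row is the field label (e.g. "Company Name"), the second is the value.
company = {}
for row in soup.find('table').find_all('tr'):
    label, value = row.find_all('td')
    company[label.text.strip()] = value.text.strip()

print(company)

A complete version applying the same idea across all ten listing pages (updated for Python 3):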
import json
from os.path import basename as bn
from urllib.parse import urljoin

import bs4
import requests

links = []  # relative paths to the individual company pages
data = {}   # company name -> company data
base = 'http://data-interview.enigmalabs.org/'

def bs(path):
    # Fetch the page at the given relative path and return its <table> as a BeautifulSoup tag.
    html = requests.get(urljoin(base, path)).content
    return bs4.BeautifulSoup(html, 'html.parser').find('table')

# Collect the company links from each of the ten listing pages.
for i in range(1, 11):
    print('Collecting page %d' % i)
    links += [a['href'] for a in bs('companies?page=%d' % i).find_all('a')]

# Visit each company page and read its two-column details table into a dict.
for link in links:
    print('Processing %s' % link)
    name = bn(link)
    data[name] = {}
    for row in bs(link).find_all('tr'):
        desc, cont = row.find_all('td')
        data[name][desc.text.strip()] = cont.text.strip()

print(json.dumps(data))
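If you want the result in a file rather than on stdout, a small follow-up (assuming the data dict built above) could be:

# Write the collected data to disk as pretty-printed JSON.
with open('companies.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)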