Web scraping using Python

Question

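I am trying to extract data from this website, but it is nearly impossible to scrape because the URL does not change after a search.

I want to search by PUBLISHER IPI '00144443097' and extract all the data shown in class="items-container".

My code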

import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://portal.themlc.com/search'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('section', attrs={'class': 'items-container'})
name = name_box.text
print(name)

Since the URL does not change after the search, this does not give me any values.

After extracting the values, I want to sort them in pandas.

Answer 1

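When the URL does not change, you can use the browser's developer tools to check whether an API is being called. In this case there are two APIs: one returns the writer's basic information and the other returns information about the works. You can parse the JSON responses from there as needed.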
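As a minimal sketch of the "parse the JSON response" step, the snippet below loads a list of records into pandas and sorts it. The sample records and the field names `title` and `iswc` are made up for illustration; the real response may use different keys or nest the list under another key, so inspect it first.

```python
import pandas as pd

# Hypothetical records standing in for the parsed JSON response.
sample_records = [
    {"title": "Song B", "iswc": "T-000000002"},
    {"title": "Song A", "iswc": "T-000000001"},
]


def to_sorted_frame(records, sort_key):
    """Flatten a list of JSON records into a DataFrame sorted by one column."""
    frame = pd.json_normalize(records)
    return frame.sort_values(sort_key).reset_index(drop=True)


works_frame = to_sorted_frame(sample_records, "title")
print(works_frame)
```

`pd.json_normalize` also flattens nested dictionaries into dotted column names, which helps if the real records turn out to be nested.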

Note: this is a POST request, not a GET.
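If you would rather stay with `urllib` as in the question, a JSON POST can be built roughly like this. The helper name `build_json_post` is hypothetical, and whether the API needs headers beyond `Content-Type` is an assumption I have not verified.

```python
import json
import urllib.request


def build_json_post(url, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_json_post(
    "https://api.ptl.themlc.com/api/search/writer?page=1&limit=10",
    {"writerIpi": "00144443097"},
)

# Sending it would look like this (requires network access):
# with urllib.request.urlopen(request) as response:
#     data = json.load(response)
```

Calling `urlopen` on a bare URL, as the question does, issues a GET, which is why the search results never show up.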

import requests

# Basic information about the writer
url = 'https://api.ptl.themlc.com/api/search/writer?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
writer = requests.post(url, json=payload).json()

# Works for that writer
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'writerIpi': "00144443097"}
works = requests.post(url, json=payload).json()

# Publisher search by IPI
url = 'https://api.ptl.themlc.com/api/search/publisher?page=1&limit=10'
payload = {"publisherIpi": "00144443097"}
publisher = requests.post(url, json=payload).json()

# This request returns the 161 works for the publisherIpId you want.
# It is convoluted and you may be able to automate it, but I used
# developer tools to find the right publisherIpId.
url = 'https://api.ptl.themlc.com/api/search/work?page=1&limit=10'
payload = {'publisherIpId': "7305902"}
publisher_works = requests.post(url, json=payload).json()
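Each request above only asks for `page=1` with `limit=10`, so collecting all 161 works means stepping through the pages. Here is a rough sketch of that loop. `fetch_all` and `fetch_work_page` are hypothetical helpers, and I am assuming the JSON body is a plain list of works; if the list is wrapped in an object, pull it out by its key inside `fetch_work_page`.

```python
def fetch_all(fetch_page, limit=10, max_pages=100):
    """Collect records page by page until a short or empty page appears."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, limit)
        records.extend(batch)
        if len(batch) < limit:
            break
    return records


def fetch_work_page(page, limit, publisher_ip_id="7305902"):
    """Hypothetical fetcher; assumes the JSON body is a plain list of works."""
    import requests  # local import keeps fetch_all free of third-party dependencies

    url = f"https://api.ptl.themlc.com/api/search/work?page={page}&limit={limit}"
    return requests.post(url, json={"publisherIpId": publisher_ip_id}).json()


# Usage (needs network access):
# all_works = fetch_all(fetch_work_page)
```

Passing the fetcher in as an argument keeps the paging logic separate from the HTTP call, so the loop can be checked with a fake fetcher before hitting the real API. `max_pages` is just a safety cap.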

Answer 2

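To find the publisherIpId, open one of the works listed under the writer and look for the work endpoint request in developer tools. (The original answer attached a screenshot of this request.)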
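If you want to try automating that lookup, one possible approach is to search a parsed JSON response for every value stored under a given key. This is only a sketch: `find_values` is a hypothetical helper and the sample response shape below is made up, since I have not confirmed where the work endpoint actually puts `publisherIpId`.

```python
def find_values(node, key):
    """Return every value stored under `key` anywhere in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found


# Made-up response shape, for illustration only.
sample_response = {
    "work": {"publishers": [{"publisherIpId": "7305902", "name": "Example"}]}
}
publisher_ids = find_values(sample_response, "publisherIpId")
```

A response may list several publishers, so you would still need to pick the id that matches the publisher you searched for.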

