LinkedIn scraping not getting all data

Question

From the LinkedIn website, for example: https://www.linkedin.com/company/10073529?trk=tyah&trkInfo=clickedVertical%3Acompany%2CclickedEntityId%3A10073529%2Cidx%3A1-1-1%2CtarId%3A1461132316737%2Ctas%3Adastrong%20

I am trying to retrieve

the link associated with data-li-miniprofile-id:

<a class="new-miniprofile-container" href="..." data-li-url="..." data-li-miniprofile-id="...">

It sits several levels deep, nested under a chain of parent elements.

My code so far looks like this:

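import requests
from bs4 import BeautifulSoup

# Company page from the example above (tracking parameters omitted)
url = "https://www.linkedin.com/company/10073529"
# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# Print the href of every anchor found in the returned HTML
for link in soup.find_all("a"):
    print(link.get('href'))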

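Initially I only searched for class="new-miniprofile-container", but that returned an empty array. I think the reason is that when I run soup.prettify() (which returns all of the scraped HTML), the output stops partway through the page and does not contain these elements.

I suspect the problem is related to security blocks put in place by LinkedIn's engineers, but I would like to know whether there is a way to get these URLs, or any other method of obtaining them.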
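One way to narrow this down is to check whether the anchors are present in the fetched markup at all. The following is a minimal sketch, not the asker's code: it uses only the standard library's `html.parser` so it runs without extra dependencies, and `MiniprofileLinkParser`, `find_miniprofile_links`, and `sample_html` are hypothetical names introduced for illustration. The sample markup stands in for `response.text`.

```python
from html.parser import HTMLParser


class MiniprofileLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry data-li-miniprofile-id."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Keep only anchors that have the miniprofile attribute
        if "data-li-miniprofile-id" in attr_map:
            self.links.append(attr_map.get("href"))


def find_miniprofile_links(html):
    """Return the hrefs of all miniprofile anchors found in an HTML string."""
    parser = MiniprofileLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, standing in for response.text
sample_html = """
<div><ul><li>
  <a class="new-miniprofile-container" href="/profile/view?id=1"
     data-li-url="/miniprofile/1" data-li-miniprofile-id="LI-1">Jane</a>
  <a href="/about">About</a>
</li></ul></div>
"""

links = find_miniprofile_links(sample_html)
print(links)
```

If the same function returns an empty list for the real response, the anchors are simply not in the HTML the server returned, and no selector will find them. With BeautifulSoup, the equivalent lookup would be `soup.find_all("a", attrs={"data-li-miniprofile-id": True})`.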

Answer 1

You should use the documented LinkedIn REST API instead. It has endpoints for company profiles, and you can experiment with them in the REST API explorer. There is also a python-linkedin client, which has a Company API section as well.
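As a rough sketch of what a request to such an endpoint might look like, the snippet below only builds the request and does not send it. The base URL, the `companies/{id}:(fields)` field-selector format, and the bearer-token header are assumptions based on the v1-style REST API; the current LinkedIn documentation should be checked before relying on them. `build_company_request` and `YOUR_ACCESS_TOKEN` are hypothetical placeholders.

```python
from urllib.request import Request

# Assumed v1-style base URL; verify against the current LinkedIn documentation
API_ROOT = "https://api.linkedin.com/v1"


def build_company_request(company_id, fields, access_token):
    """Build (but do not send) a request for selected company fields."""
    field_selector = ",".join(fields)
    url = f"{API_ROOT}/companies/{company_id}:({field_selector})?format=json"
    # OAuth 2.0 access token passed as a bearer token (assumed auth scheme)
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


request = build_company_request(10073529, ["id", "name"], "YOUR_ACCESS_TOKEN")
print(request.full_url)
print(request.get_header("Authorization"))
```

Sending the request, for example with `urllib.request.urlopen(request)`, would need a valid access token obtained through LinkedIn's OAuth flow.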

html

python

beautifulsoup

linkedin

web-scraping