Beautiful Soup parser limits
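I am trying to scrape the links of the 400 models listed on this site: https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1, which I refer to as webpage in the code below. However, when I run my code, I get no links.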
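import requests
from bs4 import BeautifulSoup

webpage = "https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1"
User_agent = {'User-agent': 'Mozilla/5.0 (X11; CrOS i686 4319.74.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}
r = requests.get(webpage, headers=User_agent)  # keep the response so r.status_code is available
soup = BeautifulSoup(r.text, 'html5lib')
for link in soup.find_all('a', href=True):  # skip <a> tags that have no href
    print(link['href'])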
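So I checked whether the links were present with: print(soup.prettify())
and none of the links I need appear in that HTML either. This made me assume the site does not allow scraping, but r.status_code
returns 200, which I took to mean that I can scrape.
Is there another approach I can take? Where else might these links be stored? Thanks.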
The data is loaded via JavaScript from an external URL, so BeautifulSoup doesn't see it. To get information about all the items, you can use this example:
import json
import requests
url = "https://www.printables.com/graphql/"
payload = {
    "operationName": "PrintList",
    "query": "query PrintList($limit: Int!, $cursor: String, $categoryId: ID, $materialIds: [Int], $userId: ID, $printerIds: [Int], $licenses: [ID], $ordering: String, $hasModel: Boolean, $filesType: [FilterPrintFilesTypeEnum], $includeUserGcodes: Boolean, $nozzleDiameters: [Float], $weight: IntervalObject, $printDuration: IntervalObject, $publishedDateLimitDays: Int, $featured: Boolean, $featuredNow: Boolean, $usedMaterial: IntervalObject, $hasMake: Boolean, $competitionAwarded: Boolean, $onlyFollowing: Boolean, $collectedByMe: Boolean, $madeByMe: Boolean, $likedByMe: Boolean) {\n morePrints(\n limit: $limit\n cursor: $cursor\n categoryId: $categoryId\n materialIds: $materialIds\n printerIds: $printerIds\n licenses: $licenses\n userId: $userId\n ordering: $ordering\n hasModel: $hasModel\n filesType: $filesType\n nozzleDiameters: $nozzleDiameters\n includeUserGcodes: $includeUserGcodes\n weight: $weight\n printDuration: $printDuration\n publishedDateLimitDays: $publishedDateLimitDays\n featured: $featured\n featuredNow: $featuredNow\n usedMaterial: $usedMaterial\n hasMake: $hasMake\n onlyFollowing: $onlyFollowing\n competitionAwarded: $competitionAwarded\n collectedByMe: $collectedByMe\n madeByMe: $madeByMe\n liked: $likedByMe\n ) {\n cursor\n items {\n ...PrintListFragment\n printer {\n id\n __typename\n }\n user {\n rating\n __typename\n }\n __typename\n }\n __typename\n }\n}\n\nfragment PrintListFragment on PrintType {\n id\n name\n slug\n ratingAvg\n ratingCount\n likesCount\n liked\n datePublished\n dateFeatured\n firstPublish\n downloadCount\n displayCount\n inMyCollections\n foundInUserGcodes\n userGcodeCount\n userGcodesCount\n materials {\n id\n __typename\n }\n category {\n id\n path {\n id\n name\n __typename\n }\n __typename\n }\n modified\n images {\n ...ImageSimpleFragment\n __typename\n }\n filesType\n hasModel\n user {\n ...AvatarUserFragment\n __typename\n }\n ...LatestCompetitionResult\n __typename\n}\n\nfragment AvatarUserFragment on UserType {\n id\n publicUsername\n avatarFilePath\n slug\n badgesProfileLevel {\n profileLevel\n __typename\n }\n __typename\n}\n\nfragment LatestCompetitionResult on PrintType {\n latestCompetitionResult {\n placement\n competitionId\n __typename\n }\n __typename\n}\n\nfragment ImageSimpleFragment on PrintImageType {\n id\n filePath\n rotation\n __typename\n}\n",
    "variables": {
        "categoryId": "14",
        "collectedByMe": False,
        "competitionAwarded": False,
        "cursor": "",
        "featured": False,
        "filesType": ["GCODE"],
        "hasMake": False,
        "includeUserGcodes": True,
        "likedByMe": False,
        "limit": 36,
        "madeByMe": False,
        "materialIds": None,
        "nozzleDiameters": None,
        "ordering": "-first_publish",
        "printDuration": None,
        "printerIds": None,
        "publishedDateLimitDays": None,
        "weight": None,
    },
}
cnt = 0
while True:
    data = requests.post(url, json=payload).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["data"]["morePrints"]["items"]:
        cnt += 1
        print(
            cnt,
            i["name"],
            "https://www.printables.com/model/{}-{}".format(i["id"], i["slug"]),
        )

    # an empty cursor means there are no more pages
    if not data["data"]["morePrints"]["cursor"]:
        break

    payload["variables"]["cursor"] = data["data"]["morePrints"]["cursor"]
Prints:
1 White Spiral Vase https://www.printables.com/model/189114-white-spiral-vase
2 Calibrating Before Battle - 3DPN Mr. Print-It - Superhero Remix https://www.printables.com/model/188733-calibrating-before-battle-3dpn-mr-print-it-superhe
3 twitter 3d bird https://www.printables.com/model/187083-twitter-3d-bird
4 Welcome To Rapture plaque https://www.printables.com/model/186669-welcome-to-rapture-plaque
...
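If you want the links in a list rather than printed, and capped at a fixed number (the question mentions 400 models), the cursor loop can be wrapped in a small generator. This is only a sketch: `iter_models`, `model_url`, `fetch_page` and `FAKE_PAGES` are hypothetical names introduced here, and the response shape (`data` / `morePrints` / `items` and `cursor`) is assumed to be the one the loop above reads. `fetch_page` takes a cursor and returns the parsed JSON, so the paging logic can be tried on canned data without touching the network.

```python
from itertools import islice


def model_url(item):
    """Build the model page URL the same way the loop above does."""
    return "https://www.printables.com/model/{}-{}".format(item["id"], item["slug"])


def iter_models(fetch_page, start_cursor=""):
    """Yield items page by page until an empty cursor is returned.

    fetch_page(cursor) is a hypothetical callable that returns the parsed
    JSON response for the given cursor.
    """
    cursor = start_cursor
    while True:
        more = fetch_page(cursor)["data"]["morePrints"]
        yield from more["items"]
        cursor = more["cursor"]
        if not cursor:
            break


# Canned two-page response standing in for the real endpoint (made-up data).
FAKE_PAGES = {
    "": {"data": {"morePrints": {"cursor": "p2", "items": [
        {"id": "1", "slug": "a", "name": "A"},
        {"id": "2", "slug": "b", "name": "B"},
    ]}}},
    "p2": {"data": {"morePrints": {"cursor": None, "items": [
        {"id": "3", "slug": "c", "name": "C"},
    ]}}},
}

# islice stops the generator early, so no page beyond the cap is requested.
links = [model_url(i) for i in islice(iter_models(FAKE_PAGES.__getitem__), 400)]
print(links)
```

Against the real endpoint, `fetch_page` could be a function that sets `payload["variables"]["cursor"]` to the given cursor and returns `requests.post(url, json=payload).json()`, reusing `url` and `payload` from the example above.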