Beautiful Soup parser limits

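I am trying to scrape the links of the 400 models listed on this site: https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1, which I refer to as webpage in the code below. However, when I run my code, I get no links.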

import requests
from bs4 import BeautifulSoup

webpage = "https://www.printables.com/model?category=14&fileType=fff&includeUserGcodes=1"
User_agent = {'User-agent': 'Mozilla/5.0 (X11; CrOS i686 4319.74.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36'}

r = requests.get(webpage, headers = User_agent).text
soup = BeautifulSoup(r,'html5lib')

for link in soup.find_all('a'):
    print(link['href'])

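So I checked whether the links were available with print(soup.prettify()), and none of the links I need appear in the HTML there either. That led me to assume the site does not allow scraping, but r.status_code returns 200, which I took to mean that I can scrape it.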

Is there another approach I can take? Where else might these links be stored? Thanks.

The data is loaded via JavaScript from an external URL, so BeautifulSoup doesn't see it. To get information about all the items, you can use the following example:

import json
import requests


url = "https://www.printables.com/graphql/"

payload = {
    "operationName": "PrintList",
    "query": "query PrintList($limit: Int!, $cursor: String, $categoryId: ID, $materialIds: [Int], $userId: ID, $printerIds: [Int], $licenses: [ID], $ordering: String, $hasModel: Boolean, $filesType: [FilterPrintFilesTypeEnum], $includeUserGcodes: Boolean, $nozzleDiameters: [Float], $weight: IntervalObject, $printDuration: IntervalObject, $publishedDateLimitDays: Int, $featured: Boolean, $featuredNow: Boolean, $usedMaterial: IntervalObject, $hasMake: Boolean, $competitionAwarded: Boolean, $onlyFollowing: Boolean, $collectedByMe: Boolean, $madeByMe: Boolean, $likedByMe: Boolean) {\n  morePrints(\n    limit: $limit\n    cursor: $cursor\n    categoryId: $categoryId\n    materialIds: $materialIds\n    printerIds: $printerIds\n    licenses: $licenses\n    userId: $userId\n    ordering: $ordering\n    hasModel: $hasModel\n    filesType: $filesType\n    nozzleDiameters: $nozzleDiameters\n    includeUserGcodes: $includeUserGcodes\n    weight: $weight\n    printDuration: $printDuration\n    publishedDateLimitDays: $publishedDateLimitDays\n    featured: $featured\n    featuredNow: $featuredNow\n    usedMaterial: $usedMaterial\n    hasMake: $hasMake\n    onlyFollowing: $onlyFollowing\n    competitionAwarded: $competitionAwarded\n    collectedByMe: $collectedByMe\n    madeByMe: $madeByMe\n    liked: $likedByMe\n  ) {\n    cursor\n    items {\n      ...PrintListFragment\n      printer {\n        id\n        __typename\n      }\n      user {\n        rating\n        __typename\n      }\n      __typename\n    }\n    __typename\n  }\n}\n\nfragment PrintListFragment on PrintType {\n  id\n  name\n  slug\n  ratingAvg\n  ratingCount\n  likesCount\n  liked\n  datePublished\n  dateFeatured\n  firstPublish\n  downloadCount\n  displayCount\n  inMyCollections\n  foundInUserGcodes\n  userGcodeCount\n  userGcodesCount\n  materials {\n    id\n    __typename\n  }\n  category {\n    id\n    path {\n      id\n      name\n      __typename\n    }\n    __typename\n  }\n  modified\n  images {\n    ...ImageSimpleFragment\n    __typename\n  }\n  filesType\n  hasModel\n  user {\n    ...AvatarUserFragment\n    __typename\n  }\n  ...LatestCompetitionResult\n  __typename\n}\n\nfragment AvatarUserFragment on UserType {\n  id\n  publicUsername\n  avatarFilePath\n  slug\n  badgesProfileLevel {\n    profileLevel\n    __typename\n  }\n  __typename\n}\n\nfragment LatestCompetitionResult on PrintType {\n  latestCompetitionResult {\n    placement\n    competitionId\n    __typename\n  }\n  __typename\n}\n\nfragment ImageSimpleFragment on PrintImageType {\n  id\n  filePath\n  rotation\n  __typename\n}\n",
    "variables": {
        "categoryId": "14",
        "collectedByMe": False,
        "competitionAwarded": False,
        "cursor": "",
        "featured": False,
        "filesType": ["GCODE"],
        "hasMake": False,
        "includeUserGcodes": True,
        "likedByMe": False,
        "limit": 36,
        "madeByMe": False,
        "materialIds": None,
        "nozzleDiameters": None,
        "ordering": "-first_publish",
        "printDuration": None,
        "printerIds": None,
        "publishedDateLimitDays": None,
        "weight": None,
    },
}

cnt = 0
while True:
    data = requests.post(url, json=payload).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["data"]["morePrints"]["items"]:
        cnt += 1
        print(
            cnt,
            i["name"],
            "https://www.printables.com/model/{}-{}".format(i["id"], i["slug"]),
        )

    if not data["data"]["morePrints"]["cursor"]:
        break

    payload["variables"]["cursor"] = data["data"]["morePrints"]["cursor"]

Prints:

1 White Spiral Vase https://www.printables.com/model/189114-white-spiral-vase
2 Calibrating Before Battle - 3DPN Mr. Print-It - Superhero Remix https://www.printables.com/model/188733-calibrating-before-battle-3dpn-mr-print-it-superhe
3 twitter 3d bird https://www.printables.com/model/187083-twitter-3d-bird
4 Welcome To Rapture plaque https://www.printables.com/model/186669-welcome-to-rapture-plaque

...
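
If you would rather collect the links than print them, the same cursor loop can be wrapped in a generator. This is only a minimal sketch: iter_model_links, fetch_page, max_pages and the fake_pages data are hypothetical names made up for illustration, and it assumes the response keeps the data / morePrints / items / cursor shape used above. The fake pages stand in for the real endpoint so the loop can be tried without any network access.

```python
from typing import Callable, Dict, Iterator, Optional, Tuple


def iter_model_links(
    fetch_page: Callable[[str], Dict],
    max_pages: Optional[int] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, url) pairs, following the cursor until it comes back empty.

    fetch_page is a hypothetical callable: it takes a cursor string and
    returns the decoded JSON response for that page.
    """
    cursor = ""
    pages = 0
    while max_pages is None or pages < max_pages:
        more = fetch_page(cursor)["data"]["morePrints"]
        for item in more["items"]:
            url = "https://www.printables.com/model/{}-{}".format(
                item["id"], item["slug"]
            )
            yield item["name"], url
        pages += 1
        cursor = more["cursor"]
        if not cursor:  # an empty or missing cursor means this was the last page
            break


# Made-up responses keyed by cursor, standing in for the real endpoint.
fake_pages = {
    "": {"data": {"morePrints": {"cursor": "abc", "items": [
        {"id": "1", "name": "A", "slug": "a"}]}}},
    "abc": {"data": {"morePrints": {"cursor": None, "items": [
        {"id": "2", "name": "B", "slug": "b"}]}}},
}

links = list(iter_model_links(fake_pages.__getitem__))
print(links)
```

Against the real site, fetch_page could be a small function that sets payload["variables"]["cursor"] to the cursor it receives and returns requests.post(url, json=payload).json(), exactly as in the loop above. The optional max_pages cap is there so you can stop early while testing.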