Xpath with scrapy shell
I'm trying to scrape the startup articles on https://techmoran.com/category/startups/ with scrapy shell, using the following XPath command:
>>> n=response.xpath('//article[contains(@class, "jeg_post jeg_pl_md_box")]/div/div/a/@href').getall()
>>> len(n)
30
The command returns only 30 articles instead of 2880. I also tried
//article[contains(@class, "jeg_post jeg_pl_md_box")]/div/div/a/@href
in Chrome's inspector. How do I get the rest of the articles?
The page uses infinite scroll, loading 30 articles per page. Find the URL and build the request:
scrapy shell
In [1]: url = 'https://techmoran.com/?ajax-request=jnews'
In [2]: page = 1
In [3]: data = {
"lang": "en_US",
"action": "jnews_module_ajax_jnews_block_34",
"module": "true",
"data[filter]": "0",
"data[filter_type]": "all",
"data[current_page]": f"{page}",
"data[attribute][header_icon]": "",
"data[attribute][first_title]": "",
"data[attribute][second_title]": "",
"data[attribute][url]": "",
"data[attribute][header_type]": "heading_6",
"data[attribute][header_background]": "",
"data[attribute][header_secondary_background]": "",
"data[attribute][header_text_color]": "",
"data[attribute][header_line_color]": "",
"data[attribute][header_accent_color]": "",
"data[attribute][header_filter_category]": "",
"data[attribute][header_filter_author]": "",
"data[attribute][header_filter_tag]": "",
"data[attribute][header_filter_text]": "All",
"data[attribute][post_type]": "post",
"data[attribute][content_type]": "all",
"data[attribute][number_post]": "30",
"data[attribute][post_offset]": "7",
"data[attribute][unique_content]": "disable",
"data[attribute][include_post]": "",
"data[attribute][exclude_post]": "",
"data[attribute][include_category]": "5",
"data[attribute][exclude_category]": "",
"data[attribute][include_author]": "",
"data[attribute][include_tag]": "",
"data[attribute][exclude_tag]": "",
"data[attribute][gallerycat]": "",
"data[attribute][sort_by]": "latest",
"data[attribute][date_format]": "default",
"data[attribute][date_format_custom]": "Y/m/d",
"data[attribute][pagination_mode]": "scrollload",
"data[attribute][pagination_nextprev_showtext]": "",
"data[attribute][pagination_number_post]": "30",
"data[attribute][pagination_scroll_limit]": "0",
"data[attribute][el_id]": "",
"data[attribute][el_class]": "",
"data[attribute][scheme]": "",
"data[attribute][column_width]": "auto",
"data[attribute][title_color]": "",
"data[attribute][accent_color]": "",
"data[attribute][alt_color]": "",
"data[attribute][excerpt_color]": "",
"data[attribute][css]": "",
"data[attribute][excerpt_length]": "20",
"data[attribute][paged]": "1",
"data[attribute][pagination_align]": "center",
"data[attribute][pagination_navtext]": "false",
"data[attribute][pagination_pageinfo]": "false",
"data[attribute][boxed]": "false",
"data[attribute][boxed_shadow]": "false",
"data[attribute][box_shadow]": "false",
"data[attribute][push_archive]": "true",
"data[attribute][ads_type]": "googleads",
"data[attribute][ads_position]": "5",
"data[attribute][ads_random]": "false",
"data[attribute][ads_image]": "",
"data[attribute][ads_image_link]": "",
"data[attribute][ads_image_alt]": "",
"data[attribute][ads_image_new_tab]": "true",
"data[attribute][google_publisher_id]": "<script+data-ad-client=\"ca-pub-4793005812421081\"+async+src=\"https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>",
"data[attribute][google_slot_id]": "<script+async+src=\"https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js\"></script>+<ins+class=\"adsbygoogle\"++++++style=\"display:block\"++++++data-ad-format=\"fluid\"++++++data-ad-layout-key=\"-cs+6-1e-q8+12p\"++++++data-ad-client=\"ca-pub-4793005812421081\"++++++data-ad-slot=\"8266744001\"></ins>+<script>++++++(adsbygoogle+=+window.adsbygoogle+||+[]).push({});+</script>",
"data[attribute][google_desktop]": "auto",
"data[attribute][google_tab]": "auto",
"data[attribute][google_phone]": "auto",
"data[attribute][code]": "",
"data[attribute][ads_class]": "inline_module",
"data[attribute][column_class]": "jeg_col_3o3",
"data[attribute][class]": "jnews_block_34"
}
In [4]: req = scrapy.FormRequest(url=url, formdata=data)
In [5]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <POST https://techmoran.com/?ajax-request=jnews> (referer: None)
######## The response is JSON; the article markup is in response.json()['content'] — scrape your articles from it. ########
In [6]: page += 1
........
........
........