How can I scrape the data when the server returns a response like this?
I am trying to scrape a website using BeautifulSoup and Python requests. The server returns a response with content type text/javascript, and the response body contains this data:
Element.update("students", "<link href=\"https://somelinks.links\" media=\"screen\" rel=\"stylesheet\" type=\"text/css\" />\n\n<table class=\"gray_table_list\" align=\"center\" width=\"100%\" cellpadding=\"0\" cellspacing=\"0\">\n \n <tr class=\"main_head back_ground_color\">\n <td class=\"sl-col\">Sl No.</td>\n <td class=\"set_border_right\"> Name</td>\n <td class=\"set_border_right\">IDNo.</td>\n \n <td class=\"set_border_right\"></td>\n </tr>\n <tr class=\"tr-blank\">\n\n \n </tr>\n \n <tr class=\"row-bodd\">\n <td class=\"set_border_right col-1\">\n 1\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">JOHN DOE </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID12345\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n \n <tr class=\"row-beven\">\n <td class=\"set_border_right col-1\">\n 2\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">Somename here </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID45555\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n \n <tr class=\"row-bodd\">\n <td class=\"set_border_right col-1\">\n 3\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">name here </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID7878\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n\n </tr>\n \n \n </table>\n \n");
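I have edited the server response so that the question is self-contained. How can I scrape the table inside Element.update? I also want to scrape the a tags in the table, that is, extract link/linkprofile/linkID1458556 from them.
Thanks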
You can use the re module to extract the HTML part from this JavaScript function call, then parse it with BeautifulSoup as usual. For example:
import re
from bs4 import BeautifulSoup
s = """
Element.update("students", "<link href=\"https://somelinks.links\" media=\"screen\" rel=\"stylesheet\" type=\"text/css\" />\n\n<table class=\"gray_table_list\" align=\"center\" width=\"100%\" cellpadding=\"0\" cellspacing=\"0\">\n \n <tr class=\"main_head back_ground_color\">\n <td class=\"sl-col\">Sl No.</td>\n <td class=\"set_border_right\"> Name</td>\n <td class=\"set_border_right\">IDNo.</td>\n \n <td class=\"set_border_right\"></td>\n </tr>\n <tr class=\"tr-blank\">\n\n \n </tr>\n \n <tr class=\"row-bodd\">\n <td class=\"set_border_right col-1\">\n 1\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">JOHN DOE </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID12345\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n \n <tr class=\"row-beven\">\n <td class=\"set_border_right col-1\">\n 2\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">Somename here </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID45555\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n \n <tr class=\"row-bodd\">\n <td class=\"set_border_right col-1\">\n 3\n </td>\n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">name here </a>\n </td>\n\n <td class=\"set_border_right col-1\">\n ID7878\n </td>\n\n \n\n <td class=\"set_border_right col-1\">\n <a href=\"/link/linkprofile/linkID1458556\">View profile</a>\n </td>\n </tr>\n\n </tr>\n \n \n </table>\n \n");
"""
html_doc = re.search(r'"students", "(.*?)"\);', s, flags=re.S).group(1)
soup = BeautifulSoup(html_doc, "html.parser")
for tr in soup.select("tr"):
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    print(*tds, sep="\t")
Prints:
Sl No. Name IDNo.
1 JOHN DOE ID12345 View profile
2 Somename here ID45555 View profile
3 name here ID7878 View profile
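One caveat: the example above stores the payload in an ordinary triple-quoted string, so Python itself turns \" and \n into real quotes and newlines. In the text returned by requests those backslashes would most likely still be present literally. A minimal sketch of one way to decode them, assuming the payload only uses escapes that are also valid in JSON strings (the helper name extract_html and the shortened sample are made up for illustration):

```python
import json
import re


def extract_html(js_text):
    """Return the decoded HTML passed as the second argument of Element.update."""
    # Capture the whole double-quoted string literal, quotes included.
    # (?:[^"\\]|\\.)* matches ordinary characters or a backslash escape pair.
    match = re.search(
        r'Element\.update\("students",\s*("(?:[^"\\]|\\.)*")\)',
        js_text,
        flags=re.S,
    )
    if match is None:
        raise ValueError("Element.update call not found")
    # The escapes used here (\" and \n) are valid JSON string escapes.
    # strict=False also tolerates raw control characters such as real newlines.
    return json.loads(match.group(1), strict=False)


# A raw string keeps the backslashes, as the response text would.
raw = r'Element.update("students", "<table>\n<tr><td><a href=\"/link/linkprofile/linkID1458556\">JOHN DOE </a></td></tr>\n</table>\n");'

html_doc = extract_html(raw)
print(html_doc)
```

The returned html_doc can then be passed to BeautifulSoup exactly as in the example above. If the real payload contains JavaScript-only escapes such as \', json.loads will raise an error and a different decoding step would be needed.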
Edit: To get the <a> links:
for tr in soup.select("tr:has(a)"):
    print(tr.a["href"])
Prints:
/link/linkprofile/linkID1458556
/link/linkprofile/linkID1458556
/link/linkprofile/linkID1458556
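If the name, ID, and link are wanted together, the two loops above can be combined. The following is only a sketch: it uses a trimmed copy of the table, assumes the column order shown in the question (serial number, name, ID, profile link), and BASE_URL is a hypothetical placeholder for the site's real address, used to turn the relative href into an absolute URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL; replace it with the site the response came from.
BASE_URL = "https://example.com"

# Trimmed copy of the decoded table, for illustration only.
html_doc = """
<table class="gray_table_list">
  <tr class="main_head back_ground_color">
    <td class="sl-col">Sl No.</td><td> Name</td><td>IDNo.</td><td></td>
  </tr>
  <tr class="row-bodd">
    <td>1</td>
    <td><a href="/link/linkprofile/linkID1458556">JOHN DOE </a></td>
    <td>ID12345</td>
    <td><a href="/link/linkprofile/linkID1458556">View profile</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

students = []
# tr:has(a) skips the header row, which contains no links.
for tr in soup.select("tr:has(a)"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    students.append({
        "name": cells[1],
        "id": cells[2],
        # tr.a is the first <a> in the row; urljoin makes its href absolute.
        "url": urljoin(BASE_URL, tr.a["href"]),
    })

print(students)
```

Indexing cells by position is fragile if the site changes its column layout, so checking the header row first may be worthwhile in real use.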