Python - Extracting data between specific comment nodes with BeautifulSoup 4
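I'm hoping to pick out specific data from a website, such as prices and company info. Luckily, the website designer has left lots of marker tags (HTML comments), such as: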
<!-- Begin Services Table -->
' desired data
<!-- End Services Table -->
What kind of code do I need to have BS4 return the strings between the given tags?
import requests
from bs4 import BeautifulSoup

url = "http://www.100ll.com/searchresults.php?clear_previous=true&searchfor=" + 'KPLN' + "&submit.x=0&submit.y=0"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
text_list = soup.find(id="framediv").find_all(text=True)
start_index = text_list.index(' Begin Fuel Information Table ') + 1
end_index = text_list.index(' End Fuel Information Table ')
for item in text_list[start_index:end_index]:
    print(item)
Here's the website in question:
http://www.100ll.com/showfbo.php?HashID=cf5f18404c062da6fa11e3af41358873
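If you want to select the table element that follows one of those specific comments, you can select all the comment nodes, filter them by the desired text, and then select the next sibling table element: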
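import requests
from bs4 import BeautifulSoup
from bs4 import Comment

# The page linked in the question.
url = "http://www.100ll.com/showfbo.php?HashID=cf5f18404c062da6fa11e3af41358873"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

# Comment nodes are strings, so filter all strings down to Comment instances.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if comment.strip() == 'Begin Services Table':
        table = comment.find_next_sibling('table')
        print(table)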
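To try the comment filter without hitting the live site, here is a minimal sketch that runs against an inline HTML string. The sample markup, the `find_table_after_comment` helper name, and the switch to the built-in `html.parser` are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div id="framediv">
<!-- Begin Services Table -->
<table id="services"><tr><td>Fuel</td><td>Hangar</td></tr></table>
<!-- End Services Table -->
<!-- Begin Fuel Information Table -->
<table id="fuel"><tr><td>100LL</td></tr></table>
<!-- End Fuel Information Table -->
</div>
"""


def find_table_after_comment(soup, marker):
    """Return the first <table> sibling after the comment whose text is `marker`."""
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if comment.strip() == marker:
            return comment.find_next_sibling("table")
    return None  # marker comment not found


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_after_comment(soup, "Begin Services Table")
print(table["id"] if table else "no table found")
```

Note that `find_next_sibling` only looks at nodes that share the comment's parent, so this relies on the table sitting at the same level as the comment.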
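Alternatively, if you want to get all the data between those two comments, you can find the first comment and then iterate over all of its next siblings until you reach the closing comment: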
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

# The page linked in the question.
url = "http://www.100ll.com/showfbo.php?HashID=cf5f18404c062da6fa11e3af41358873"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")

data = []
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment.strip() == 'Begin Services Table':
        # Walk the siblings after the opening comment until the closing one.
        for node in comment.next_siblings:
            if isinstance(node, Comment) and node.strip() == 'End Services Table':
                break
            data.append(node)

print(data)
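The same sibling walk can be wrapped in a reusable function and checked offline. This is only a sketch: the sample markup, the `nodes_between_comments` name, and the final step that reduces the nodes to text are assumptions, not part of the original page or answer.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical markup standing in for the real page.
SAMPLE_HTML = """
<div>
<!-- Begin Services Table -->
<p>Fuel: 100LL</p>
<p>Hangar: yes</p>
<!-- End Services Table -->
<p>Unrelated footer</p>
</div>
"""


def nodes_between_comments(soup, begin, end):
    """Collect the sibling nodes found between two marker comments."""
    start = soup.find(
        string=lambda text: isinstance(text, Comment) and text.strip() == begin
    )
    if start is None:
        return []  # opening comment not present
    nodes = []
    for node in start.next_siblings:
        if isinstance(node, Comment) and node.strip() == end:
            break
        nodes.append(node)
    return nodes


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
nodes = nodes_between_comments(soup, "Begin Services Table", "End Services Table")
# Keep only tags and read their text; whitespace-only strings are skipped.
texts = [node.get_text(strip=True) for node in nodes if isinstance(node, Tag)]
print(texts)
```

The collected list also contains the newline strings between elements, which is why the last step filters for `Tag` instances. If the closing comment is missing or has a different parent than the opening one, the loop simply runs to the last sibling.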