How to extract specific HTML lines (inside a flex container) using IronPython?
I am using IronPython 2.7.9.0 in Grasshopper and Rhino to scrape data from a specific widget at this link: https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en
The code I am using is below:
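import urllib
import os
# Widget URL from the link above
url = "https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en"
web = urllib.urlopen(url)  # Python 2 / IronPython 2.7 API
html = web.read()
web.close()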
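The html output contains all the HTML from this link except the part I need. When I inspect the page in Chrome, the element I need has a "flex" badge next to it, as shown in the image below.
[Image: screenshot that summarizes the issue I am facing]
Nothing below the line with the "flex" badge shows up in the scraped result; it appears as blank lines instead.
This is the HTML output I get: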
<!DOCTYPE html>
<html lang="en">
<head>
<title>Central Library - Duhig North & Link</title>
<meta charset="utf-8">
<meta name="google" content="notranslate">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="csrf-token" content="">
<link rel="stylesheet" href="/build/app.css?id=2fefc4f9faa59eebcb4b">
<link rel="stylesheet" href="https://vemcount.app/fonts/hamburg_serial/stylesheet.css">
<style>
#embed, #main {
height: 100vh;
}
.vue-grid-item {
margin-bottom: 0px !important;
}
.powered_by {
position: absolute;
bottom: 0px;
right: 0px;
background-color: rgba(0, 0, 0, 0.18);
color: #fff;
padding: 2px 5px;
font-size: 9px;
}
.powered_by:hover, .powered_by:link, .powered_by:visited {
text-decoration: none;
display: none;
}
.dashboard-widget .relative {
overflow: hidden !important;
}
</style>
<script>
window.App = {"socketAppKey":"eJSkWUHWpwolvjVcT2ZxUJZXnDpxtRljdZl74fKr","socketCluster":null,"socketHost":"websocket.vemcount.com","socketPort":443,"socketSecurePort":443,"socketDisableStats":true,"socketEncrypted":true,"locale":"en","settings":[{"name":"type","value":"{\"count_in\":\"column\"}"},{"name":"period","value":"[\"yesterday\"]"},{"name":"period_step","value":"hour"},{"name":"hide_datalabel","value":"0"},{"name":"currency","value":"AUD"},{"name":"show_days","value":"[0,1,2,3,4,5,6]"},{"name":"show_months","value":"[1,2,3,4,5,6,7,8,9,10,11,12]"},{"name":"show_hours_from","value":"00:00"},{"name":"show_hours_to","value":"23:45"},{"name":"data_heatmap","value":"blue"},{"name":"weather_metrics","value":"0"},{"name":"first_day_of_week","value":"1"},{"name":"time_format24","value":"time_format24"},{"name":"date_time_format","value":"2"},{"name":"number_grouping","value":","},{"name":"number_decimal","value":"."},{"name":"opening_hours_overlap","value":"0"},{"name":"data_output","value":"count_in"}],"sound":null};
</script>
<script src="/build/lang/en.js?v=2022.04.4"></script>
</head>
<body class="bg-transparent">
<main id="main">
<div id="embed" >
<div class="w-full h-full vue-grid-item cssTransforms" style="position: absolute;">
<live-inside :embedded="true" :widget="{"id":81438,"pane_id":4005,"title":"Central Library - Duhig North & Link","description":"Live occupancy \/ Seating capacity","x":0,"y":0,"w":2,"h":1,"bg_color":"red","text_color":"black","type":"live-inside","secret":"uOCRuLPangWo5fT","internal":"VRg4JTIRrtJ7Pwg","embeddable":1,"content":{"target":1100,"bidirectional":true,"target_enable":true,"prettify":false,"target_type":"donut","target_donut_hide_metric":false,"target_donut_target_hide_label":false,"target_visual_inside_text":null,"target_visual_available_text":null,"target_screen_ok_title":null,"target_screen_ok_text":null,"target_screen_ok_color":"#38A169","target_screen_ok_image":-1,"target_screen_warning_title":null,"target_screen_warning_pe</live-inside>
</div>
</div>
</main>
<a title=" Vemco Group A/S " class="powered_by" target="_blank"
href="http://vemcount.com">Powered by
<b>vemcount.com</b>
</a>
<script src="/build/manifest.js?id=7f2e9aa3431c681a4683"></script>
<script src="/build/vendor.js?id=19867aae3b960cda7d79"></script>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>
</body>
</html>
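As you can see, the lines that have the flex badge next to them are missing. (By the way, I shortened the code in there so that I would not hit the 30,000-character limit.)
I am interested in the number 311, which changes every 2 seconds on the live link and can be found in the HTML code as: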
<span>311</span>
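Is there a way to get this value, and any other values, using IronPython?
P.S. I am a noob at actual coding, which is why I may get the terminology wrong, but I have some background in visual scripting. Your help is much appreciated. Thank you.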
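In case you have the same question or are struggling with dynamic web scraping: the missing markup is rendered in the browser by JavaScript after the page loads, so a plain HTTP fetch such as urllib.urlopen never sees it. You have to use CPython and install a tool that drives a real browser, such as Playwright or BeautifulSoup (BS) + Selenium.
I used Playwright, which is more straightforward and has a very handy inner_html() function that can read the dynamically rendered flex HTML directly. The code is below for reference.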
# part of the help to write the script I got from
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # slow_mo delays every Playwright operation by 1000 ms
    browser = p.chromium.launch(slow_mo=1000)
    page = browser.new_page()
    page.goto('https://vemcount.app/embed/widget/uOCRuLPangWo5fT')
    # the <span> inside <p class="w-full"> holds the live count
    central = page.query_selector("p.w-full span")
    print({'central': central.inner_html()})
    browser.close()
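One thing to watch for: query_selector returns None if the element has not been rendered yet. A possible variation, sketched here under the assumption that the count still lives in "p.w-full span", waits for the element explicitly and converts its text to a number. The names read_count and main, the 10-second timeout, and the stripping of "," as a thousands separator are my own assumptions and not part of Playwright or the widget.

```python
def read_count(page, selector="p.w-full span", timeout_ms=10000):
    """Wait until `selector` is rendered, then return its text as an int.

    `page` is expected to be a Playwright sync Page. The default selector
    and timeout are assumptions, not values documented by the widget.
    """
    element = page.wait_for_selector(selector, timeout=timeout_ms)
    text = element.inner_text().strip()
    # Assumption: large counts may be grouped with commas, e.g. "1,100".
    return int(text.replace(",", ""))


def main():
    # Imported here so read_count can be exercised without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://vemcount.app/embed/widget/uOCRuLPangWo5fT")
        print({"central": read_count(page)})
        browser.close()
```

Calling main() from the script performs one reading. If the selector never appears, wait_for_selector raises a timeout error instead of handing back None, which is easier to diagnose than an AttributeError on inner_html().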
After that, I tried running the .py script remotely from Grasshopper via a batch file and reading its output in Grasshopper through a txt or CSV file.
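The file handoff could look roughly like the sketch below: the CPython script appends one "timestamp,value" row per reading, and the reader picks up the last row. The helper names append_reading and read_latest and the two-column layout are assumptions made for illustration. read_latest uses only plain file operations, so it should be easy to adapt for a GhPython component.

```python
import csv
from datetime import datetime


def append_reading(path, value, when=None):
    """Append one 'timestamp,value' row to the CSV file at `path`."""
    when = when or datetime.now()
    # newline="" lets the csv module control line endings (Python 3).
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerow([when.isoformat(timespec="seconds"), value])


def read_latest(path):
    """Return (timestamp_string, int_value) from the last non-empty row.

    Returns None if the file exists but holds no rows yet.
    """
    with open(path, "r") as handle:
        rows = [line.strip() for line in handle if line.strip()]
    if not rows:
        return None
    timestamp, value = rows[-1].split(",")
    return timestamp, int(value)
```

Appending rather than overwriting keeps a history of readings, while the Grasshopper side only ever needs the last line.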
If there is a better way, I would be happy to hear your suggestions.
Yours,
A Python beginner. :)