How to extract specific HTML lines (inside a flex container) using IronPython?

I am using IronPython 2.7.9.0 in Grasshopper and Rhino to scrape data from a specific widget at this link: https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en

The code I am using is as follows:

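import urllib

# URL of the embedded widget (the link given above)
url = 'https://vemcount.app/embed/widget/uOCRuLPangWo5fT?locale=en'

# Python 2 / IronPython 2.7 API: fetch the page and read the raw HTML
web = urllib.urlopen(url)
html = web.read()
web.close()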

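The html output contains all of the HTML code from this link except the part I need. When I inspect that part in Chrome, it has a "flex" button next to it, as shown in the image below.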

image that summarizes the issue I am facing

Nothing under the line with the "flex" button appears in the scraped result; it shows up as blank lines instead.

This is the HTML output I get:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Central Library - Duhig North &amp; Link</title>

    <meta charset="utf-8">
    <meta name="google" content="notranslate">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="csrf-token" content="">
    <link rel="stylesheet" href="/build/app.css?id=2fefc4f9faa59eebcb4b">
    <link rel="stylesheet" href="https://vemcount.app/fonts/hamburg_serial/stylesheet.css">

    <style>
        #embed, #main {
            height: 100vh;
        }

        .vue-grid-item {
            margin-bottom: 0px !important;
        }

        .powered_by {
            position: absolute;
            bottom: 0px;
            right: 0px;
            background-color: rgba(0, 0, 0, 0.18);
            color: #fff;
            padding: 2px 5px;
            font-size: 9px;
        }

        .powered_by:hover, .powered_by:link, .powered_by:visited {
            text-decoration: none;
            display: none;
        }

        .dashboard-widget .relative {
            overflow: hidden !important;
        }

        
    </style>

    <script>
        window.App = {"socketAppKey":"eJSkWUHWpwolvjVcT2ZxUJZXnDpxtRljdZl74fKr","socketCluster":null,"socketHost":"websocket.vemcount.com","socketPort":443,"socketSecurePort":443,"socketDisableStats":true,"socketEncrypted":true,"locale":"en","settings":[{"name":"type","value":"{\"count_in\":\"column\"}"},{"name":"period","value":"[\"yesterday\"]"},{"name":"period_step","value":"hour"},{"name":"hide_datalabel","value":"0"},{"name":"currency","value":"AUD"},{"name":"show_days","value":"[0,1,2,3,4,5,6]"},{"name":"show_months","value":"[1,2,3,4,5,6,7,8,9,10,11,12]"},{"name":"show_hours_from","value":"00:00"},{"name":"show_hours_to","value":"23:45"},{"name":"data_heatmap","value":"blue"},{"name":"weather_metrics","value":"0"},{"name":"first_day_of_week","value":"1"},{"name":"time_format24","value":"time_format24"},{"name":"date_time_format","value":"2"},{"name":"number_grouping","value":","},{"name":"number_decimal","value":"."},{"name":"opening_hours_overlap","value":"0"},{"name":"data_output","value":"count_in"}],"sound":null};
    </script>

    <script src="/build/lang/en.js?v=2022.04.4"></script>

</head>

<body class="bg-transparent">

<main id="main">
    <div id="embed" >
        
    <div class="w-full h-full vue-grid-item cssTransforms" style="position: absolute;">

        
        
        
        
        
        
        
        
                    <live-inside :embedded="true" :widget="{&quot;id&quot;:81438,&quot;pane_id&quot;:4005,&quot;title&quot;:&quot;Central Library - Duhig North &amp; Link&quot;,&quot;description&quot;:&quot;Live occupancy \/ Seating capacity&quot;,&quot;x&quot;:0,&quot;y&quot;:0,&quot;w&quot;:2,&quot;h&quot;:1,&quot;bg_color&quot;:&quot;red&quot;,&quot;text_color&quot;:&quot;black&quot;,&quot;type&quot;:&quot;live-inside&quot;,&quot;secret&quot;:&quot;uOCRuLPangWo5fT&quot;,&quot;internal&quot;:&quot;VRg4JTIRrtJ7Pwg&quot;,&quot;embeddable&quot;:1,&quot;content&quot;:{&quot;target&quot;:1100,&quot;bidirectional&quot;:true,&quot;target_enable&quot;:true,&quot;prettify&quot;:false,&quot;target_type&quot;:&quot;donut&quot;,&quot;target_donut_hide_metric&quot;:false,&quot;target_donut_target_hide_label&quot;:false,&quot;target_visual_inside_text&quot;:null,&quot;target_visual_available_text&quot;:null,&quot;target_screen_ok_title&quot;:null,&quot;target_screen_ok_text&quot;:null,&quot;target_screen_ok_color&quot;:&quot;#38A169&quot;,&quot;target_screen_ok_image&quot;:-1,&quot;target_screen_warning_title&quot;:null,&quot;target_screen_warning_pe</live-inside>
            
            
                
        
        
        
                
    </div>

    </div>
</main>

<a title=" Vemco Group A/S " class="powered_by" target="_blank"
   href="http://vemcount.com">Powered by
    <b>vemcount.com</b>
</a>

<script src="/build/manifest.js?id=7f2e9aa3431c681a4683"></script>
<script src="/build/vendor.js?id=19867aae3b960cda7d79"></script>
<script src="/build/embed.js?id=2ff0173dd78c5c1f99c6"></script>

</body>
</html>

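As you can see, some lines are missing, namely the ones that have a flex button next to them. (By the way, I shortened the code inside so that I would not hit the 30,000-character limit.)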

I am interested in the number 311, which changes every 2 seconds on the live link. It appears in the HTML code as:

<span>311</span>

Is there a way to get this value, and any other values, using IronPython?

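P.S. I am a noob at actual coding, so I may get the terminology wrong, but I have some background in visual scripting. Your help is highly appreciated. Thank you.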

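In case you have the same question or are struggling with dynamic web scraping: you have to use CPython and install a web scraping tool that can drive a real browser, such as Playwright or BeautifulSoup + Selenium.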
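To illustrate why a plain urllib fetch is not enough here, the following is a minimal sketch using only the standard library. It assumes a shortened, hypothetical stand-in for the static HTML shown in the question (the `STATIC_HTML` string and both helper functions are made up for illustration). The static page only carries a `<live-inside>` placeholder whose `:widget` attribute holds HTML-escaped JSON settings; the `<span>` with the live number is not in it, presumably because the page's JavaScript creates it in the browser afterwards.

```python
import json
import re
from html import unescape

# Shortened, hypothetical stand-in for the static HTML that urllib returns.
# The real :widget attribute is much longer (and was truncated in the question).
STATIC_HTML = '''
<div id="embed">
    <live-inside :embedded="true" :widget="{&quot;id&quot;:81438,&quot;title&quot;:&quot;Central Library - Duhig North &amp; Link&quot;,&quot;content&quot;:{&quot;target&quot;:1100}}"></live-inside>
</div>
'''


def extract_widget_config(html_text):
    """Return the settings stored in the :widget attribute as a dict, or None."""
    match = re.search(r'<live-inside[^>]*\s:widget="([^"]*)"', html_text)
    if match is None:
        return None
    # The attribute value is HTML-escaped JSON, so unescape it before parsing.
    return json.loads(unescape(match.group(1)))


def find_spans(html_text):
    """Return the text of every plain <span>...</span> found in the HTML."""
    return re.findall(r'<span>(.*?)</span>', html_text)


config = extract_widget_config(STATIC_HTML)
print(config["title"])           # widget title taken from the static settings
print(find_spans(STATIC_HTML))   # empty list: no <span> exists before rendering
```

The static settings can still be useful (for example the title or the `target` field), but the live count itself only exists once a browser has rendered the page, which is what the Playwright approach takes care of.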

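I used Playwright, which is more straightforward and has a very handy inner_html() function that reads the dynamically rendered flex HTML directly. The code is below for reference.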

#part of the help to write the script I got from 

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # slow_mo delays each Playwright operation by 1000 ms
    browser = p.chromium.launch(slow_mo=1000)

    page = browser.new_page()
    page.goto('https://vemcount.app/embed/widget/uOCRuLPangWo5fT')
    # Select the <span> that holds the live number in the rendered page
    central = page.query_selector("p.w-full span")
    print({'central': central.inner_html()})

    browser.close()


After that, I tried running the .py script from Grasshopper via a batch file and reading its output in Grasshopper through a txt or CSV file.
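The file handoff described above could look roughly like this. It is only a sketch under assumptions: the file name `occupancy.csv`, the column names, and the helpers `write_reading` and `read_latest` are all hypothetical, and the value `"311"` stands in for whatever the Playwright script scraped.

```python
import csv
import os
import tempfile
from datetime import datetime


def write_reading(path, name, value):
    """Append one timestamped reading to a CSV file, adding a header if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["timestamp", "name", "value"])
        writer.writerow([datetime.now().isoformat(), name, value])


def read_latest(path):
    """Return the last row of the CSV file as a dict, or None if it has no data rows."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    return rows[-1] if rows else None


# A temporary folder is used here only so the sketch runs anywhere;
# in practice this would be a fixed path that both sides agree on.
csv_path = os.path.join(tempfile.mkdtemp(), "occupancy.csv")
write_reading(csv_path, "central", "311")  # "311" stands in for the scraped value
latest = read_latest(csv_path)
print(latest["name"], latest["value"])
```

The writing half would run in the CPython script and the reading half in Grasshopper. This sketch is written for Python 3; on the IronPython 2.7 side, `open()` has no `newline` argument, so the file would probably need to be opened in `"rb"` mode instead.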

If there is a better way, I would be glad to hear your suggestions.

Yours,

A Python beginner. :)