Get field IDs from a Google Form, Python BeautifulSoup

Question

For example, given a Google Form like this one, how would you build a list of its 'field IDs'?

var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0

The above is the relevant part of the HTML.

Currently I have this code:

    import requests
    from bs4 import BeautifulSoup as bs

    # url and proxies are defined elsewhere
    a = requests.get(url, proxies=proxies)
    soup = bs(a.text, 'html.parser')
    fields = soup.find_all('script', {'type': 'text/javascript'})
    form_info = fields[1]
    print(form_info)

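But this returns a lot of irrelevant data, and unless I add many str.replace() and str.split() calls, I can't see a simple way to do this. That would also be too messy.

I don't have to use BeautifulSoup, although it seems like the obvious approach.

For the example above, I need a list like this: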

[1089277187, 742914399, 2011436433, 638818998, 1952962866, 916445513, 848461347]

Answer 1

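Beautiful Soup is meant for querying HTML tags, so the way to extract data from a JavaScript variable is to use a regular expression. You can match on [[. However, this will also return 831400739, the ID of the first question itself. It can be excluded manually after the regex by skipping the first item.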

import re

script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0'''

match = re.findall(r'(?<=\[\[)(\d+)', script)
# (?<=...) is a lookbehind: the pattern inside must precede the match but is not included in the results.
# \[\[ matches two literal square brackets. The backslash tells regex to treat [ as a character, not as syntax.
# (\d+) matches one or more digits and returns them in the results.

results = match[1:]  # Skip the first item, which is 831400739
print(results)

This will output:

['1089277187', '742914399', '2011436433', '638818998', '1952962866', '916445513', '848461347']

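You may want to convert the results to integers. Also, to make the code more robust, you may want to remove spaces and newlines before calling the regex function, for example: formatted = script.replace(" ", "").replace('\n', '').replace('\r', '')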
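Putting both suggestions together, a minimal sketch might look like the following. The function name extract_field_ids and the shortened sample string are only illustrative, and the sketch assumes, as in the example above, that the first [[ match is always the first question's own ID rather than an entry ID.

```python
import re


def extract_field_ids(script_text):
    """Return the IDs that follow '[[' as integers, skipping the first match."""
    # Strip spaces and line breaks so '[[' is still found if the brackets are split up.
    formatted = script_text.replace(" ", "").replace("\n", "").replace("\r", "")
    matches = re.findall(r"(?<=\[\[)(\d+)", formatted)
    # Assumption: the first match is the first question's ID (831400739 in the sample).
    return [int(m) for m in matches[1:]]


# Shortened sample taken from the script text in the question.
sample = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]'''

print(extract_field_ids(sample))
```

With the real page, script_text would be the text of whichever script tag contains FB_PUBLIC_LOAD_DATA_. If the page layout differs from the sample, the "skip the first match" rule may need adjusting.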

