Finding a JavaScript variable containing a certain string with BeautifulSoup

I have a tricky task: I need to find some HTML inside a JavaScript variable and traverse it.

The variable looks like this:

<script>
var someVar = new something.Something({
    content: 'This text has to be found<br /><table></table>',
    size: 230
});
....
</script>

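I don't know the name of the JS variable, so I have to locate it by the snippet/string This text has to be found. Once I have confirmed that the string really sits inside a JS variable, I want to extract the value <br /><table></table> and traverse it.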

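In this case, one option is to use slimit, a JavaScript parser. The idea is to find all script tags, iterate over them, parse the code, walk the syntax tree, and check whether the right-hand side of each assignment node contains the text you are looking for: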

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """
<script>
var someVar = new something.Something({
    content: 'This text has to be found<br /><table></table>',
    size: 230
});
</script>
"""
text_to_find = 'This text has to be found'

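# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')

for script in soup.find_all('script'):
    parser = Parser()
    tree = parser.parse(script.text)
    # Walk every node of the syntax tree and keep only assignment nodes.
    for node in nodevisitor.visit(tree):
        if isinstance(node, ast.Assign):
            # Not every node type has a `value` attribute, so default to ''.
            value = getattr(node.right, 'value', '')
            if text_to_find in value:
                print(value)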

This prints 'This text has to be found<br /><table></table>', including the quotes of the string literal.
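
To get from that printed value to the HTML you want to traverse, the quotes and the known text still have to be removed. Here is a minimal sketch of that step using only the standard library. The names html_after_marker and TagCollector are made up for this illustration, and it assumes the literal contains no JavaScript escape sequences:

```python
from html.parser import HTMLParser


def html_after_marker(js_literal, marker):
    """Return the part of a quoted JS string literal that follows `marker`.

    Hypothetical helper: assumes the literal has no escape sequences.
    """
    # The value found above still carries its surrounding quotes; drop them.
    if (len(js_literal) >= 2 and js_literal[0] == js_literal[-1]
            and js_literal[0] in "'\""):
        js_literal = js_literal[1:-1]
    # partition() splits at the first occurrence of the marker.
    _, found, rest = js_literal.partition(marker)
    return rest if found else ''


class TagCollector(HTMLParser):
    """Record the name of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


value = "'This text has to be found<br /><table></table>'"
fragment = html_after_marker(value, 'This text has to be found')

collector = TagCollector()
collector.feed(fragment)
collector.close()

print(fragment)
print(collector.tags)
```

The fragment string can just as well be passed to BeautifulSoup(fragment, 'html.parser') if you prefer to traverse it with find_all and friends; the standard-library parser is used here only to keep the sketch free of extra dependencies.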

I'm not sure it fits your needs exactly, but I hope it is at least a start.
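
If installing a JavaScript parser is not an option, a rougher alternative is to scan the script text for quoted string literals with a regular expression. This is not the slimit approach described above, only a sketch of a different technique: the pattern and the find_literals name are my own, and it can be fooled by quotes inside comments or regex literals.

```python
import re

# Matches a single- or double-quoted JS string literal on one line,
# allowing backslash escapes. Group 2 is the text between the quotes.
STRING_LITERAL = re.compile(r"""(['"])((?:\\.|(?!\1)[^\\\n])*)\1""")


def find_literals(js_source, text_to_find):
    """Return the bodies of all string literals that contain `text_to_find`."""
    return [
        match.group(2)
        for match in STRING_LITERAL.finditer(js_source)
        if text_to_find in match.group(2)
    ]


# In practice js_source would be the text of each <script> tag.
js_source = """
var someVar = new something.Something({
    content: 'This text has to be found<br /><table></table>',
    size: 230
});
"""

matches = find_literals(js_source, 'This text has to be found')
print(matches)
```

Unlike the slimit version, the returned strings do not include the surrounding quotes, and the code has no idea whether a literal belongs to an assignment, so you lose the check that the text really is a variable value.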

See also:

  • JavaScript parser in Python
  • Extracting text from script tag using BeautifulSoup in Python