Django: Parse HTML (containing form) to dictionary

Question

I create an HTML form on the server side.

<form action="." method="POST">
 <input type="text" name="foo" value="bar">
 <textarea name="area">long text</textarea>
 <select name="your-choice">
  <option value="a" selected>A</option>
  <option value="b">B</option>
 </select>
</form>

Desired result:

{
 "foo": "bar",
 "area": "long text",
 "your-choice": "a",
}

The method I am looking for (parse_form()) could be used like this:

response = client.get('/foo/')

# response contains <form> ...</form>

data = parse_form(response.content)

data['my-input']='bar'

response = client.post('/foo/', data)

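How can parse_form() be implemented in Python?

This is not specific to Django. Nevertheless, there was a feature request for it in Django, which was rejected several years ago: https://code.djangoproject.com/ticket/11797

Update

I wrote a small Python library around the lxml-based answer: html_form_to_dict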

Answer 1

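First, consider using response.context instead of response.content. As documented, it gives you the template arguments that were used to render response.content. If you passed them to the renderer as arguments, the form attributes you need (names and values) are probably in there.

If you have to use response.content, then I don't think Django offers a way to parse the HTML response. You can use an HTML parser such as BeautifulSoup, or regular expressions.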
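
If adding a third-party parser is not an option, the same idea can be sketched with the standard library's html.parser module. This is a minimal sketch, not a complete implementation: the class name FirstFormParser is made up for illustration, bytes are assumed to be UTF-8, only the first form is read, and checkboxes, radio buttons, and options without a value attribute are not treated specially.

```python
from html.parser import HTMLParser


class FirstFormParser(HTMLParser):
    """Collect name/value pairs from the first <form> in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.data = {}
        self.seen_form = False  # True once the first <form> has been opened
        self.in_form = False    # True while inside that first <form>
        self._textarea = None   # name of the <textarea> being read, if any
        self._select = None     # name of the <select> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.seen_form:
            self.seen_form = self.in_form = True
            return
        if not self.in_form:
            return
        name = attrs.get("name")
        if tag == "input" and name:
            self.data[name] = attrs.get("value") or ""
        elif tag == "textarea" and name:
            self._textarea = name
            self.data[name] = ""
        elif tag == "select" and name:
            self._select = name
            # Stays None if no option is marked selected
            # (a browser would typically submit the first option).
            self.data.setdefault(name, None)
        elif tag == "option" and self._select and "selected" in attrs:
            self.data[self._select] = attrs.get("value")

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._textarea is not None:
            self.data[self._textarea] += data

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False
        elif tag == "textarea":
            self._textarea = None
        elif tag == "select":
            self._select = None


def parse_form(html):
    """Return the fields of the first form in `html` as a dict."""
    if isinstance(html, bytes):
        html = html.decode("utf-8")  # assumption: the response is UTF-8
    parser = FirstFormParser()
    parser.feed(html)
    parser.close()
    if not parser.seen_form:
        raise ValueError("no <form> found in the document")
    return parser.data
```

Called as parse_form(response.content) on the form from the question, this should yield a dict with the keys foo, area and your-choice. Fields without a name attribute are skipped, and a missing form raises ValueError instead of failing later with an obscure error.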

Answer 2

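This has nothing to do with Django; it is purely about HTML parsing. The standard tool is the BeautifulSoup (bs4) library.

It parses arbitrary HTML and is often used in web scrapers (including my own). This question covers parsing HTML forms: "Python beautiful soup form input parsing". Pretty much everything you need has been answered somewhere on here :)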

from bs4 import BeautifulSoup

def selected_option(select):
    option = select.find("option", selected=True)
    if option: 
        return option['value']

# tag name => how to extract its value
tags = {
    "input": lambda t: t.get('value', ''),  # '' if the input has no value attribute
    "textarea": lambda t: t.text,
    "select": selected_option
}


def parse_form(html):
    soup = BeautifulSoup(html, 'html.parser')
    form = soup.find("form")
    return {
        e['name']: tags[e.name](e)
        for e in form.find_all(tags.keys())
    }

For your input, this gives the following output:

{
    "foo": "bar",
    "area": "long text",
    "your-choice": "a"
}

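For production you will want to add plenty of error checking: form not found, inputs without a name, and so on. How much depends on what exactly you need.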

Answer 3

from collections import UserDict

class FormData(UserDict):
    def __init__(self, *args, **kwargs):
        self.frozen = False
        super().__init__(*args, **kwargs)
        self.frozen = True
        
    def __setitem__(self, key, value):
        if self.frozen and key not in self:
            raise ValueError('Key %s is not in the dict. Available: %s' % (
                key, self.keys()
            ))
        super().__setitem__(key, value)

def parse_form(content):
    """
    Parse the first form in the html in content.
    """
    
    import lxml.html
    tree = lxml.html.fromstring(content)
    return FormData(tree.forms[0].fields)

Usage example:

def test_foo_form(user_client):
    url = reverse('foo')
    response = user_client.get(url)
    assert response.status_code == 200
    data = parse_form(response.content)
    response = user_client.post(url, data)
    assert response.status_code == 302

The code above is incomplete. Please don't copy and paste it; use the library instead: https://github.com/guettli/html_form_to_dict
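
The FormData wrapper appears designed to catch misspelled field names: once the dict has been built from the parsed form, assigning to a key that the form does not contain raises an error instead of silently posting an extra field. That reading is inferred from the code and is not stated in the answer. In the sketch below, a plain dict with the field names from the question stands in for tree.forms[0].fields, and the error message lists the keys with sorted(self) for readability.

```python
from collections import UserDict


class FormData(UserDict):
    """Dict that accepts new keys only while it is being constructed."""

    def __init__(self, *args, **kwargs):
        self.frozen = False
        super().__init__(*args, **kwargs)
        self.frozen = True

    def __setitem__(self, key, value):
        if self.frozen and key not in self:
            raise ValueError('Key %s is not in the dict. Available: %s' % (
                key, sorted(self)
            ))
        super().__setitem__(key, value)


# Stand-in for the fields of a parsed form (an assumption for this demo).
data = FormData({'foo': 'bar', 'area': 'long text'})

data['foo'] = 'changed'       # fine: the form has a field named "foo"

try:
    data['my-input'] = 'bar'  # the form has no field with this name
except ValueError as exc:
    print(exc)
```

A plain dict would accept the second assignment, and the test would then post a field that the real form never sends.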

Answer 4

Just for fun, I tried to replicate the solution proposed by guettli using BeautifulSoup.

This is what I came up with:

from bs4 import BeautifulSoup


def parse_form(content):
    data = {}
    html = BeautifulSoup(content, features="lxml")
    form = html.find('form', recursive=True)
    fields = form.find_all(('input', 'select', 'textarea'))
    for field in fields:
        name = field.get('name')
        if name:
            if field.name == 'input':
                value = field.get('value')
            elif field.name == 'select':
                try:
                    value = field.find_all('option', selected=True)[0].get('value')
                except IndexError:
                    # no option is marked as selected
                    value = None
            elif field.name == 'textarea':
                value = field.text
            else:
                # checkbox ? radiobutton ? file ? 
                continue
            data[name] = value
    return data

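Is this a better result?

Honestly, I don't think so. On the other hand, if you happen to be using BS to parse the response content for other purposes anyway, this could be an option.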
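
The comment about checkboxes and radio buttons in the code above points at a real gap: those controls are input elements too, so every one of them is recorded whether or not it is checked, and for a radio group the last button wins. A hedged sketch of one way to decide whether an input belongs in the result follows. The helper name input_value and the SKIPPED_TYPES set are made up for illustration, and the rules are a simplification of what browsers do.

```python
# Types this sketch leaves out: buttons are only submitted when they
# trigger the submission, and file inputs need an actual upload.
SKIPPED_TYPES = {"submit", "reset", "button", "image", "file"}


def input_value(attrs):
    """Return (include, value) for an <input>, given its attribute dict.

    `attrs` maps attribute names to values. Boolean attributes such as
    `checked` or `disabled` only need to be present as keys.
    """
    input_type = (attrs.get("type") or "text").lower()
    if "disabled" in attrs or input_type in SKIPPED_TYPES:
        return False, None
    value = attrs.get("value")
    if input_type in ("checkbox", "radio"):
        if "checked" not in attrs:
            return False, None  # unchecked boxes are not submitted
        # Browsers send "on" for a checked box without a value attribute.
        return True, "on" if value is None else value
    return True, "" if value is None else value
```

In the loop above, the input branch could then call include, value = input_value(field.attrs) and continue when include is false, since BeautifulSoup exposes a tag's attributes as the dict field.attrs.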

Answer 5

Why not just this?

def parse_form(content):
    import lxml.html
    tree = lxml.html.fromstring(content)
    return dict(tree.forms[0].fields)

I can't guess the reason for using UserDict.

One small caveat: I noticed that when the form contains
