Get path of tags using attribute field in XSD

Question

My current task is to extract information (field types, field names, and so on) from an XSD file. My XSD file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSpy v2018 rel. 2 sp1 (x64) (http://www.altova.com) by test (123321) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
    <xs:complexType name="attribute">
        <xs:annotation>
            <xs:documentation>Атрибуты ОГХ</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="owner_id">
                <xs:annotation>
                    <xs:documentation>Данные о балансодержателе</xs:documentation>
                </xs:annotation>
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="legal_person" type="xs:integer">
                            <xs:annotation>
                                <xs:documentation>ID балансодержателя</xs:documentation>
                            </xs:annotation>
                        </xs:element>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="snow_clean_area" type="xs:double">
                <xs:annotation>
                    <xs:documentation>Площадь вывоза снега, кв. м</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:schema>

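As we can see, some fields are nested inside others.

I need to get the names of all elements in the XSD. But if an element is inside another element, I need to write its name as "all_prev_names;cur_name". For the XSD shown above, that would be: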

"owner_id;legal_person"
"snow_clean_area"

With deeper nesting, the name must contain all the previous names.

I wrote this code:

        def recursive(xml, name=None):
            res = xml.find_all('xs:element')

            if res:
                for elem in res:
                    if name:
                        yield from recursive(elem, elem['name'] + ';' + name)
                    else:
                        yield from recursive(elem, elem['name'])
            else:
                if name:
                    yield (name)
                else:
                    yield (xml['name'])

But there is a problem with duplicated paths. The result of the function is:

"owner_id;legal_person"
"legal_person"
"snow_clean_area"

I need to fix this code, or come up with another way to solve the task.

Answer 1

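If you want to handle any XSD, this is going to be a very formidable challenge, because there are so many ways XSD authors can make life difficult for you: type restrictions and extensions, substitution groups, named model groups and attribute groups, xsd:import, xsd:redefine, and so on. On the other hand, if you only had to handle one schema, you would not be doing this at all, so you have to decide how much variation to allow.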

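It is generally much easier to work from a compiled schema that has already been processed by a schema processor than from the source XSD files, because that takes care of much of the variation where the same thing can be written in different ways. For example, a compiled schema might expand a substitution group as if it had been written using xsd:choice.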

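Assuming you are in the Python world, one approach is to use the Saxon schema processor to compile the source schema into an SCM file (SCM = schema component model). The SCM file is still XML, but it is flattened and normalized, which makes it much easier for applications to extract information from it.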

(I don't know whether xmlproc has an API that gives you access to the compiled schema. If it does, that would be another approach.)

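Note that if you are trying to generate paths like owner_id;legal_person, a schema can be recursive and allow unlimited nesting, so this approach could lead you to attempt to generate infinite paths (which will probably fail with a stack overflow). You also need to watch out for wildcards (xs:any).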
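For a single, simple schema like the one in the question, a direct walk over the source XSD may be enough. The following is only a minimal sketch, not a general XSD processor. It assumes the schema uses nothing beyond xs:element, xs:complexType and the xs:sequence / xs:choice / xs:all model groups, it resolves `type` references only to named complex types declared at the top level of the same file (ignoring namespace prefixes), and it keeps a set of types currently being expanded so that a recursive type cuts the path instead of recursing forever. The names `leaf_paths`, `child_elements` and `SAMPLE` are made up for illustration.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
# Wrappers that are looked through when searching for child elements.
CONTAINERS = {XS + "complexType", XS + "sequence", XS + "choice", XS + "all"}


def child_elements(node):
    """Yield the nearest xs:element descendants, without entering them."""
    for child in node:
        if child.tag == XS + "element":
            yield child
        elif child.tag in CONTAINERS:
            yield from child_elements(child)


def leaf_paths(xsd_text, sep=";"):
    """Return sep-joined name paths of the leaf elements of a simple XSD."""
    schema = ET.fromstring(xsd_text)
    # Assumption: only named complex types at the top level of this file.
    named = {ct.get("name"): ct for ct in schema.findall(XS + "complexType")}

    def walk(node, prefix, open_types):
        paths = []
        for el in child_elements(node):
            path = prefix + [el.get("name") or el.get("ref")]
            type_name = (el.get("type") or "").split(":")[-1]
            if type_name in open_types:
                # Recursive type: cut the path here instead of looping.
                paths.append(sep.join(path))
                continue
            if type_name in named:
                nested = walk(named[type_name], path, open_types | {type_name})
            else:
                nested = walk(el, path, open_types)
            # An element with no nested elements is a leaf.
            paths.extend(nested or [sep.join(path)])
        return paths

    return walk(schema, [], frozenset())


SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="attribute">
    <xs:sequence>
      <xs:element name="owner_id">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="legal_person" type="xs:integer"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="snow_clean_area" type="xs:double"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

print(leaf_paths(SAMPLE))  # ['owner_id;legal_person', 'snow_clean_area']
```

Because `child_elements` stops at the first xs:element it meets on each branch, nested elements are reported only under their parent, which avoids the duplicated `legal_person` entry from the question. Substitution groups, xs:import, xs:any and the other features listed above are not handled.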

Answer 2

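Use xml2xpath.sh to generate an XML instance from the XSD and get the XPath expressions: xml2xpath.sh -a -f root -d test.xsd. It requires the xmlbeans package.

The sample provided in the question does not work out of the box, but the example below does. The help text of the xsd2inst utility from the xmlbeans package states: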

Generates a document based on the given Schema file having the given element as root. The tool makes reasonable attempts to create a valid document, but this is not always possible since, for example, there are schemas for which no valid instance document can be produced.

Given this XSD

<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XMLSpy v2018 rel. 2 sp1 (x64) (http://www.altova.com) by test (123321) -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xs:element name="root">
    <xs:complexType>
        <xs:annotation>
            <xs:documentation>Атрибуты ОГХ</xs:documentation>
        </xs:annotation>
        <xs:sequence>
            <xs:element name="owner_id">
                <xs:annotation>
                    <xs:documentation>Данные о балансодержателе</xs:documentation>
                </xs:annotation>
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="legal_person" type="xs:integer">
                            <xs:annotation>
                                <xs:documentation>ID балансодержателя</xs:documentation>
                            </xs:annotation>
                        </xs:element>
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
            <xs:element name="snow_clean_area" type="xs:double">
                <xs:annotation>
                    <xs:documentation>Площадь вывоза снега, кв. м</xs:documentation>
                </xs:annotation>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>
</xs:schema>

the utility will return

xml2xpath.sh -a -f root -d test.xsd 
Creating XML instance starting at element root from test.xsd

xml2xpath: find XPath expressions on /tmp/tmp.FJQYKaDZI0
================================================================================ (2021-10-22 16:39:09 -03)

   -a ; 'abs_path=1'
   -f ; 'tag1=root'
   -d
================================================================================ (2021-10-22 16:39:09 -03)

Namespaces: None
================================================================================ (2021-10-22 16:39:09 -03)

Elements to process (build xpath, add prefix) 4

XPath expressions found: 4 (absolute, unique elements, use -r to override)
================================================================================ (2021-10-22 16:39:09 -03)

/root
/root/owner_id
/root/owner_id/legal_person
/root/snow_clean_area


received EXIT, bye!
================================================================================ (2021-10-22 16:39:09 -03)
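The XPath expressions listed above can be turned into the `owner_id;legal_person` form the question asks for. A minimal Python sketch, assuming the expressions have been captured as plain text lines. `to_semicolon_paths` is a hypothetical helper, and keeping only the leaf paths is my reading of the expected output in the question.

```python
def to_semicolon_paths(xpaths, root="root", sep=";"):
    """Turn absolute XPaths into sep-joined leaf paths without the root step."""
    parts = [p.strip().strip("/").split("/") for p in xpaths if p.strip()]
    # Drop the wrapper element that was added to make the schema instantiable.
    parts = [p[1:] if p[0] == root else p for p in parts]
    parts = [p for p in parts if p]
    # A path is a leaf when no other path extends it.
    return [sep.join(p) for p in parts
            if not any(len(q) > len(p) and q[:len(p)] == p for q in parts)]


xpaths = """
/root
/root/owner_id
/root/owner_id/legal_person
/root/snow_clean_area
""".splitlines()

print(to_semicolon_paths(xpaths))  # ['owner_id;legal_person', 'snow_clean_area']
```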

xmllint and xpath can also be used to get the name and type attributes, but the output needs more parsing

(echo "setrootns"; echo "xpath //xs:element/@*" ; echo "bye") | xmllint --shell test.xsd
/ > setrootns
/ > xpath //xs:element/@*
Object is a Node Set :
Set contains 6 nodes:
1  ATTRIBUTE name
    TEXT
      content=root
2  ATTRIBUTE name
    TEXT
      content=owner_id
3  ATTRIBUTE name
    TEXT
      content=legal_person
4  ATTRIBUTE type
    TEXT
      content=xs:integer
5  ATTRIBUTE name
    TEXT
      content=snow_clean_area
6  ATTRIBUTE type
    TEXT
      content=xs:double
/ > bye

Alternative

(echo "setrootns"; echo "cat //xs:element/@*" ; echo "bye") | xmllint --shell test.xsd
/ > setrootns
/ > cat //xs:element/@*
 -------
 name="root"
 -------
 name="owner_id"
 -------
 name="legal_person"
 -------
 type="xs:integer"
 -------
 name="snow_clean_area"
 -------
 type="xs:double"
/ > bye
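If Python is available, the same attributes can be read without parsing the xmllint shell output. This is a sketch using the standard library's xml.etree.ElementTree. The function name `element_attributes` and the shortened `SAMPLE` schema are illustrative only. Unlike the flat attribute list above, it keeps each element's name and type together.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}


def element_attributes(xsd_text):
    """Return one attribute dict per xs:element, in document order."""
    schema = ET.fromstring(xsd_text)
    return [dict(el.attrib) for el in schema.iterfind(".//xs:element", NS)]


SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="legal_person" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

for attrs in element_attributes(SAMPLE):
    # Elements without a type attribute print None in the second column.
    print(attrs.get("name"), attrs.get("type"))
```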

Answer 3

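I found a solution that works for me. I use ElementTree.iterparse instead of BeautifulSoup. At the start of each element I remember its name and type, and at the end of the tag I remove it from the stack again, so the stack always holds the current path: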

import typing as t
import xml.etree.ElementTree as ET
from io import BytesIO

XS = '{http://www.w3.org/2001/XMLSchema}'

def getXsd(self, typeNumber: int) -> t.List[t.Dict[str, str]]:
    # Method of a class; self.xsds is assumed to hold (type number, XSD text) pairs.
    paths = []
    for xsd in self.xsds:
        if xsd[0] == typeNumber:
            codes = []         # names of the currently open xs:element tags
            type_field = None  # type of the most recently opened typed element
            source = BytesIO(xsd[1].encode("UTF-8"))
            for event, elem in ET.iterparse(source, events=("start", "end")):
                if event == 'start' and elem.tag == XS + 'element':
                    codes.append(elem.attrib['name'])
                    if 'type' in elem.attrib:
                        type_field = elem.attrib['type']
                # Read the text on 'end': iterparse does not guarantee it on 'start'.
                elif event == 'end' and elem.tag == XS + 'documentation':
                    if codes and type_field:
                        code = "".join(str(item).capitalize() for item in reversed(codes))
                        paths.append({'code': code, 'type': type_field, 'name': elem.text})
                        type_field = None
                elif event == 'end' and elem.tag == XS + 'element':
                    codes.pop()
    return paths

The result is:

[{'code': 'Legal_personOwner_id', 'type': 'xs:integer', 'name': 'ID балансодержателя'}, {'code': 'Legal_personCustomer_id', 'type': 'xs:integer', 'name': 'ID заказчика'}, {'code': 'Improvement_object_categoryImprovement_object_category_id', 'type': 'xs:integer', 'name': 'Код категории озеленения'}, {'code': 'Legal_personDepartment_id', 'type': 'xs:integer', 'name': 'ID ведомственного ОИВ'}, {'code': 'Snow_clean_area', 'type': 'xs:double', 'name': 'Площадь вывоза снега, кв. м'}, {'code': 'Reservoir_area', 'type': 'xs:double', 'name': 'Водоемы, кв. м'}]
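The same start/end stack idea can also produce the semicolon-separated paths from the question instead of the concatenated codes. This is a rough sketch under a few assumptions: the schema is passed as bytes, every typed xs:element is treated as a field, and the names `typed_paths` and `SAMPLE` are invented for the example.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

XS_ELEMENT = "{http://www.w3.org/2001/XMLSchema}element"


def typed_paths(xsd_bytes, sep=";"):
    """Return (path, type) pairs for every xs:element that has a type."""
    stack, result = [], []
    events = ET.iterparse(BytesIO(xsd_bytes), events=("start", "end"))
    for event, node in events:
        if node.tag != XS_ELEMENT:
            continue
        if event == "start":
            # Attributes are already available on the 'start' event.
            stack.append(node.get("name") or node.get("ref", ""))
            if "type" in node.attrib:
                result.append((sep.join(stack), node.get("type")))
        else:
            stack.pop()
    return result


SAMPLE = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="attribute">
    <xs:sequence>
      <xs:element name="owner_id">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="legal_person" type="xs:integer"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="snow_clean_area" type="xs:double"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

for path, type_name in typed_paths(SAMPLE):
    print(path, type_name)
```

Because the stack is popped on every closing xs:element, a nested name such as legal_person only ever appears together with its parents.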
