Exclude a tag (<pattern>) inside a tag (<topic>) on a result set using BeautifulSoup
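I've just started web scraping with Python, and I'm currently using BeautifulSoup for data extraction. I have this .aiml file (XML) from which I want to extract the text of every pattern tag that is NOT inside a topic tag.
I already get all the pattern values, but the challenge is that patterns with a topic parent tag should not be included in the result set.
Here is the AIML file: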
<?xml version = "1.0" encoding = "UTF-8"?>
<aiml version="1.0.1" encoding="UTF-8">
<topic name="botdog">
<category>
<pattern>MY DOG'S NAME IS *</pattern>
<template>
That is interesting that you have a dog named <set name="dog"><star/></set>
</template>
</category>
<category>
<pattern>WHAT IS MY DOG'S NAME</pattern>
<template>
Your dog's name is <get name="dog"/>.
</template>
</category>
</topic>
<topic name="botcat">
<category>
<pattern>MY CAT'S NAME IS *</pattern>
<template>
That is interesting that you have a cat named <set name="cat"><star/></set>
</template>
</category>
<category>
<pattern>WHAT IS MY CAT'S NAME</pattern>
<template>
Your cat's name is <get name="cat"/>.
</template>
</category>
</topic>
<category>
<pattern>HELLO ALICE</pattern>
<template>
Hello User
</template>
</category>
<category>
<pattern>HOW ARE YOU</pattern>
<template>
I'm fine
</template>
</category>
</aiml>
Python code (Flask):
@extract.route('/')
def index_page():
    folder = 'templates/topic.aiml'
    with open(folder, 'r') as myfile:
        soup = BeautifulSoup(myfile.read(), 'html.parser')
    data_topic = [match.pattern.text for match in soup.find_all('category')]
    print(data_topic)
    # data = " ".join(data_set)
    return jsonify({'data_set': data_topic})
The value printed by print() is:
["MY DOG'S NAME IS *", "WHAT IS MY DOG'S NAME", "MY CAT'S NAME IS *", "WHAT IS MY CAT'S NAME", 'HELLO ALICE', 'HOW ARE YOU']
It should be this instead, since these patterns have no topic parent tag:
['HELLO ALICE', 'HOW ARE YOU']
Try this:
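from bs4 import BeautifulSoup
from flask import jsonify

# `extract` is assumed to be the Flask app or Blueprint from the question.
@extract.route('/')
def index_page():
    folder = 'templates/topic.aiml'
    with open(folder, 'r') as myfile:
        soup = BeautifulSoup(myfile.read(), 'html.parser')
    data = []
    for cat in soup.find_all('category'):
        # Skip categories that sit directly inside a <topic>.
        if cat.parent.name == "topic":
            continue
        data.append(cat.find("pattern").text)
    print(data)
    return jsonify({'data_set': data})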
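The check above only looks at the direct parent, which is enough for the file in the question. If a category could ever sit deeper inside a topic, a possible variant is to test for any topic ancestor with find_parent. Here is a minimal standalone sketch of that idea, without Flask. The AIML string is a trimmed stand-in for the real file, and the function name patterns_outside_topic is made up for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for templates/topic.aiml (assumption: same shape as the file in the question).
AIML = """
<aiml version="1.0.1" encoding="UTF-8">
<topic name="botdog">
<category>
<pattern>WHAT IS MY DOG'S NAME</pattern>
<template>Your dog's name is <get name="dog"/>.</template>
</category>
</topic>
<category>
<pattern>HELLO ALICE</pattern>
<template>Hello User</template>
</category>
<category>
<pattern>HOW ARE YOU</pattern>
<template>I'm fine</template>
</category>
</aiml>
"""


def patterns_outside_topic(markup):
    """Return the text of every <pattern> whose <category> has no <topic> ancestor."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        cat.find("pattern").text
        for cat in soup.find_all("category")
        # find_parent walks up through all ancestors, not just the direct parent.
        if cat.find_parent("topic") is None
    ]


print(patterns_outside_topic(AIML))  # ['HELLO ALICE', 'HOW ARE YOU']
```

In the Flask view you would pass myfile.read() to the function and hand the returned list to jsonify as before.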
Hope this helps! Check out the docs for more examples.