Beautiful Soup - `findAll` not capturing all tags in SVG (`ElementTree` does)

I am attempting to generate a choropleth map by modifying an SVG map depicting all counties in the US. The basic approach is captured by Flowing Data. Since SVG is basically just XML, the approach leverages the BeautifulSoup parser.

The problem is that the parser does not capture all of the path elements in the SVG file. The following captured only 149 paths (out of more than 3,000):

#Open SVG file
svg=open(shp_dir+'USA_Counties_with_FIPS_and_names.svg','r').read()

#Parse SVG
soup = BeautifulSoup(svg, selfClosingTags=['defs','sodipodi:namedview'])

#Identify counties
paths = soup.findAll('path')

len(paths)

However, I know that more exist, both from inspecting the file by hand and from the fact that ElementTree captures 3,143 paths with the following routine:

#Parse SVG
tree = ET.parse(shp_dir+'USA_Counties_with_FIPS_and_names.svg')

#Capture element
root = tree.getroot()

#Compile list of IDs from file
ids=[]
for child in root:
    if 'path' in child.tag:
        ids.append(child.attrib['id'])

len(ids)

I have not yet figured out how to write the result out from the ElementTree objects in a way that isn't completely messed up.

#Define style template string
style='font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;'+\
        'stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;'+\
        'stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

#For each path...
for child in root:
    #...if it is a path....
    if 'path' in child.tag:
        try:
            #...update the style to the new string with a county-specific color...
            child.attrib['style']=style+col_map[child.attrib['id']]
        except:
            #...if it's not a county we have in the ACS, leave it alone
            child.attrib['style']=style+'#d0d0d0'+'\n'

#Write modified SVG to disk
tree.write(shp_dir+'mhv_by_cty.svg')

The modification/write routine above yielded this monstrosity:

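My primary question is: why did BeautifulSoup fail to capture all of the path tags? Second, why would the image modified with the ElementTree objects have all of that extracurricular activity going on? Any advice would be greatly appreciated.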

You need to do the following:

  • Upgrade to beautifulsoup4:

    pip install beautifulsoup4 -U
    
  • Import it as:

    from bs4 import BeautifulSoup
    
  • Install the latest lxml module:

    pip install lxml -U
    
  • Explicitly specify lxml as the parser:

    soup = BeautifulSoup(svg, 'lxml')
    

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> svg = open('USA_Counties_with_FIPS_and_names.svg','r').read()
>>> soup = BeautifulSoup(svg, 'lxml')
>>> paths = soup.findAll('path')
>>> len(paths)
3143
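
The same check can be tried on a small inline document before running it against the full county map. The following is a minimal sketch, assuming beautifulsoup4 and lxml are installed as described above; the inline SVG and its ids are made up for illustration. Note that it swaps in the "xml" parser name (lxml's XML parser) rather than "lxml": an HTML parser may normalise names such as viewBox to lower case, which matters if the soup is later written back out as SVG.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the county map; these ids are illustrative only.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
  <path id="01001" d="M0 0L1 1"/>
  <path id="01003" d="M1 1L2 2"/>
  <path id="State_Lines" d="M0 0L9 9"/>
</svg>"""

# "xml" selects lxml's XML parser, so the document is treated as XML, not HTML.
soup = BeautifulSoup(SVG, "xml")

# find_all is the bs4 spelling; findAll still works as an alias.
paths = soup.find_all("path")
path_ids = [p["id"] for p in paths]

print(len(paths), path_ids)
```

Listing the ids this way also makes it easy to spot path elements that are not counties.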

alexce's answer is correct for your first question. Regarding your second question:

why would the image modified with the ElementTree objects have all of that extracurricular activity going on?

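The answer is pretty simple: not every <path> element draws a county. Specifically, there are two elements, one with id="State_Lines" and one with id="separator", that should be left out of the restyling. You didn't supply your dataset of colors, so I just used a random hex color generator (adapted from here) for each county, then used lxml to parse the .svg's XML and iterate through each <path> element, skipping the ones I mentioned above: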

from lxml import etree as ET
import random

def random_color():
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())

new_style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

tree = ET.parse('USA_Counties_with_FIPS_and_names.svg')
root = tree.getroot()
for child in root:
    if 'path' in child.tag and child.attrib['id'] not in ["separator", "State_Lines"]:
        child.attrib['style'] = new_style + random_color()

tree.write('counties_new.svg')

Which produces this beautiful picture:
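
With real data, the random colors would be replaced by a lookup in the asker's col_map. The following is a minimal sketch of that variant using the standard-library ElementTree on a tiny inline SVG. The inline document, the shortened style string, and the contents of col_map are stand-ins made up for illustration. It also registers the SVG namespace before serializing, since ElementTree otherwise writes generated prefixes such as ns0: on every element.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Serialize SVG as the default namespace instead of generated ns0: prefixes.
# This registration is global to the ElementTree module.
ET.register_namespace("", SVG_NS)

# Tiny stand-in for the county map; ids and geometry are illustrative only.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <path id="01001" d="M0 0L1 1"/>
  <path id="01003" d="M1 1L2 2"/>
  <path id="State_Lines" d="M0 0L9 9"/>
  <path id="separator" d="M0 5L9 5"/>
</svg>"""

# Hypothetical stand-ins for the asker's col_map and style template.
col_map = {"01001": "#1F77B4"}
base_style = "stroke:#FFFFFF;stroke-width:0.1;fill:"
FALLBACK = "#d0d0d0"
NON_COUNTY_IDS = {"State_Lines", "separator"}

root = ET.fromstring(SVG)
# iter() walks all descendants; the original loop visits direct children only.
for path in root.iter("{%s}path" % SVG_NS):
    path_id = path.get("id")
    if path_id in NON_COUNTY_IDS:
        continue  # leave the state borders and the separator untouched
    # Counties missing from col_map get the gray fallback fill.
    path.set("style", base_style + col_map.get(path_id, FALLBACK))

styles = {p.get("id"): p.get("style") for p in root.iter("{%s}path" % SVG_NS)}
output = ET.tostring(root, encoding="unicode")
print(output)
```

Using dict.get with a fallback covers only the missing-county case, so unrelated errors are not silently swallowed the way a bare except would swallow them.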