Extract tag info from HTML text

Question

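I am trying to scrape a web page and got the following text. How can I extract the src value from the string below? Can anyone explain the general process for extracting any key-value data from text like this?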

<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>

I also need the text inside the textarea tag.

  <textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>

Answer 1

Since you mentioned beautifulsoup in the tags, I assume you want to use it to parse your HTML content.

import bs4

content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
"""

soup = bs4.BeautifulSoup(content, 'lxml')

img = soup.find('img') # locate img tag
text_area = soup.find('textarea') # locate textarea tag

print(img['id'])  # print value of 'id' attribute in img tag
print(img['src'])  # print value of 'src' attribute
print(text_area.text)  # print the text content of the textarea tag
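
For the more general part of the question (extracting any key-value data), one possible approach is to read each tag's attrs dictionary. The sketch below is only an illustration: it uses a shortened copy of the markup from the question, assumes the built-in "html.parser" backend so that lxml does not need to be installed, and tag_attributes is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (assumption: trimmed for brevity)
html = '''<img id="imgsglx2" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style="border: 0;"/>
<textarea id="sgmsbck" name="sgms">On August. 19, 2014</textarea>'''


def tag_attributes(markup, tag_name):
    """Hypothetical helper: return one {attribute: value} dict per matching tag."""
    soup = BeautifulSoup(markup, "html.parser")
    return [dict(tag.attrs) for tag in soup.find_all(tag_name)]


# Every attribute of the first <img> tag as key-value pairs
img_attrs = tag_attributes(html, "img")[0]
for key, value in img_attrs.items():
    print(key, "=", value)

# .get() returns None instead of raising KeyError when an attribute is absent
soup = BeautifulSoup(html, "html.parser")
missing_alt = soup.find("img").get("alt")
print(missing_alt)
```

Using .get() rather than square brackets may be preferable when scraping pages where an attribute is not guaranteed to be present.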

Answer 2

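BeautifulSoup can help:

A tag may have any number of attributes. For example, a tag might have an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: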

tag['class']
# ['boldest']  (class is multi-valued, so bs4 returns a list when using an HTML parser)

You can access that dictionary directly as .attrs:

tag.attrs
# {'class': ['boldest']}

You can get the text from a tag via .text:

tag.text
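
As a rough illustration of the three accessors described in this answer, the snippet below parses a small made-up tag (the markup and the id value "x1" are assumptions, not from the question). It also shows that, with an HTML parser in BeautifulSoup 4, class comes back as a list while an ordinary attribute such as id comes back as a string.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.find("b")

class_value = tag["class"]  # multi-valued attribute: a list of class names
id_value = tag["id"]        # single-valued attribute: a plain string
all_attrs = tag.attrs       # every attribute of the tag as a dictionary
text = tag.get_text()       # text inside the tag, same result as tag.text here

print(class_value)
print(id_value)
print(all_attrs)
print(text)
```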

python

mechanize

beautifulsoup

web-scraping