Extract tag info from HTML text
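I am trying to scrape a web page and got the following text. How can I extract the src value from the string below? Can anyone explain the general process for extracting any key-value data from text like this?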
<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
I also need the text inside the textarea tag.
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
Since you mentioned beautifulsoup in the tags, I assume you want to use it to parse your HTML content.
import bs4
content = """<img id="imgsglx2" onerror="this.alt=not select the picture or pictures cannot be displayed" src="http://114.255.167.200:8092/cidasEN/extend/sglx_images/UTYP/221.jpg" style=" border: 0; padding: 0; margin: 0;height:110px;width:110px; "/>
<textarea id="sgmsbck" name="sgms" style="width:98%;height:120px">On August. 19, 2014\uff0c08:30, Mr. Xiao who drove lu K9**** MPV from south to north along the TaiShang south Road, when Mr. Xiao drove lu K9**** MPV turn west at the crossing of Chengshan road and TaiShang south road, RongCheng City. Due to wrong behavior towards pedestrians at pedestrian crossings, the left part of the lu K9**** MPV impacted with Mr. Song(Pedestrian) from south to north across ChengShan Road of the pedestrian crossings. Causing the lu K9**** MPV damaged, Mr. Song injured.</textarea>
"""
soup = bs4.BeautifulSoup(content, 'lxml')
img = soup.find('img') # locate img tag
text_area = soup.find('textarea') # locate textarea tag
print(img['id'])  # print value of 'id' attribute in img tag
print(img['src'])  # print value of 'src' attribute
print(text_area.text)  # print content in this tag
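If installing bs4 and lxml is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is an alternative to the approach above, not part of it. The `AttrCollector` class and the shortened sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records each tag's attributes and any textarea text."""

    def __init__(self):
        super().__init__()
        self.tags = []            # list of (tag_name, attrs_dict)
        self.textarea_text = []   # text chunks found inside <textarea>
        self._in_textarea = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))
        if tag == "textarea":
            self._in_textarea = True

    def handle_endtag(self, tag):
        if tag == "textarea":
            self._in_textarea = False

    def handle_data(self, data):
        if self._in_textarea:
            self.textarea_text.append(data)


# Shortened stand-in for the markup in the question (values are made up).
html = ('<img id="pic1" src="http://example.com/a.jpg"/>'
        '<textarea id="note" name="n">Hello, world</textarea>')

collector = AttrCollector()
collector.feed(html)
collector.close()  # flush any buffered text

attrs_by_tag = dict(collector.tags)
print(attrs_by_tag["img"]["src"])
print("".join(collector.textarea_text))
```

This keeps only the last occurrence of each tag name in `attrs_by_tag`; for pages with many `img` tags you would iterate over `collector.tags` instead.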
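BeautifulSoup can help:
A tag may have any number of attributes. The tag in this example has an attribute "class" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary: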
tag['class']
# ['boldest']  (bs4 treats class as multi-valued, so a list comes back)
You can access that dictionary directly as .attrs:
tag.attrs
# {'class': ['boldest']}
You can get the text from a tag through .text:
tag.text
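To tie this back to the question about pulling out "any key-value data", here is a minimal sketch that dumps every tag's attributes with `.attrs`. It assumes bs4 is installed and uses Python's bundled `html.parser`, so lxml is not required. The sample markup is a shortened, made-up stand-in for the markup in the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (values are made up).
html = ('<img id="pic1" class="thumb big" src="http://example.com/a.jpg"/>'
        '<textarea id="note">Hello</textarea>')

# "html.parser" is the parser bundled with Python, so lxml is not needed here.
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; .attrs is a plain dict of its attributes.
all_attrs = {tag.name: dict(tag.attrs) for tag in soup.find_all(True)}

img = soup.find("img")
src = img.get("src")       # .get() returns None instead of raising KeyError
missing = img.get("alt")   # attribute absent in this sample
classes = img["class"]     # multi-valued attributes come back as a list
text = soup.find("textarea").get_text()

print(all_attrs)
print(src, missing, classes, text)
```

Using `.get()` rather than `tag['src']` is a reasonable default when scraping, since real pages often have tags where the attribute is missing.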