Parse HTML in Python using BeautifulSoup
How can I get the following output from an HTML page?
>html_string='''<td class="status_icon" rowspan="2"><img alt="QUEUED" src="images/arts/status_QUEUED.png" style="border:none" title="QUEUED"/></td>
><td class="test"> v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
> <div class="start">(04.02) 23:29</div>
> <div class="end">~
> <span style="color:green"> () </span>
> </div>
></td>
><td>mcordeix</td>
><td>1614809</td>
><td><a href="?command=compoundinfo&test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200 " onmouseover="Tip('compounds completed/running/queued')"target="_blank">0/0/0 of 0</a></td>
><td>high</td>
><td style="white-space:nowrap"><img class="pbar" src="images/arts/bar_green.gif" style="border-right:2px;border-right-style:solid;border-right-color:#ffffff" width="1%"/><img class="pbar" src="images/arts/bar_gray.gif" width="99%"/></td>
><td></td>
><td></td>
><td></td>
><td></td>
><td colspan="4">
><!-- Florent Vial: this can be alway shown if admin=1 -->
><a href="?command=getrequest&test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200" target="_blank">XML</a>
><a href="?command=getrequest&test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200&raw=1" target="_blank">Raw XML</a>
><a href="?command=compoundinfo&test_id=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200" target="_blank">CINFO</a>
></td>
><td></td>
><td><!-- <script type="text/javascript">DIVShowHideDetails('func:DoPrintArtsDetails')</script> --> </td>
><td></td>
><td></td>
><td></td>
><td></td>
'''
Expected output:
-------
Status="QUEUED"
test=v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
start=(04.02) 23:29
end=~
user=mcordeix
Welcome to Stack Overflow!
Please read the How to ask a question section of our FAQ.
Explain how you encountered the problem you're trying to solve, and any difficulties that have prevented you from solving it yourself.
What have you tried so far to solve this problem?
Let me give you a start.
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html_string, 'html.parser')
status = soup.find('img').get('alt')  # 'alt' attribute of the first <img> tag
# Find the first <td> with class="test", take its text, split it on whitespace,
# and keep the first substring.
version = soup.find('td', class_='test').text.split()[0]
time_start = soup.find('div', class_='start').text
time_end = soup.find('div', class_='end').text
user = soup.find_all('td')[2].text  # text of the third <td>
print(status)      # QUEUED
print(version)     # v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
print(time_start)  # (04.02) 23:29
print(time_end)    # ~ > () >  (the nested <span> text is still included)
print(user)        # mcordeix
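This is just the result of reading bs4's documentation for about 10 minutes and then trying it myself.
Just fire up a Python interpreter, assign the html_string variable, import the BeautifulSoup library, and experiment.
I'm sure you can work out the remaining problem with the time_end content yourself. It's not that hard.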
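One possible way to handle the leftover time_end issue is to keep only the text that sits directly inside the <div class="end"> element and skip the nested <span>. The sketch below is not part of the original answer. It assumes the markup is parsed with 'html.parser', that html_string is a shortened copy of the question's markup without the leading ">" quote markers, and that extract_fields is a hypothetical helper name introduced only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's markup (assumption: no leading ">" markers).
html_string = '''<td class="status_icon" rowspan="2"><img alt="QUEUED" src="images/arts/status_QUEUED.png" style="border:none" title="QUEUED"/></td>
<td class="test"> v1402beta_150127_1_OTM_TICKETS_dv_c_UID142307274200
 <div class="start">(04.02) 23:29</div>
 <div class="end">~
 <span style="color:green"> () </span>
 </div>
</td>
<td>mcordeix</td>'''


def extract_fields(html):
    """Hypothetical helper: return the five requested fields as a dict."""
    soup = BeautifulSoup(html, 'html.parser')
    end_div = soup.find('div', class_='end')
    # Join only the text nodes that are direct children of the <div>,
    # which leaves out the text inside the nested <span>.
    direct_text = ''.join(
        child for child in end_div.children
        if isinstance(child, NavigableString)
    )
    return {
        'status': soup.find('img').get('alt'),
        'test': soup.find('td', class_='test').text.split()[0],
        'start': soup.find('div', class_='start').text.strip(),
        'end': direct_text.strip(),
        'user': soup.find_all('td')[2].text.strip(),
    }


fields = extract_fields(html_string)
print('Status="%s"' % fields['status'])
print('test=%s' % fields['test'])
print('start=%s' % fields['start'])
print('end=%s' % fields['end'])
print('user=%s' % fields['user'])
```

If the real page keeps stray characters next to the "~", taking the first whitespace-separated token of the direct text (direct_text.split()[0]) may be a more forgiving choice than strip().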