Parsing a section of an HTML file to CSV
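I am new to Python. I am trying to get all the answers given by the executives (listed at the top) of a web page (https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0). The page is saved on my hard drive, so there is no URL to fetch.
My desired end result is:
Column 1
All executives
Column 2
all the answers
The answers should come only from the "question-and-answer-section".
This is what I have tried: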
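from bs4 import BeautifulSoup

# The page is a local file, so it is opened directly rather than downloaded.
with open('transcript-86-855.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Tag names are lowercase after parsing, so search for 'div', not 'DIV'.
article_qanda = soup.find('div', id='article_qanda')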
Can anyone help me?
If I understand you correctly, you want to print two columns: one with the name (in this example, Dror Ben Asher) and one with his answers.
For example:
import textwrap
from bs4 import BeautifulSoup

with open('page.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)

# Select the first <p> after each "Dror Ben Asher" heading in the Q&A part.
# Note: newer soupsieve releases prefer :-soup-contains() over :contains().
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    txt = answer.get_text(strip=True)

    # Append the following paragraphs until the next speaker heading.
    s = answer.find_next_sibling()
    while s:
        if s.name == 'strong' or s.find('strong'):
            break
        if s.name == 'p':
            txt += ' ' + s.get_text(strip=True)
        s = s.find_next_sibling()

    # Wrap the answer and indent continuation lines under the second column.
    txt = ('\n' + ' '*31).join(textwrap.wrap(txt))

    print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt))
    print()
Prints:
Name                           Answer
-----------------------------------------------------------------------------------------------------
Dror Ben Asher - CEO           Thank you, Scott. Its a very good question indeed in January we
                               announced a new amendment and that amendment includes anti-TNF
                               patients some of them not all of them, those who qualify. And we are
                               talking about anti-TNF failures to be clear and only Remicade and
                               Humira. The idea here was to increase very significantly the patients
                               pooled of those potentially eligible for the study thus expediting
                               recruitment. Did I answer your question?

Dror Ben Asher - CEO           Right, this is one of most important tasks; right now the most
                               important item here is the divestment of non-core assets. All other
                               non-core assets, the non-core assets are those that are not within our
                               therapeutic focus of GI and inflammation. And those are specifically
                               RHB-103 RIZAPORT for migraine and RHB-101 which is a cardio drug.
                               RHB-101 is a legacy drug, we have recently announced last month, we
                               announced that we are in discussions for both of these product for
                               out-licensing, which we hope to complete in the first half of 2015. So
                               this is the highest priority, obviously discussion on other product,
                               but Redhill is in the fortunate position that we are able to complete
                               our Phase III studies with our existing results, resources and as time
                               goes by obviously the value of the assets keeps going up. So we are in
                               no rush to out-license everything else and so there is obviously in
                               track.

...and so on.
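
Since the question asks for a CSV file with one column of executives and one of answers, here is a minimal sketch of how the same sibling-walking idea might be turned into rows and written with the standard `csv` module. Everything named here is an assumption for illustration: the helpers `collect_answers` and `write_csv`, the `SAMPLE_HTML` snippet (a made-up miniature of the transcript with a placeholder analyst), and the `article_qanda` id, which is taken from the question's attempt and not verified against the real page. The real file may be laid out differently, so treat this as a starting point rather than a finished solution.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the transcript: each speaker name sits in a
# <strong> tag, followed by one or more <p> siblings with what was said.
SAMPLE_HTML = """
<div id="article_qanda">
<p>Question-and-Answer Session</p>
<strong>Analyst (placeholder)</strong>
<p>Placeholder question?</p>
<strong>Dror Ben Asher - CEO</strong>
<p>Thank you, Scott.</p>
<p>Did I answer your question?</p>
<strong>Operator</strong>
<p>Next question.</p>
</div>
"""


def collect_answers(html, executives):
    """Return (speaker label, answer text) pairs for the given executives."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumed container id, copied from the question's own attempt.
    section = soup.find('div', id='article_qanda')
    rows = []
    if section is None:
        return rows
    for speaker in section.find_all('strong'):
        label = speaker.get_text(strip=True)
        if not any(name in label for name in executives):
            continue
        # If the name is wrapped as <p><strong>..</strong></p>, walk from the <p>.
        start = speaker.parent if speaker.parent.name == 'p' else speaker
        parts = []
        for sibling in start.find_next_siblings():
            # The next speaker heading ends the current answer.
            if sibling.name == 'strong' or sibling.find('strong'):
                break
            if sibling.name == 'p':
                parts.append(sibling.get_text(strip=True))
        rows.append((label, ' '.join(parts)))
    return rows


def write_csv(rows, out):
    """Write a header line plus one CSV line per answer to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(['Executive', 'Answer'])
    writer.writerows(rows)


# Usage on the sample: write into an in-memory buffer and show the result.
buffer = io.StringIO()
write_csv(collect_answers(SAMPLE_HTML, ['Dror Ben Asher']), buffer)
print(buffer.getvalue())
```

To write a real file instead of a buffer, open it with `newline=''` as the `csv` documentation recommends, for example `open('answers.csv', 'w', newline='')`, where `answers.csv` is just a placeholder name. Passing several names in `executives` should collect the answers of all of them, provided the labels in the page contain those names.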