Removing items from a list if they do not contain '<a href'?
I have data stored in a list like this:
date_name = [<a href="/president/washington/speeches/speech-3455">Proclamation of Neutrality (April 22, 1793)</a>,
<a class="transcript" href="/president/washington/speeches/speech-3455">Transcript</a>,
<a href="/president/washington/speeches/speech-3456">Fifth Annual Message to Congress (December 3, 1793)</a>,
<a class="transcript" href="/president/washington/speeches/speech-3456">Transcript</a>,
<a href="/president/washington/speeches/speech-3721">Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794)</a>]
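The elements of date_name are not str objects. I am trying to extract Proclamation of Neutrality (April 22, 1793), Fifth Annual Message to Congress (December 3, 1793), and Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794), so that I can get the date from each speech. I want to do this for more than 900 speeches. Here is the code I have been trying, since it worked for a similar problem I ran into in another list comprehension scenario: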
url = 'http://www.millercenter.org/president/speeches'
connection = urllib2.urlopen(url)
html = connection.read()
date_soup = BeautifulSoup(html)
date_name = date_soup.find_all('a')
del date_name[:203] # delete extraneous html before first link (for obama 4453)
# do something with the following list comprehensions
dater = [tag.get('<a href=') for tag in date_name if tag.get('<a href=') is not None]
# remove all items in list that don't contain '<a href=', as this string is unique
# to the elements in date_name that I want
speeches_dates = [_ for _ in dater if re.search('<a href=',_)]
However, the dater list comprehension gives me an empty list, so I cannot go on to build speeches_dates.
What you are seeing is a ResultSet, which is a list of Tag instances. When you print a Tag, you get its HTML string representation. What you need is to get the text:
date_name = date_soup.find_all('a')[203:]  # skip the first 203 links, as in the question
print([item.get_text(strip=True) for item in date_name])
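As for why dater comes back empty: Tag.get() looks up an attribute by its name, so the key would be 'href', not the literal markup '<a href='. The following is only a minimal sketch of that difference. It assumes an inline HTML snippet copied from the question (a stand-in for the live page) and the built-in "html.parser"; filtering on the absence of a class attribute is an assumption drawn from the sample data, not something the site guarantees.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a stand-in for the live page.
html = '''
<a href="/president/washington/speeches/speech-3455">Proclamation of Neutrality (April 22, 1793)</a>
<a class="transcript" href="/president/washington/speeches/speech-3455">Transcript</a>
<a href="/president/washington/speeches/speech-3456">Fifth Annual Message to Congress (December 3, 1793)</a>
'''

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# No attribute is literally named '<a href=', so every lookup returns None.
missing = [tag.get("<a href=") for tag in links]

# Assumption: in the sample, title links are the ones without a class attribute,
# while the unwanted "Transcript" links carry class="transcript".
title_links = [tag for tag in links if tag.get("class") is None]

titles = [tag.get_text(strip=True) for tag in title_links]
hrefs = [tag.get("href") for tag in title_links]
```

Here titles holds the two speech names with their dates, and hrefs holds the matching relative URLs.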
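Also, as far as I understand, you need the links to the speeches (the ones in the main content, whose text contains the date). In that case, make the locator more specific instead of targeting all a tags: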
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.millercenter.org/president/speeches'
date_soup = BeautifulSoup(urllib2.urlopen(url), "lxml")
speeches = date_soup.select('div#listing div.title a[href*=speeches]')
for speech in speeches:
    text = speech.get_text(strip=True)
    print(text)
Prints:
Acceptance Speech at the Democratic National Convention (August 28, 2008)
Remarks on Election Night (November 4, 2008)
Inaugural Address (January 20, 2009)
...
Talk to the Cherokee Nation (August 29, 1796)
Farewell Address (September 19, 1796)
Eighth Annual Message to Congress (December 7, 1796)
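Since the stated goal is to get the date from each speech, one possible next step is to split each printed line into a title and a date. This is a rough sketch, not part of the original answer: the helper name split_title_and_date and the pattern DATE_RE are hypothetical, and it assumes every title ends with a date in the form "(Month D, YYYY)" with English month names, as in the output above.

```python
import re
from datetime import datetime

# Assumption: the date is always the trailing "(Month D, YYYY)" part of the title.
DATE_RE = re.compile(r"\(([A-Z][a-z]+ \d{1,2}, \d{4})\)\s*$")

def split_title_and_date(text):
    """Return (title, date), or (text, None) when no trailing date is found."""
    match = DATE_RE.search(text)
    if match is None:
        return text, None
    # %B expects a full English month name under the default (C) locale.
    date = datetime.strptime(match.group(1), "%B %d, %Y").date()
    return text[:match.start()].strip(), date

title, date = split_title_and_date("Farewell Address (September 19, 1796)")
```

Lines without a trailing date, such as "Transcript", come back unchanged with None as the date, so they can be filtered out afterwards.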