Trying to isolate 1 column with Beautiful Soup
I am trying to isolate the Location column and eventually output it to a database file. My code is as follows:
import urllib
import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
trs = soup.find_all('td')
for tr in trs:
    for link in tr.find_all('a'):
        fulllink = link.get('href')
    tds = tr.find_all("tr")
    location = str(tds[3].get_text())
    print location
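But I always end up with one of two results: either a "list index out of range" error, or the script finishes with exit code 0. I'm not sure about BeautifulSoup, as I'm still trying to learn it, so any help is appreciated. Thanks!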
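You just need to swap the td and tr tags in your code. Also, be careful with the str() function, because the page may contain unicode strings that cannot be converted to a plain ASCII string. Your code should be: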
import urllib
import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
trs = soup.find_all('tr')  # 'tr' instead of 'td': iterate over rows
for tr in trs:
    for link in tr.find_all('a'):
        fulllink = link.get('href')
    tds = tr.find_all("td")  # 'td' instead of 'tr': cells of this row
    if len(tds) > 3:  # skip header rows and rows with fewer than 4 cells
        location = tds[3].get_text()  # no str(): keep the text as unicode
        print location
And voilà!
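To illustrate the str() warning above: in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails on any non-ASCII character. The following is only a rough sketch written for Python 3, where that encode step has to be spelled out explicitly; the sample text and the helper name to_ascii_or_none are made up for the illustration and are not taken from the page.

```python
# Python 3 sketch. The explicit .encode("ascii") below mimics what
# Python 2's str() does implicitly to a unicode object.

# Hypothetical cell text containing a non-ASCII character.
location = "C\u00f4te d'Ivoire"


def to_ascii_or_none(text):
    """Return the ASCII-encoded bytes of text, or None if encoding fails."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


ascii_result = to_ascii_or_none(location)  # None: the accented 'o' is not ASCII
plain_result = to_ascii_or_none("Chad")    # b"Chad": pure ASCII text encodes fine
utf8_result = location.encode("utf-8")     # UTF-8 can represent the full text
```

Keeping the text as unicode, as the corrected code does, avoids the failure altogether; encoding to UTF-8 is one option if bytes are needed later.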
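There is a simpler way to find the Location column: use the table.wikitable tr CSS selector, find all td elements in each row, and take the 4th td by index.
Also, if a cell contains multiple locations, each one needs to be handled separately: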
import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/List_of_ongoing_armed_conflicts"
soup = BeautifulSoup(urllib2.urlopen(url))
for row in soup.select('table.wikitable tr'):
    cells = row.find_all('td')
    if cells:
        for text in cells[3].find_all(text=True):
            text = text.strip()
            if text:
                print text
Prints:
Afghanistan
Nigeria
Cameroon
Niger
Chad
...
Iran
Nigeria
Mozambique
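Since the question mentions eventually writing the locations to a database file, here is a minimal sketch of that last step using Python's standard sqlite3 module. The function name save_locations, the table name locations, and the sample list are assumptions made for illustration; nothing in the thread specifies a database or schema. It is written for Python 3 and uses an in-memory database by default, so pass a file path to get an actual file.

```python
import sqlite3


def save_locations(locations, db_path=":memory:"):
    """Insert location strings into a one-column table; return the connection."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: a single text column holding one location per row.
    conn.execute("CREATE TABLE IF NOT EXISTS locations (name TEXT NOT NULL)")
    # Parameterized inserts keep quoting and unicode handling out of our hands.
    conn.executemany(
        "INSERT INTO locations (name) VALUES (?)",
        [(name,) for name in locations],
    )
    conn.commit()
    return conn


# Stand-in for the strings collected by the scraping loop.
sample = ["Afghanistan", "Nigeria", "Cameroon"]
conn = save_locations(sample)
rows = [row[0] for row in conn.execute("SELECT name FROM locations ORDER BY rowid")]
print(rows)
```

In the scraping loop, the strings would be appended to a list instead of printed, and that list passed to the function once at the end.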