Missing one column and redundant whitespaces/newlines in webpage scraping using lxml in Python 2.7
I'm trying to scrape this page in Python to get the biggest table on that page into a CSV. I'm mostly following the answer here.
But I'm facing two issues:
- The Strike Price column is missing.
- Writing the data to the CSV is misaligned because of an aberrant string containing a multitude of "\r" characters and ending with a single "\n". This puts a lot of whitespace into the CSV.
Below is the code I'm using. Please help me fix both issues.
from urllib2 import Request, urlopen
from lxml import etree
import csv

ourl = "http://www.nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?segmentLink=17&instrument=OPTIDX&symbol=NIFTY&date=31DEC2015"
headers = {'Accept': '*/*',
           'Accept-Language': 'en-US,en;q=0.5',
           'Host': 'nseindia.com',
           'Referer': 'http://www.nseindia.com/live_market/dynaContent/live_market.htm',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/35.0',
           'X-Requested-With': 'XMLHttpRequest'}

req = Request(ourl, None, headers)
response = urlopen(req)
the_page = response.read()

ptree = etree.HTML(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]/tr')
header = [i[0].text for i in tr_nodes[0].xpath("th")]
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

with open("nseoc.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(td_content)
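As a side note on the whitespace problem: stripping each cell before writing keeps the CSV records aligned. A minimal sketch with simulated data (no network; Python 3 here for illustration, but `strip()` behaves the same in 2.7):

```python
import csv
import io

# Simulated scraped cells: one aberrant value is padded with many "\r"
# characters and ends with a "\n", like the strings described above.
rows = [
    ["2700.00", "\r\r\n   5,179.00 \r\r\r\n", "1,350"],
    ["2800.00", "-", "1,200"],
]

# strip() removes surrounding spaces, "\r" and "\n" from every cell,
# so csv.writer emits exactly one record per row.
cleaned = [[cell.strip() for cell in row] for row in rows]

buf = io.StringIO()
csv.writer(buf).writerows(cleaned)
print(buf.getvalue())
```

Note that `csv.writer` quotes values such as "1,350" that contain the delimiter, so embedded commas do not break alignment either.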
First of all, I would use the lxml.html package, get the text_content() of every cell, and then apply strip():
from lxml.html import fromstring

ptree = fromstring(the_page)
tr_nodes = ptree.xpath('//table[@id="octable"]//tr')[1:]
td_content = [[td.text_content().strip() for td in tr.xpath('td')]
              for tr in tr_nodes[1:]]
Here is what td_content looks like:
[
['', '700', '-', '-', '-', '5,179.00', '-', '1,350', '4,972.25', '5,006.15', '450', '2700.00', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', ''],
['', '-', '-', '-', '-', '-', '-', '1,200', '4,710.85', '5,254.15', '150', '2800.00', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', ''],
...
]
Note that the "Strike Price" is there (2700 and 2800).
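The likely reason td.text lost that column in the first place: if the strike price sits inside a child element of the <td> (an assumption about this page's markup, e.g. a link), .text only returns the text before the first child, which is None here, while text_content() gathers text from all descendants. A stdlib sketch of the difference:

```python
import xml.etree.ElementTree as ET

# Hypothetical cell modeled on the strike-price column: the number is
# inside a child <a> element, not direct text of the <td> itself.
td = ET.fromstring('<td><a href="#">2700.00</a></td>')

print(td.text)  # None -> this is how td.text drops the column
print("".join(td.itertext()).strip())  # "2700.00" -> text from all descendants
```

Here itertext() plays the role of lxml's text_content(); both walk the whole subtree instead of stopping at the first child node.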