Python 3.4 : LXML : Parsing Tables
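I am trying to parse the entire table from Yahoo Finance. As I understand it, lxml does not register the 'tbody' and 'thead' tags and treats them as additional TR elements, so I switched the XPath from:
/html/body/div[4]/div[4]/table[2]/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody
to the one used in the code below: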
from lxml import html

url = 'http://finance.yahoo.com/q/is?s=MMM+Income+Statement&annual'
tree = html.parse(url)
# Collect the text of every <td> matched by the hand-edited XPath
tick_content = [td.text_content() for td in tree.xpath('/html/body/div[4]/div[4]/table[2]/tr[3]/td/table[2]/tr[1]/td/table/td[1]')]
print(tick_content)
All I get back is a blank screen. Is there a special way to parse tables, or is something else wrong?
Rather than use the huge long XPath generated by Chrome, just search for the table with the yfnc_tabledata1 class; there is only one:
>>> tree.xpath("//table[@class='yfnc_tabledata1']")
[<Element table at 0x10445e788>]
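As a side note on why a path copied from Chrome can come back empty: the browser adds <tbody> to its DOM when the markup omits it, while lxml parses the markup as served and does not add one. A <td> is also a child of <tr>, not of <table>, so a path that skips the row level matches nothing. The following is a minimal sketch on made-up inline markup (the table content is an assumption for illustration, not the real Yahoo page):

```python
from lxml import html

# Hypothetical markup: a table written without <tbody>.
doc = html.fromstring(
    "<html><body>"
    "<table class='yfnc_tabledata1'>"
    "<tr><td>Total Revenue</td><td>31,821,000</td></tr>"
    "</table>"
    "</body></html>"
)

# Browser-style path with tbody: no match, lxml does not insert <tbody>.
with_tbody = doc.xpath("//table/tbody/tr")
# Skipping the row level: no match, <td> is not a direct child of <table>.
no_row_level = doc.xpath("//table/td[1]")
# Class-based lookup that follows the real tr/td nesting.
cells = doc.xpath("//table[@class='yfnc_tabledata1']/tr/td")

print(with_tbody)
print(no_row_level)
print([td.text_content() for td in cells])
```

With the table located by its class, the rest of the path only has to describe the nesting inside that table.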
Get to your <td> from there:
>>> tree.xpath("//table[@class='yfnc_tabledata1']//td[1]")[0].text_content()
'Period EndingDec 31, 2014Dec 31, 2013Dec 31, 2012\n \n Total Revenue\n \n \n \n 31,821,000\xa0\xa0\n \n \n \n 30,871,000\xa0\xa0\n \n \n \n 29,904,000\xa0\xa0\n \n Cost of Revenue16,447,000\xa0\xa016,106,000\xa0\xa015,685,000\xa0\xa0\n \n Gross Profit\n \n \n \n 15,374,000\xa0\xa0\n \n \n \n 14,765,000\xa0\xa0\n \n \n \n 14,219,000\xa0\xa0\n \n \n \n Operating Expenses\n \n Research Development1,770,000\xa0\xa01,715,000\xa0\xa01,634,000\xa0\xa0\n \n Selling General and Administrative6,469,000\xa0\xa06,384,000\xa0\xa06,102,000\xa0\xa0\n \n Non Recurring\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Others\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n \n \n Total Operating Expenses\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Operating Income or Loss\n \n \n \n 7,135,000\xa0\xa0\n \n \n \n 6,666,000\xa0\xa0\n \n \n \n 6,483,000\xa0\xa0\n \n \n \n Income from Continuing Operations\n \n Total Other Income/Expenses Net33,000\xa0\xa041,000\xa0\xa039,000\xa0\xa0\n \n Earnings Before Interest And Taxes7,168,000\xa0\xa06,707,000\xa0\xa06,522,000\xa0\xa0\n \n Interest Expense142,000\xa0\xa0145,000\xa0\xa0171,000\xa0\xa0\n \n Income Before Tax7,026,000\xa0\xa06,562,000\xa0\xa06,351,000\xa0\xa0\n \n Income Tax Expense2,028,000\xa0\xa01,841,000\xa0\xa01,840,000\xa0\xa0\n \n Minority Interest(42,000)(62,000)(67,000)\n \n \n \n Net Income From Continuing Ops4,956,000\xa0\xa04,659,000\xa0\xa04,444,000\xa0\xa0\n \n Non-recurring Events\n \n Discontinued Operations\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Extraordinary Items\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Effect Of Accounting Changes\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Other Items\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Net Income\n \n \n \n 4,956,000\xa0\xa0\n \n \n \n 4,659,000\xa0\xa0\n \n \n \n 4,444,000\xa0\xa0\n \n Preferred Stock And Other Adjustments\n -\n \xa0\n -\n \xa0\n -\n \xa0\n \n Net Income Applicable To Common Shares\n \n \n \n 4,956,000\xa0\xa0\n \n \n \n 4,659,000\xa0\xa0\n \n \n \n 4,444,000\xa0\xa0\n \n '
>>> print(tree.xpath("//table[@class='yfnc_tabledata1']//td[1]")[0].text_content())
Period EndingDec 31, 2014Dec 31, 2013Dec 31, 2012
Total Revenue
31,821,000
30,871,000
29,904,000
Cost of Revenue16,447,000 16,106,000 15,685,000
Gross Profit
15,374,000
14,765,000
14,219,000
Operating Expenses
Research Development1,770,000 1,715,000 1,634,000
Selling General and Administrative6,469,000 6,384,000 6,102,000
Non Recurring
-
-
-
Others
-
-
-
Total Operating Expenses
-
-
-
Operating Income or Loss
7,135,000
6,666,000
6,483,000
Income from Continuing Operations
Total Other Income/Expenses Net33,000 41,000 39,000
Earnings Before Interest And Taxes7,168,000 6,707,000 6,522,000
Interest Expense142,000 145,000 171,000
Income Before Tax7,026,000 6,562,000 6,351,000
Income Tax Expense2,028,000 1,841,000 1,840,000
Minority Interest(42,000)(62,000)(67,000)
Net Income From Continuing Ops4,956,000 4,659,000 4,444,000
Non-recurring Events
Discontinued Operations
-
-
-
Extraordinary Items
-
-
-
Effect Of Accounting Changes
-
-
-
Other Items
-
-
-
Net Income
4,956,000
4,659,000
4,444,000
Preferred Stock And Other Adjustments
-
-
-
Net Income Applicable To Common Shares
4,956,000
4,659,000
4,444,000
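The text above comes back as one long string because the first <td> wraps a nested table. If you want one list per row instead, a possible approach is to walk the innermost <tr> elements and clean each cell. This is only a sketch: the SAMPLE markup and the table_rows helper are hypothetical, modeled on the output above rather than on the actual page source.

```python
from lxml import html

# Hypothetical, simplified markup modeled on the output above.
SAMPLE = """<html><body>
<table class="yfnc_tabledata1"><tr><td>
  <table>
    <tr><td>Period Ending</td><th>Dec 31, 2014</th><th>Dec 31, 2013</th></tr>
    <tr><td>Total Revenue</td><td>31,821,000&nbsp;&nbsp;</td><td>30,871,000&nbsp;&nbsp;</td></tr>
    <tr><td>Cost of Revenue</td><td>16,447,000&nbsp;&nbsp;</td><td>16,106,000&nbsp;&nbsp;</td></tr>
  </table>
</td></tr></table>
</body></html>"""


def table_rows(root):
    """Return one list of cleaned cell strings per innermost table row."""
    rows = []
    # tr[not(.//table)] skips wrapper rows that only hold a nested table.
    for tr in root.xpath("//table[@class='yfnc_tabledata1']//tr[not(.//table)]"):
        # split() with no arguments also drops non-breaking spaces (\xa0).
        cells = [" ".join(c.text_content().split()) for c in tr.xpath("./td | ./th")]
        cells = [c for c in cells if c]  # discard empty spacer cells
        if cells:
            rows.append(cells)
    return rows


rows = table_rows(html.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Against the live page you would pass the parsed tree's root instead, for example table_rows(tree.getroot()), and the row filter may need adjusting to the real markup.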