Parsing forum posts using lxml/python

Question

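When I use the code below, it splits a single div into fifteen items in the list. The problem is that I want the whole post as one item in the list. This is probably because of the <br> tags, but I'm not sure how to fix it.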

from lxml import html
import requests

page = requests.get('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage.html')

tree = html.fromstring(page.text)

details = tree.xpath('//div[contains(@id, "post_message_33583236")]/text()')

print(len(details))  # prints 15

Answer 1

Use XPath to find the element itself (rather than its text nodes) and call its text_content() method:

details = tree.xpath('.//div[contains(@id, "post_message_33583236")]')[0]
print(details.text_content())

Prints:

With all the talk about raising the minimum wage, I think the real issue is that people are not getting a liveable wage anymore.  This applies to many skilled people too in which their job tries to pay them -13hr for -30hr type of work.

Not everyone deserves a raise at walmart or other low paying jobs.  I  think everyone should atleast prove themselves for 6 months to year then  start to gradually get a raise. You cant act a fool and get paid the same as people who work hard and try to move up in life. Even if walmart workers weren't making minimum wage and making  hr, you cant really do much making 22k a year other than live in a  cheap/borderline crime infested area

hr gets you about 50 a month after taxes and health coverage at most jobs and ill list just the basic necessities in life
...
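
To see why /text() yields several items while text_content() yields one, here is a minimal offline sketch. The SAMPLE markup is a made-up stand-in for the forum page (only the id is taken from the question), so treat it as an illustration rather than the real page structure.

```python
from lxml import html

# Hypothetical markup standing in for the forum post; not the real page.
SAMPLE = (
    '<html><body>'
    '<div id="post_message_33583236">First paragraph.<br><br>'
    'Second paragraph with <b>bold</b> text.<br>Third line.</div>'
    '</body></html>'
)

tree = html.fromstring(SAMPLE)
query = '//div[contains(@id, "post_message_33583236")]'

# /text() returns one string per direct text node of the div.
# Every child tag (<br>, <b>, ...) ends one text node and starts another.
pieces = tree.xpath(query + '/text()')

# Selecting the element and calling text_content() concatenates all
# descendant text, including the text inside <b>, into a single string.
post = tree.xpath(query)[0]
whole = post.text_content()

# The XPath string() function is another way to get the same single string.
same = tree.xpath('string(' + query + ')')

# One post, one list item.
posts = [whole.strip()]
print(len(pieces), len(posts))
```

Note that a <br> contributes no characters of its own, so in this sample the lines run together in the joined string; line breaks only appear in the result when the page source itself contains newline characters.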
