How do I remove all alignment and indentation from BeautifulSoup output with Python?

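I'm trying to pull information out of many different tables at an HTML URL, without any of the HTML indent/tab formatting. I use get_text to produce the content I want, but it prints a lot of whitespace and tabs. I tried .strip, but it didn't do what I wanted.

Here is the Python script I'm using: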

import csv, simplejson, urllib
url="http://www.thecomedystudio.com/schedule.html"
response=urllib.urlopen(url)
from bs4 import BeautifulSoup
html = response
soup = BeautifulSoup(html.read())
text = soup.get_text()
print text

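Ultimately I want to create a CSV of the event calendar, but first I'd like to produce a .txt file or something else that doesn't need much manual cleanup.

Any help is appreciated.

You don't need to "clean up" the HTML in order to parse it with BeautifulSoup.

Just parse the dates and events directly into a CSV file: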

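# Python 2 script: urllib2 and the 'wb' file mode are Python 2 idioms.
import csv
from urllib2 import urlopen

from bs4 import BeautifulSoup


url = "http://www.thecomedystudio.com/schedule.html"
soup = BeautifulSoup(urlopen(url))

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)

    # Each date is a <b> directly inside a centered <div> within a table cell.
    for item in soup.select('td div[align=center] > b'):
        # Join the text fragments inside <b>, stripping surrounding whitespace.
        date = ' '.join(el.strip() for el in item.find_all(text=True))
        # The event description is in the next <td> after the date's cell.
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)

        writer.writerow([date, event])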
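
To see what the selector and the sibling lookup are doing without fetching the live page, here is a minimal sketch that runs the same extraction on an inline snippet. The markup in SAMPLE_HTML is hypothetical: it is only assumed to resemble one row of the schedule table. The helper name extract_rows is likewise made up for illustration. The sketch is written for Python 3, passes "html.parser" explicitly, and uses string=True, the newer spelling of text=True in recent bs4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble one schedule row: a date cell
# (<td><div align="center"><b>...</b></div></td>) followed by a sibling
# <td> that holds the event description.
SAMPLE_HTML = """
<table>
  <tr>
    <td>
      <div align="center"><b>Fri<br>
        2.27.15</b></div>
    </td>
    <td>
        Rick Canavan hosts with Christine An.
    </td>
  </tr>
</table>
"""


def extract_rows(html):
    """Return (date, event) pairs found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("td div[align=center] > b"):
        # Text fragments inside <b> are stripped one by one, then joined.
        date = " ".join(el.strip() for el in item.find_all(string=True))
        # <b> -> <div> -> <td>, then the next <td> sibling is the event cell.
        event = item.parent.parent.find_next_sibling("td").get_text(strip=True)
        rows.append((date, event))
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Note that get_text(strip=True) trims each text fragment at its ends only, so an event description that wraps across several lines inside a single cell would keep its internal line breaks.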

After running the script that writes the CSV, output.csv contains:

Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
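
If a plain-text dump is still wanted as an intermediate step, the whitespace in the get_text() result can be normalized line by line. Calling .strip() on the whole string only trims its two ends, which is probably why it appeared to do nothing. The following is a minimal sketch, not part of the script above: the function name tidy_text and the sample string are made up for illustration, and it assumes Python 3.

```python
def tidy_text(raw):
    """Collapse runs of whitespace within each line and drop blank lines."""
    lines = []
    for line in raw.splitlines():
        # split() with no arguments splits on any run of whitespace,
        # so joining with one space removes tabs and repeated spaces.
        cleaned = " ".join(line.split())
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)


# Stand-in for the string returned by soup.get_text().
sample = "\n\n\t\tFri \t 2.27.15\n   \n\t Rick Canavan hosts   with Christine An.\t\n\n"
tidied = tidy_text(sample)
print(tidied)
```

In the original script this would be applied as tidy_text(soup.get_text()). BeautifulSoup also offers the stripped_strings generator, which yields each text fragment already stripped and may be enough on its own.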