Read CSV file with BeautifulSoup
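After scraping some information from a website, I had to save the file with the raw HTML code, because I could not find a way to get the text out of the list returned by find_all.
Now I have the data, but I cannot extract the text, because bs4 does not recognize the list format.
This is my code for opening the file: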
from csv import reader

with open('/my_file.csv', 'r') as read_obj:
    csv_reader = reader(read_obj)
    list_of_rows = list(csv_reader)
    print(list_of_rows)
This is the format of the list:
[['', '0', '1', '2', '3'], ['0','<span class="item">Red <small>col.</small></span>',
'<span class="item">120 <small>cc.</small></span>',
'<span class="item">Available <small>in four days</small></span>',
'<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
['1', '<span class="item">Blue <small>col.</small></span>',
'<span class="item">200 <small>cc.</small></span>',
'<span class="item">Available <small>in a week</small></span>',
'<span class="item">04 mar <small></small></span>'],
['0', '<span class="item">Green <small>col.</small></span>',
'<span class="item">Available <small>immediately</small></span>',
'<span class="item"><small class="txt-highlight-red">2 hours</small></span>']]
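Is there a way to read a CSV file with BeautifulSoup and then parse it?
The goal is to keep only the text and remove everything between '<>' (including the <> symbols themselves).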
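You can write a function that builds a BeautifulSoup object from each cell and returns its text. If there are no tags or content to parse, the value is left as it is.
Also, I would rather just read that CSV with pandas.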
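import pandas as pd
from bs4 import BeautifulSoup

df = pd.read_csv('/my_file.csv')

def foo_bar(x):
    # Cells that are not strings (NaN, numbers) raise inside BeautifulSoup;
    # return those unchanged.
    try:
        return BeautifulSoup(x, 'lxml').text
    except Exception:
        return x

print('Parsing html in table...')
# On pandas 2.1+, applymap is deprecated in favour of DataFrame.map.
df = df.applymap(foo_bar)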
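If you would rather stay with the csv module from the question, the same idea can be sketched without pandas. This is only an illustration under assumptions: strip_tags and the in-memory io.StringIO sample are hypothetical stand-ins (the real file would be opened with open()), and it uses the built-in 'html.parser' instead of 'lxml' so that no extra parser package is needed.

```python
import csv
import io

from bs4 import BeautifulSoup


def strip_tags(cell):
    """Return the visible text of an HTML fragment; non-strings pass through."""
    if not isinstance(cell, str):
        return cell
    # "html.parser" ships with Python, so no extra parser install is needed.
    # The " " separator plus strip=True joins text pieces with single spaces.
    return BeautifulSoup(cell, "html.parser").get_text(" ", strip=True)


# Hypothetical in-memory stand-in for the CSV file on disk.
raw = io.StringIO()
csv.writer(raw).writerows([
    ["", "0", "1"],
    ["0", '<span class="item">Red <small>col.</small></span>',
     '<span class="item">120 <small>cc.</small></span>'],
    ["1", '<span class="item">Blue <small>col.</small></span>',
     '<span class="item">04 mar <small></small></span>'],
])
raw.seek(0)

# Read the rows back with csv.reader and clean every cell.
cleaned_rows = [[strip_tags(cell) for cell in row] for row in csv.reader(raw)]
print(cleaned_rows)
```

Passing strip=True also drops the trailing space that an empty <small></small> would otherwise leave behind.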
Sample input:
df = pd.DataFrame([['0','<span class="item">Red <small>col.</small></span>',
'<span class="item">120 <small>cc.</small></span>',
'<span class="item">Available <small>in four days</small></span>',
'<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
['1', '<span class="item">Blue <small>col.</small></span>',
'<span class="item">200 <small>cc.</small></span>',
'<span class="item">Available <small>in a week</small></span>',
'<span class="item">04 mar <small></small></span>'],
['0', '<span class="item">Green <small>col.</small></span>',
'<span class="item">Available <small>immediately</small></span>',
'<span class="item"><small class="txt-highlight-red">2 hours</small></span>']], columns = ['', '0', '1', '2', '3'])
Original table:
print (df.to_string())
0 1 2 3
0 0 <span class="item">Red <small>col.</small></span> <span class="item">120 <small>cc.</small></span> <span class="item">Available <small>in four da... <span class="item"><small class="txt-highlight...
1 1 <span class="item">Blue <small>col.</small></s... <span class="item">200 <small>cc.</small></span> <span class="item">Available <small>in a week<... <span class="item">04 mar <small></small></span>
2 0 <span class="item">Green <small>col.</small></... <span class="item">Available <small>immediatel... <span class="item"><small class="txt-highlight... None
Output:
print (df.to_string())
0 1 2 3
0 0 Red col. 120 cc. Available in four days 15 min
1 1 Blue col. 200 cc. Available in a week 04 mar
2 0 Green col. Available immediately 2 hours None
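The None in the last row is a missing cell, not a string, so it should pass through untouched. A minimal sketch of a variant that checks the type explicitly instead of relying on an exception, and that picks whichever element-wise method the installed pandas offers, might look like this. The name html_to_text, the shortened sample frame and the use of 'html.parser' are assumptions made for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(value):
    # Leave missing values (None/NaN) and numbers untouched.
    if not isinstance(value, str):
        return value
    return BeautifulSoup(value, "html.parser").get_text(" ", strip=True)


# Shortened, hypothetical version of the sample frame, including a missing cell.
df = pd.DataFrame(
    [
        ["0", '<span class="item">Red <small>col.</small></span>',
         '<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
        ["0", '<span class="item">Green <small>col.</small></span>', None],
    ],
    columns=["", "0", "1"],
)

# DataFrame.map replaced applymap in pandas 2.1; fall back on older versions.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
clean = elementwise(html_to_text)
print(clean.to_string())
```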