Python BeautifulSoup: create and combine lists and remove redundancies like \n
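How can I combine the complete lists into one DataFrame? When I print it, it seems to show only the first record, and the output also includes \n and other clutter such as ' characters.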
import requests
from requests_html import HTML, HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import csv
import json

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
lehigh = requests.get(url).text
soup = BeautifulSoup(lehigh,'lxml')

for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
    opp_list = []
    opp_list.append(opp.text)
    # print(opp_list)

for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
    conf_list = []
    conf_list.append(conf.text)
    # print(conf_list)

dict = {'opponent':[opp_list],'conference':[conf_list]}
df = pd.DataFrame(dict)
print(df)
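You are resetting opp_list and conf_list to [] on every iteration, so each list only ever holds the last element. Initialize them once, before the loop. Also, you don't need the extra brackets around the lists when creating the dictionary: {'opponent':opp_list,'conference':conf_list}
To remove the whitespace, you can use the .get_text() method with the strip=True and separator= parameters.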
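To see what those two parameters do in isolation, here is a minimal sketch on a made-up HTML fragment (the markup is an assumption for illustration, not copied from the real page; it uses the built-in html.parser so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the whitespace found in scraped markup.
html = """
<div class="sidearm-schedule-game-opponent-text">
    <span>vs</span>
    <a href="#">Drexel</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="sidearm-schedule-game-opponent-text")

raw = div.text                                   # keeps newlines and indentation
joined = div.get_text(strip=True)                # strips each piece, joins with ""
clean = div.get_text(strip=True, separator=" ")  # strips each piece, joins with " "

print(repr(raw))
print(repr(joined))
print(repr(clean))
```

Without separator the stripped pieces are glued together, which is why the opponent column is better read with separator=' '.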
For example, the corrected script:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://lehighsports.com/sports/mens-soccer/schedule/2018'
lehigh = requests.get(url).text
soup = BeautifulSoup(lehigh,'lxml')

# Create each list once, before its loop, so every match is kept.
opp_list = []
for opp in soup.find_all('div',class_="sidearm-schedule-game-opponent-text"):
    # separator=' ' keeps a space between the nested text pieces
    opp_list.append(opp.get_text(strip=True, separator=' '))

conf_list = []
for conf in soup.find_all('div',class_="sidearm-schedule-game-conference-conference"):
    conf_list.append(conf.get_text(strip=True))

# Pass the lists directly (no extra brackets); "data" avoids shadowing the built-in dict.
data = {'opponent':opp_list,'conference':conf_list}
df = pd.DataFrame(data)
print(df)
Prints:
                         opponent       conference
0                        at UConn
1                       vs Drexel
2            at George Washington
3                   at St. John's
4                   vs Binghamton
5                        at Rider
6                         vs Penn
7                         at Army  Patriot League*
8                      vs Cornell
9                     at Boston U  Patriot League*
10                 vs #20 Colgate  Patriot League*
11                        vs Navy  Patriot League*
12                   at Lafayette  Patriot League*
13                   at Dartmouth
14                    vs American  Patriot League*
15                    at Bucknell  Patriot League*
16                at Loyola (Md.)  Patriot League*
17     vs Holy Cross Senior Night  Patriot League*
18  vs No. 3 Colgate (Semifinals)
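One caveat with building two separate lists: pd.DataFrame raises a ValueError if they end up with different lengths, and the pairing relies on both find_all calls returning elements in matching order. A possible alternative, sketched below, is to walk each game container and read both fields from it. The markup and the wrapper class "game" are made up for illustration (the real page's container element is not shown here), and text_or_empty is a hypothetical helper.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup: the wrapper class "game" is hypothetical; the two inner
# class names are the ones used in the script above.
html = """
<ul>
  <li class="game">
    <div class="sidearm-schedule-game-opponent-text"><span>at</span> <a>Army</a></div>
    <div class="sidearm-schedule-game-conference-conference">Patriot League*</div>
  </li>
  <li class="game">
    <div class="sidearm-schedule-game-opponent-text"><span>vs</span> <a>Cornell</a></div>
  </li>
</ul>
"""


def text_or_empty(parent, class_name):
    """Return the cleaned text of the first matching div, or '' if absent."""
    tag = parent.find("div", class_=class_name)
    return tag.get_text(strip=True, separator=" ") if tag else ""


soup = BeautifulSoup(html, "html.parser")

# One dict per game, so opponent and conference always stay on the same row.
rows = [
    {
        "opponent": text_or_empty(game, "sidearm-schedule-game-opponent-text"),
        "conference": text_or_empty(game, "sidearm-schedule-game-conference-conference"),
    }
    for game in soup.find_all("li", class_="game")
]

df = pd.DataFrame(rows)
print(df)
```

A game without a conference div simply gets an empty string, so the rows cannot drift out of alignment. To use this on the real page, the wrapper selector would need to be replaced with whatever element actually encloses each game.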