Why is the search query table displaying table headers, and not data, in BeautifulSoup (Python)?
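I am trying to parse the search results from this link.
Please select:
- School = All
- Sport = Football
- Conference = All
- Year = 2005-2006
- State = All
This search returns 226 entries. I would like to parse all 226 entries and convert them into a pandas DataFrame containing "School", "Conference", "GSR", "FGR" and "State". So far I can parse the table headers, but I cannot parse the data from the table. Please advise with code and an explanation.
Note: I am new to Python and BeautifulSoup.
Code I have tried so far: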
import requests
import lxml.html as lh
import pandas as pd

url = 'https://web3.ncaa.org/aprsearch/gsrsearch'
# Create a handle, page, to handle the contents of the website
page = requests.get(url)
# Store the contents of the website under doc
doc = lh.fromstring(page.content)
# Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
# Create empty list
col = []
i = 0
# For each cell of the first row (the header), store its name and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
# Since our first row is the header, data is stored on the second row onwards
for j in range(1, len(tr_elements)):
    # T is our j'th row
    T = tr_elements[j]
    # If row is not of size 10, the //tr data is not from our table
    if len(T) != 10:
        break
    # i is the index of our column
    i = 0
    # Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        # Leave the first column as text
        if i > 0:
            # Convert any numerical value to integers
            try:
                data = int(data)
            except ValueError:
                pass
        # Append the data to the empty list of the i'th column
        col[i][1].append(data)
        # Increment i for the next column
        i += 1
Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
Output so far:
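You can paste in the headers and payload, and then use .post. I am still learning how to use this properly and am not quite sure exactly what is needed (or what counts as "sensitive info", which is why I redacted some of it... like I said, I'm still learning), but I managed to get it to return JSON.
This returns JSON, which is then simply converted to a DataFrame.
You can get the headers and payload by doing "Inspect" on the page and then clicking XHR (you may need to refresh the page so that gsrsearch appears). Then just click on it and scroll to find them. You will have to add the quotes yourself, though.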
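As a rough illustration of why the lxml attempt in the question only finds headers: the fact that the results show up as a separate XHR request suggests the rows are filled in by JavaScript after the page loads, so the HTML returned by a plain GET presumably contains the header row and an empty table body. The HTML string below is a made-up stand-in for that response, not the real page, and its column names are only examples.

```python
import lxml.html as lh

# Hypothetical stand-in for what requests.get() receives: the header row
# exists, but the body is empty because the rows arrive later via XHR.
static_html = """
<html><body>
<table>
  <thead><tr><th>School</th><th>Conference</th><th>State</th></tr></thead>
  <tbody></tbody>
</table>
</body></html>
"""

doc = lh.fromstring(static_html)
rows = doc.xpath('//tr')

# The first <tr> is the header; everything after it would be data.
header = [cell.text_content().strip() for cell in rows[0]]
data_rows = rows[1:]

print(header)          # the column names are found
print(len(data_rows))  # but there are no data rows to iterate over
```

If the real page behaves like this, no amount of XPath or BeautifulSoup work on the GET response will recover the rows, which is why requesting the XHR endpoint directly is the more promising route.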
Code:
import json
import requests
# json_normalize is importable from the top-level pandas namespace (pandas 1.0+)
from pandas import json_normalize

url = 'https://web3.ncaa.org/aprsearch/gsrsearch'

# Here's where you'll put your headers from Inspect
# (the "..." lines mark entries that were redacted)
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    ...
    ...
    ...
    'X-Requested-With': 'XMLHttpRequest'}

# Here's where you put Form Data from Inspect
payload = {'schoolOrgId': '',
           'conferenceOrgId': '',
           'sportCode': 'MFB',
           'cohortYear': '2005',  # I changed this to year 2005
           'state': '',
           ... }

# POST the form data, then parse the JSON body into a DataFrame
r = requests.post(url, headers=headers, data=payload)
jsonStr = r.text
jsonObj = json.loads(jsonStr)
df = json_normalize(jsonObj)
Output:
print(df)
cohortYear conferenceId ... sportDesc state
0 2005 875 ... Football OH
1 2005 916 ... Football AL
2 2005 916 ... Football AL
3 2005 911 ... Football AL
4 2005 24312 ... Football AL
5 2005 846 ... Football NY
6 2005 916 ... Football MS
7 2005 912 ... Football NC
8 2005 905 ... Football AZ
9 2005 905 ... Football AZ
10 2005 818 ... Football AR
11 2005 911 ... Football AR
12 2005 911 ... Football AL
13 2005 902 ... Football TN
14 2005 875 ... Football IN
15 2005 826 ... Football SC
16 2005 25354 ... Football TX
17 2005 876 ... Football FL
18 2005 5486 ... Football ID
19 2005 821 ... Football MA
20 2005 875 ... Football OH
21 2005 0 ... Football UT
22 2005 865 ... Football RI
23 2005 846 ... Football RI
24 2005 838 ... Football PA
25 2005 875 ... Football NY
26 2005 21451 ... Football IN
27 2005 0 ... Football CA
28 2005 923 ... Football CA
29 2005 825 ... Football CA
.. ... ... ... ... ...
210 2005 0 ... Football MD
211 2005 923 ... Football UT
212 2005 905 ... Football UT
213 2005 21451 ... Football IN
214 2005 911 ... Football TN
215 2005 837 ... Football PA
216 2005 826 ... Football VA
217 2005 821 ... Football VA
218 2005 821 ... Football VA
219 2005 846 ... Football NY
220 2005 821 ... Football NC
221 2005 905 ... Football WA
222 2005 905 ... Football WA
223 2005 825 ... Football UT
224 2005 823 ... Football WV
225 2005 912 ... Football NC
226 2005 853 ... Football IL
227 2005 818 ... Football KY
228 2005 875 ... Football MI
229 2005 837 ... Football VA
230 2005 827 ... Football WI
231 2005 5486 ... Football WY
232 2005 865 ... Football CT
233 2005 853 ... Football OH
234 2005 914 ... Football AR
235 2005 912 ... Football NC
236 2005 826 ... Football NC
237 2005 826 ... Football SC
238 2005 916 ... Football AR
239 2005 912 ... Football SC
[240 rows x 12 columns]
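The question asks for a DataFrame with only "School", "Conference", "GSR", "FGR" and "State". A minimal sketch of that last step is below, assuming the JSON parses to a flat list of records. Only `state` is visible among the columns printed above; the keys `schoolName`, `conferenceName`, `gsr` and `fgr` are hypothetical placeholders, as are the sample values, so check `df.columns` on the real response and adjust the mapping.

```python
import pandas as pd

# Made-up records standing in for the parsed JSON response.
# Every key except 'state' is a hypothetical placeholder.
records = [
    {"schoolName": "School A", "conferenceName": "Conf X",
     "gsr": 70, "fgr": 55, "state": "OH", "sportDesc": "Football"},
    {"schoolName": "School B", "conferenceName": "Conf Y",
     "gsr": 63, "fgr": 50, "state": "AL", "sportDesc": "Football"},
]
df = pd.DataFrame(records)

# Map the (assumed) JSON keys to the column titles wanted in the question.
wanted = {
    "schoolName": "School",
    "conferenceName": "Conference",
    "gsr": "GSR",
    "fgr": "FGR",
    "state": "State",
}

# Keep only the mapped columns, in that order, and rename them.
result = df[list(wanted)].rename(columns=wanted)
print(result)
```

Selecting with a list of keys raises a `KeyError` if any assumed name is missing, which is a quick way to find out that the mapping does not match the real response.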