Concat multiple CSVs with the same column name
I'm having trouble concatenating these pandas DataFrames because I keep getting the error pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm also trying to make my code less clunky and run more smoothly. I'd also like to know if there is a way to get multiple pages into one CSV with Python. Any help would be great.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
t = URL + "&page_number="
URL2 = t + "2"
URL3 = t + "3"
s = requests.Session()
data = []
page = s.get(URL,headers=headers)
page2 = s.get(URL2, headers=headers)
page3 = s.get(URL3, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
soup2 = BeautifulSoup(page2.content, "lxml")
soup3 = BeautifulSoup(page3.content, "lxml")
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup2.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
for row in soup3.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])
final = pd.concat([df1, df2, df3], axis=0)
final.to_csv('Street.csv', encoding='utf-8')
Usually one would loop over the page numbers and concatenate a list of DataFrames, but with only three pages your code is fine as it is. Because for row in ... always appends to the same data list, df1 already is your final DataFrame; you only need to drop the repeated column-name rows:
final = df1[df1['Property ID ↓ Geographic ID ↓']!='Property ID ↓ Geographic ID ↓']
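For completeness, here is a minimal sketch of the loop-and-concat pattern mentioned above. It reuses the soup, soup2 and soup3 objects from the question and assumes each page's table starts with the same header row (which the tr selector picks up, as noted). Building one frame per page and passing ignore_index=True gives the result a fresh, unique index:

frames = []
for soup_page in (soup, soup2, soup3):
    # collect the rows of this page only
    rows = [[c.get_text(' ', strip=True) for c in row.select('td')]
            for row in soup_page.select('#propertysearchresults tr')]
    # the first row holds the column names, the rest is data
    frames.append(pd.DataFrame(rows[1:], columns=rows[0]))
final = pd.concat(frames, ignore_index=True)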
What happens?
As @Zach Young already pointed out, data already holds all the rows you want to turn into one DataFrame. So this is not a pandas problem, but a question of how the information is collected.
How to fix it?
An approach based on the code in your question is to select the table data more specifically - note the tbody in the selector, which excludes the headers:
for row in soup.select('#propertysearchresults tbody tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])
When creating the DataFrame, you can additionally set the column headers:
pd.DataFrame(data, columns=[c.get_text(' ',strip=True) for c in soup.select('#propertysearchresults thead td')])
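To check what that selector picks up, you can print the header texts first (this uses the soup of the first page):

print([c.get_text(' ', strip=True) for c in soup.select('#propertysearchresults thead td')])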
Example
This shows how to iterate over the different pages of the website that contain your table:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
"=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"
s = requests.Session()
data = []
while True:
    page = s.get(URL, headers=headers)
    soup = BeautifulSoup(page.content, "lxml")

    for row in soup.select('#propertysearchresults tbody tr'):
        data.append([c.get_text(' ', strip=True) for c in row.select('td')])

    # follow the "next page" link as long as there is one
    if (a := soup.select_one('#page_selector strong + a')):
        URL = "https://www.collincad.org" + a['href']
    else:
        break

pd.DataFrame(data, columns=[c.get_text(' ', strip=True) for c in soup.select('#propertysearchresults thead td')])
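To end up with a single CSV, as asked in the question, assign the result and write it out (the file name is reused from your original code):

final = pd.DataFrame(data, columns=[c.get_text(' ', strip=True) for c in soup.select('#propertysearchresults thead td')])
final.to_csv('Street.csv', index=False, encoding='utf-8')  # index=False keeps pandas from writing its own row index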
Output
| | Property ID ↓ Geographic ID ↓ | Owner Name | Property Address | Legal Description | 2021 Market Value |
|---|---|---|---|---|---|
| 1 | 2709013 R-10644-00H-0010-1 | PARTHASARATHY SURESH & ANITHA HARIKRISHNAN | 12209 Willowgate Dr Frisco, TX 75035 | Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 | 3,019 |
| ... | ... | ... | ... | ... | ... |
| 61 | 2129238 R-4734-00C-0110-1 | HEPFER ARRON | 990 Willowgate Dr Prosper, TX 75078 | Willow Ridge Phase One, Blk C, Lot 11 | 9,795 |
Instead of your last lines of code:
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data[2:], columns=data[1])
df3 = pd.DataFrame(data[3:], columns=data[2])
final = pd.concat([df1, df2, df3], axis=0)
final.to_csv('Street.csv', encoding='utf-8')
you can use this (it avoids splitting into separate DataFrames and concatenating them):
final = pd.DataFrame(data[1:], columns=data[0]) # Sets the first row as the column names
final = final.iloc[:,1:] # Gets rid of the additional index column
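You can then write the result straight to one CSV; passing index=False stops pandas from adding yet another index column to the file:

final.to_csv('Street.csv', index=False, encoding='utf-8')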