Cleaning up web scrape data and combining it together?
The website URL is https://www.justia.com/lawyers/criminal-law/maine

I only want to scrape the lawyers' names and their office addresses.
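import requests
from bs4 import BeautifulSoup

url = "https://www.justia.com/lawyers/criminal-law/maine"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Names: print the first text node of each profile link
Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))

# Addresses: print every text node inside each address span
address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))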
The names print out just fine, but the addresses print with extra content that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
So the output I am trying to get for each lawyer looks like this (first example):
Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401
The two things I am trying to figure out:
- How can I clean up the address so it is easier to read?
- How can I save each lawyer's name with the matching address so they don't get mixed up?
Use x.get_text() instead of x.find_all:

for x in address:
    print(x.get_text(strip=True))
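Note that get_text(strip=True) joins the pieces with no separator, so the street and the city end up glued together. If you want the two-line layout from the question, one option is to clean each text fragment on its own and join them with a newline. This is only a sketch: clean_address is a hypothetical helper name, and the raw list is a shortened copy of the fragment list printed in the question. On the real page you could pass it x.find_all(text=True) or x.stripped_strings.

```python
import re

def clean_address(fragments):
    """Collapse whitespace inside each fragment and join fragments line by line."""
    # Replace every run of tabs, newlines and spaces with a single space
    lines = [re.sub(r"\s+", " ", fragment).strip() for fragment in fragments]
    # Drop fragments that were whitespace only
    return "\n".join(line for line in lines if line)

# Sample input shaped like the list printed in the question (tabs shortened)
raw = ["\n\t\t\t88 Hammond Street", "\n\t\t\tBangor,\t\t\tME 04401\t\t "]
print(clean_address(raw))
```

Joining with ", " instead of "\n" would give a single-line address if that suits a DataFrame column better.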
Full working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Lawyer names are read from the title attribute of each avatar link
names = [x.get('title').strip() for x in soup.select('a.lawyer-avatar')]
# Address text with surrounding whitespace and tabs removed
addresses = [x.get_text(strip=True).replace('\t', '').strip()
             for x in soup.find_all("span", class_="-address -hide-landscape-tablet")]

# Pair names and addresses by position; columns is a flat list of labels
df = pd.DataFrame(data=list(zip(names, addresses)), columns=['Lawyer_name', 'address'])
print(df)
Output:
Lawyer_name address
0 William T. Bly Esq 119 Main StreetKennebunk,ME 04043
1 John S. Webb 949 Main StreetSanford,ME 04073
2 William T. Bly Esq 20 Oak StreetEllsworth,ME 04605
3 Christopher Causey Esq 16 Middle StSaco,ME 04072
4 Robert Van Horn 88 Hammond StreetBangor,ME 04401
5 John S. Webb 37 Western Ave., Unit #307Kennebunk,ME 04043
6 Hunter J Tzovarras 4 Union Park RoadTopsham,ME 04086
7 Michael Stephen Bowser Jr. 241 Main StreetP.O. Box 57Saco,ME 04072
8 Richard Regan 6 City CenterSuite 301Portland,ME 04101
9 Robert Guillory Esq 75 Pearl St. Suite 400Portland,ME 04101
10 Dylan R. Boyd 160 Capitol StreetP.O. Box 79Augusta,ME 04332
11 Luke Rioux Esq 10 Stoney Brook LaneLyman,ME 04002
12 David G. Webbert 15 Columbia Street, Ste. 301Bangor,ME 04401
13 Amy Fairfield 32 Saco AveOld Orchard Beach,ME 04064
14 Mr. Richard Lyman Hartley 62 Portland Rd., Ste. 44Kennebunk,ME 04043
15 Neal L Weinstein Esq 647 U.S. Route One#203York,ME 03909
16 Albert Hansen 76 Tandberg Trail (Route 115)Windham,ME 04062
17 Russell Goldsmith Esq Two Canal PlazaPO Box 4600Portland,ME 04112
18 Miklos Pongratz Esq 18 Market Square Suite 5Houlton,ME 04730
19 Bradford Pattershall Esq 5 Island View DrCumberland Foreside,ME 04110
20 Michele D L Kenney 12 Silver StreetP.O. Box 559Waterville,ME 04903
21 John Simpson 344 Mount Hope Ave.Bangor,ME 04402
22 Mariah America Gleaton 192 Main StreetEllsworth,ME 04605
23 Wayne Foote Esq 85 Brackett StreetPortland,ME 04102
24 Will Ashe 16 Union StreetBrunswick,ME 04011
25 Peter J Cyr Esq 482 Congress Street Suite 402Portland,ME 04101
26 Jonathan Steven Handelman Esq PO Box 335York,ME 03909
27 Richard Smith Berne 36 Ossipee Trl W.Standish,ME 04084
28 Meredith G. Schmid 75 Pearl St.Suite 216Portland,ME 04101
29 Gregory LeClerc 28 Long Sands Road, Suite 5York,ME 03909
30 Cory McKenna 20 Mechanic StCamden,ME 04843
31 Thomas P. Elias P.O. Box 1049304 Hancock St. Suite 1KBangor,ME...
32 Christopher MacLean 1250 Forest Avenue, Ste 3APortland,ME 04103
33 Zachary J. Smith 415 Congress StreetSuite 202Portland,ME 04101
34 Stephen Sweatt 919 Ridge RoadP.O. BOX 119Bowdoinham,ME 04008
35 Michael Turndorf Esq 1250 Forest Avenue, Ste 3APortland,ME 04103
36 Andrews Bruce Campbell Esq 133 State StreetAugusta,ME 04330
37 Timothy Zerillo 110 Portland StreetFryeburg,ME 04037
38 Walter McKee Esq 440 Walnut Hill RdNorth Yarmouth,ME 04097
39 Shelley Carter 70 State StreetEllsworth,ME 04605
For your second question, you can save them in a dictionary like this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.justia.com/lawyers/criminal-law/maine'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# parse all names and save them in a list
lawyer_names = soup.find_all("a", "url main-profile-link")
lawyer_names = [name.find(text=True).strip() for name in lawyer_names]

# parse all addresses and save them in a list, collapsing runs of whitespace
lawyer_addresses = soup.find_all("span", "-address -hide-landscape-tablet")
lawyer_addresses = [re.sub(r'\s+', ' ', address.get_text(strip=True)) for address in lawyer_addresses]

# map names to addresses
lawyer_dict = dict(zip(lawyer_names, lawyer_addresses))
print(lawyer_dict)
Output dictionary:
{'Albert Hansen': '62 Portland Rd., Ste. 44Kennebunk, ME 04043',
'Amber Lynn Tucker': '415 Congress St., Ste. 202P.O. Box 7542Portland, ME 04112',
'Amy Fairfield': '10 Stoney Brook LaneLyman, ME 04002',
'Andrews Bruce Campbell Esq': '919 Ridge RoadP.O. BOX 119Bowdoinham, ME 04008',
'Bradford Pattershall Esq': 'Two Canal PlazaPO Box 4600Portland, ME 04112',
'Christopher Causey Esq': '949 Main StreetSanford, ME 04073',
'Cory McKenna': '75 Pearl St.Suite 216Portland, ME 04101',
'David G. Webbert': '160 Capitol StreetP.O. Box 79Augusta, ME 04332',
'David Nelson Wood Esq': '120 Main StreetSuite 110Saco, ME 04072',
'Dylan R. Boyd': '6 City CenterSuite 301Portland, ME 04101',
'Gregory LeClerc': '36 Ossipee Trl W.Standish, ME 04084',
'Hunter J Tzovarras': '88 Hammond StreetBangor, ME 04401',
'John S. Webb': '16 Middle StSaco, ME 04072',
'John Simpson': '5 Island View DrCumberland Foreside, ME 04110',
'Jonathan Steven Handelman Esq': '16 Union StreetBrunswick, ME 04011',
'Luke Rioux Esq': '75 Pearl St. Suite 400Portland, ME 04101',
'Mariah America Gleaton': '12 Silver StreetP.O. Box 559Waterville, ME 04903',
'Meredith G. Schmid': 'PO Box 335York, ME 03909',
'Michael Stephen Bowser Jr.': '37 Western Ave., Unit #307Kennebunk, ME 04043',
'Michael Turndorf Esq': '415 Congress StreetSuite 202Portland, ME 04101',
'Michele D L Kenney': '18 Market Square Suite 5Houlton, ME 04730',
'Miklos Pongratz Esq': '76 Tandberg Trail (Route 115)Windham, ME 04062',
'Mr. Richard Lyman Hartley': '15 Columbia Street, Ste. 301Bangor, ME 04401',
'Neal L Weinstein Esq': '32 Saco AveOld Orchard Beach, ME 04064',
'Peter J Cyr Esq': '85 Brackett StreetPortland, ME 04102',
'Richard Regan': '4 Union Park RoadTopsham, ME 04086',
'Richard Smith Berne': '482 Congress Street Suite 402Portland, ME 04101',
'Robert Guillory Esq': '241 Main StreetP.O. Box 57Saco, ME 04072',
'Robert Van Horn': '20 Oak StreetEllsworth, ME 04605',
'Russell Goldsmith Esq': '647 U.S. Route One#203York, ME 03909',
'Shelley Carter': '110 Portland StreetFryeburg, ME 04037',
'Thaddeus Day Esq': '440 Walnut Hill RdNorth Yarmouth, ME 04097',
'Thomas P. Elias': '28 Long Sands Road, Suite 5York, ME 03909',
'Timothy Zerillo': '1250 Forest Avenue, Ste 3APortland, ME 04103',
'Todd H Crawford Jr': '1288 Roosevelt Trl, Ste #3P.O. Box 753Raymond, ME 04071',
'Walter McKee Esq': '133 State StreetAugusta, ME 04330',
'Wayne Foote Esq': '344 Mount Hope Ave.Bangor, ME 04402',
'Will Ashe': '192 Main StreetEllsworth, ME 04605',
'William T. Bly Esq': '119 Main StreetKennebunk, ME 04043',
'Zachary J. Smith': 'P.O. Box 1049304 Hancock St. Suite 1KBangor, ME 04401'}
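One thing to watch with both approaches: zip pairs items purely by position, and a dictionary keeps only one address per name. The DataFrame output above lists John S. Webb twice, and the two outputs pair Hunter J Tzovarras with different addresses, so it seems worth checking that the two lists line up before trusting the result. A minimal sketch of that check, where pair_records is a hypothetical helper and the sample values are copied from rows 1, 3 and 5 of the DataFrame output:

```python
def pair_records(names, addresses):
    """Pair names with addresses by position, refusing mismatched lengths."""
    if len(names) != len(addresses):
        raise ValueError(f"{len(names)} names but {len(addresses)} addresses")
    # A list of tuples keeps every listing, even when a name repeats
    return list(zip(names, addresses))

# Sample values copied from the printed DataFrame (rows 1, 3 and 5)
names = ["John S. Webb", "Christopher Causey Esq", "John S. Webb"]
addresses = [
    "949 Main StreetSanford,ME 04073",
    "16 Middle StSaco,ME 04072",
    "37 Western Ave., Unit #307Kennebunk,ME 04043",
]

records = pair_records(names, addresses)
by_name = dict(records)  # a later duplicate name overwrites the earlier one

print(len(records))
print(len(by_name))
```

A length check only catches missing or extra items. If the name and address selectors return elements in different orders, looping over each listing's container element and reading both fields from inside it would be the safer way to keep them together.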