While loop data not appending to list outside of while loop
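I am trying to scrape data, write it to a pd Series, and then enter a while loop so that on each iteration the remaining pages of the website are appended to the original Series (which lives outside the while loop). I'm not sure why this isn't working. Here is where I'm stuck: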
import requests
import pandas as pd
from bs4 import BeautifulSoup

current_url = 'https://www.yellowpages.com/search?search_terms=hvac&geo_location_terms=97080'

def get_data_run(current_url):
    company_names1 = get_company_name(current_url)
    print(company_names1) #1
    page = 1
    max_page = 3
    company_names1 = paginate(current_url, page, max_page, company_names1)
    print(company_names1) #2

def paginate(current_url, page, max_page, company_names1):
    while (page <= max_page):
        new_url = current_url + f"&page={page}"
        print(new_url)
        company_names = get_company_name(new_url)
        company_names1.append(company_names)
        print(company_names) #3
        print(company_names1) #4
        page +=1
        if page == max_page:
            return company_names1

def get_company_name(url):
    company_names = []
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml')
    box = list(soup.findAll("div", {"class": "result"}))
    for i in range(len(box)):
        try:
            company_names.append(box[i].find("a", {"class": "business-name"}).text.strip())
        except Exception:
            company_names.append("null")
        else:
            continue
    company_names = pd.Series(company_names, dtype='string')
    return company_names

get_data_run(current_url)
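I have labeled the different prints of company_names1 and company_names. Every print of company_names1 shows the same Series of companies, even after the append inside the while loop. What I can't understand is that when I print company_names (#3), it does print the next page's company names. I don't understand why it isn't being appended inside the while loop, and then why the combined Series isn't returned out of the function and printed at print #2. Thanks!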
Update:
Here is some sample output:
When I print #3:
(pyfinance) justinbenfit@MacBook-Pro-3 yellowpages_scrape % /usr/local/anaconda3/envs/pyfinance/bin/python /Users/justinbenfit/Desktop/yellowpages_scrape/test.py
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Mt. Hood Heating Cooling & Refrigeration
22 Chuck's Heating & Cooling
23 Mr. Furnace
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
0 Air-Trix
1 Johnstone Supply
2 Buss Heating & Cooling Inc
3 The Heat Exchange
4 Hoodview Heating & Air Conditioning
5 Loomis Heating Cooling & Refrigeration
6 All About Air Heating & Cooling
7 Hanson Heating
8 Sparks Heating & Cooling
9 Interior Comfort Systems
10 P D X Heating & Cooling
11 Apcom Power Inc
12 Area Heating Inc
13 Four Seasons Heating Air Conditioning & Servic...
14 Perfect Climate Inc
15 Combustion Consultants Inc
16 Classic Heat Source, Inc.
17 Multnomah Heating, Inc
18 Apollo Plumbing, Heating & Air Conditioning - OR
19 Art's Furnace & Air Cond
20 Kurchel Heating
21 P & O Construction Inc
22 Systems Management NW
23 Bridgetown Heating
24 Amana Heating & Air Conditioning Systems
25 QualitySmith
26 Wilbert Jr, Wilson
27 Faith Heating & Air Conditioning Inc
28 Northwest Commercial Heating & Air Conditionin...
29 Heat Master Corp
dtype: string
When I print #1, #2 and #4:
0 Honke Heating & Air Conditioning
1 Climate Kings Heating & Ac
2 Mike's Truck & Auto Service
3 One Hour Heating & Air Conditioning
4 Morgan Heating & Cooling Inc
5 Rnr Heating Venting & Air Conditioning
6 Universal HVAC Inc
7 Mr Furnace
8 Affordable Excellence Heating
9 Green Air Products
10 David Eugene Neketin
11 Century Heating & Air Cond
12 Appliance Wizard
13 Precision Energy Solutions Inc.
14 Portland Heating & Air Conditioning Co
15 Mhc
16 American Pride Heating and Cooling, LLC
17 Tri Star Western
18 Comfort Zone Heat & Air Inc
19 Don's Air-Care Inc
20 Chuck's Heating & Cooling
21 Chuck's Heating & Cooling
22 Mr. Furnace
23 Mt. Hood Heating Cooling & Refrigeration
24 America's Same Day Service
25 Arctic Commercial Refrigeration LLC
26 Apex Refrigeration
27 Ben's Heating & Air Conditioning LLC
28 David's Appliance Inc
29 Wolcott Heating & Cooling
dtype: string
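The problem is that you are treating a pd.Series like a list. list.append modifies the list in place, whereas Series.append does not modify the Series it is called on: it returns a new Series and leaves the original unchanged. (Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement.) Appending data to a list works like this: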
lst = [1,2,3]
lst.append(4)
print(lst)
# [1, 2, 3, 4]
The object changes without an explicit assignment. If you do the same with a Series, this is what happens:
series = pd.Series([1,2,3])
series.append(pd.Series([4]))
print(series)
The output is:
0 1
1 2
2 3
dtype: int64
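So, to update the Series, you must assign the result back to the original name or to a new one. Without the assignment, the new Series is discarded: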
series = pd.Series([1,2,3])
series = series.append(pd.Series([4]))
print(series)
Output:
0 1
1 2
2 3
0 4
dtype: int64
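On pandas versions where Series.append is no longer available, the same reassignment pattern can be written with pd.concat. The snippet below is a minimal sketch of that idea rather than part of the original answer; ignore_index=True is an optional choice made here so the combined Series is renumbered instead of repeating index 0.

```python
import pandas as pd

series = pd.Series([1, 2, 3])

# pd.concat returns a new Series, so the result still has to be assigned back.
# ignore_index=True renumbers the index 0..n-1 instead of repeating label 0.
series = pd.concat([series, pd.Series([4])], ignore_index=True)

print(series.tolist())
# [1, 2, 3, 4]
```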
Since your problem is in the paginate function, you should change this line:
company_names1.append(company_names)
to:
company_names1 = company_names1.append(company_names)
and everything works as expected.
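An alternative that avoids reassigning inside the loop is to collect each page's Series in an ordinary list and combine them once at the end. The following is only a sketch under assumptions: the fetch parameter and the fake_fetch stand-in are hypothetical names introduced here so the example runs without network access, and in the real script get_company_name would be passed instead. It also uses pd.concat rather than Series.append.

```python
import pandas as pd

def paginate(current_url, first_page, max_page, fetch):
    """Fetch pages first_page..max_page and return one combined Series."""
    pages = []
    for page in range(first_page, max_page + 1):
        new_url = f"{current_url}&page={page}"
        # list.append mutates the list in place, so no reassignment is needed.
        pages.append(fetch(new_url))
    # Combine once; pd.concat raises ValueError if the list is empty.
    return pd.concat(pages, ignore_index=True)

# Hypothetical stand-in for get_company_name: returns two fake names per page.
def fake_fetch(url):
    page = url.rsplit("=", 1)[-1]
    return pd.Series([f"company-{page}-a", f"company-{page}-b"], dtype="string")

names = paginate("https://example.com/search?q=hvac", 1, 3, fake_fetch)
print(len(names))
# 6
```

Unlike the original loop, this version returns only after the last page has been fetched, so no page is skipped by an early return.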