My code loops through the URLs but not the pages within them (Python)
I am trying to extract names and comments from these URLs. My code loops through the URLs, but not through the pages inside each URL. len(name) gives 37.
import requests
from bs4 import BeautifulSoup

urls = ['https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/',
        'https://www.f150forum.com/f118/adaptive-cruise-control-sensor-blockage-446041/']

name = []
for url in urls:
    with requests.Session() as req:
        for item in range(1, 3):
            response = req.get(f"{url}index{item}/")
            soup = BeautifulSoup(response.content, "html.parser")
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
        for item in soup.findAll('a', attrs={"class": "bigusername"}):
            result = [item.get_text(strip=True, separator=" ")]
            name.append(result)
And when I try to run this code:
comments1 = []
for url in urls:
    with requests.Session() as req:
        for item in range(1, 3):
            response = req.get(f"{url}index{item}/")
            soup = BeautifulSoup(response.content, "html.parser")
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
        for item in soup.find_all('div', class_="ism-true"):
            try:
                item.find('div', class_="panel alt2").extract()
            except AttributeError:
                pass
            try:
                item.find('label').extract()
            except AttributeError:
                pass
            result = [item.get_text(strip=True, separator=" ")]
            comments1.append(item.text.strip())
len(comments1) gives only 17; it only extracts the last page of the range, page 2. How can I make sure my code loops through all the pages?
Your code very carefully iterates over both index values in range(1,3), ignoring what you extracted each time. Only after that loop exits does it operate on whatever is left in soup, which is the value from the last iteration.
If you want to process the contents of every soup, you have to indent the second loop so that it becomes an inner loop:
...
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for inner in soup.find_all('div', class_="ism-true"):
                try:
                    inner.find('div', class_="panel alt2").extract()
...
Does that get you moving again?
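To make the indentation point concrete, here is a minimal sketch of the buggy versus fixed control flow. It replaces requests/BeautifulSoup with a plain dict standing in for the pages, so the nesting can be checked offline; page_names and both function names are my own illustrative stand-ins, not part of the original code.

```python
# page_names simulates the "bigusername" links found on each of two pages.
page_names = {1: ["alice", "bob"], 2: ["carol"]}

def collect_names_buggy():
    # Extraction loop sits OUTSIDE the page loop, so only the page
    # fetched last is ever processed.
    name = []
    for index in range(1, 3):
        current_page = page_names[index]   # stands in for soup
    for item in current_page:
        name.append(item)
    return name

def collect_names_fixed():
    # Extraction loop indented INSIDE the page loop, so every page
    # contributes its names.
    name = []
    for index in range(1, 3):
        current_page = page_names[index]
        for item in current_page:
            name.append(item)
    return name

print(collect_names_buggy())  # ['carol'] -- last page only
print(collect_names_fixed())  # ['alice', 'bob', 'carol'] -- all pages
```

The only difference between the two functions is the indentation of the inner for loop, which is exactly the fix described above.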
If you want to loop through all the pages, you can target the "next" link until it is disabled.
Code:
import requests
from bs4 import BeautifulSoup

urls = ['https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/',
        'https://www.f150forum.com/f118/adaptive-cruise-control-sensor-blockage-446041/']

name = []
for url in urls:
    with requests.Session() as req:
        index = 1
        while True:
            # Print the page URL being fetched.
            print(url + "index{}/".format(index))
            response = req.get(url + "index{}/".format(index))
            index = index + 1
            soup = BeautifulSoup(response.content, "html.parser")
            posts = soup.find(id="posts")
            threadtitle = soup.find('h1', attrs={"class": "threadtitle"})
            for item in soup.findAll('a', attrs={"class": "bigusername"}):
                result = [item.get_text(strip=True, separator=" ")]
                name.append(result)
            # Stop once the "next" link is disabled.
            if 'disabled' in soup.select_one('a#mb_pagenext').attrs['class'][-1]:
                break
print(len(name))
On the console you can see that it prints every page URL and the total name count:
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index1/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index2/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index3/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index4/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index5/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index6/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index7/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index8/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index9/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index10/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index11/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index12/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index13/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index14/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index15/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index16/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index17/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index18/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index19/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index20/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index21/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index22/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index23/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index24/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index25/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index26/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index27/
https://www.f150forum.com/f118/2018-adding-adaptive-cruise-control-415450/index28/
https://www.f150forum.com/f118/adaptive-cruise-control-sensor-blockage-446041/index1/
https://www.f150forum.com/f118/adaptive-cruise-control-sensor-blockage-446041/index2/
280
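One caveat with the stop condition above (my addition, not part of the original answer): soup.select_one('a#mb_pagenext') returns None when a page has no "next" link at all, which would raise AttributeError. A defensive version of the check can be sketched like this, using plain dicts to stand in for the parsed tag so it runs without a live page:

```python
def should_stop(next_link):
    """next_link mimics soup.select_one('a#mb_pagenext'):
    either None or a tag-like object with a 'class' attribute list."""
    if next_link is None:
        return True                      # no pagination control at all
    classes = next_link.get('class', [])
    # Match 'disabled' anywhere in the class list, not just the last entry.
    return any('disabled' in c for c in classes)

# Simulated tags: a live "next" link, a disabled one, and a missing one.
print(should_stop({'class': ['smallfont', 'button']}))           # False
print(should_stop({'class': ['smallfont', 'button-disabled']}))  # True
print(should_stop(None))                                         # True
```

Scanning the whole class list (instead of only attrs['class'][-1]) also keeps the check working if the forum markup ever reorders its CSS classes.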