Next pages not working in bs4 and pandas dataframe
I am trying to get output from all next pages using BeautifulSoup, CSS selectors, and a Pandas DataFrame, but I only get the first page as output. Can you help me? Thanks.

Code:
import requests
import bs4
import pandas as pd

base_url = 'http://quotes.toscrape.com'
res = requests.get("http://quotes.toscrape.com/")
soup = bs4.BeautifulSoup(res.text, 'lxml')

all_quote = []
all_author = []
all_tag = []

for element in soup.select('.quote'):
    quote = element.select_one('span.text').text
    all_quote.append(quote)
    author = element.select_one('small.author').text
    all_author.append(author)
    tag = " , ".join(e.text for e in element.select("a.tag"))
    all_tag.append(tag)

next_page = soup.select_one('li.next>a')
if next_page:
    next_page = base_url + next_page['href']
else:
    pass

df = pd.DataFrame({'all_quote': all_quote, 'all_author': all_author, 'all_tag': all_tag})
print(df)
One approach is to create a recursive function that takes the lists you are filling as parameters and updates them on each recursive call while there is a next page:
import requests
import bs4
import pandas as pd

base_url = 'http://quotes.toscrape.com'

def scraper(url, all_quote=[], all_author=[], all_tag=[]):
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    for element in soup.select('.quote'):
        quote = element.select_one('span.text').text
        all_quote.append(quote)
        author = element.select_one('small.author').text
        all_author.append(author)
        tag = " , ".join(e.text for e in element.select("a.tag"))
        all_tag.append(tag)
    # follow the "Next" link if present, otherwise return the results
    next_page = soup.select_one('li.next>a')
    if next_page:
        next_page = base_url + next_page['href']
        return scraper(next_page, all_quote, all_author, all_tag)
    else:
        return [all_quote, all_author, all_tag]
data = scraper('http://quotes.toscrape.com/')
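One caveat worth noting (a general Python limit, not specific to this answer): CPython caps recursion depth, so one recursive call per page is fine for a small site like quotes.toscrape.com (10 pages) but could raise RecursionError on very deep pagination:

```python
import sys

# CPython limits the call stack to roughly 1000 frames by default;
# one recursive scraper() call per page means extremely long
# paginations could hit this limit, while 10 pages is nowhere near it.
print(sys.getrecursionlimit())
```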
Once there is no next button, the function returns the lists with their final contents. You can then unpack all_quote, all_author and all_tag like this:
all_quote, all_author, all_tag = data
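One thing to watch with this approach (a well-known Python pitfall, not specific to this code): the mutable defaults `all_quote=[]` etc. are created once, at function definition time, so a second call to scraper() would keep appending to the lists left over from the first call. A minimal illustration with a hypothetical collect() helper:

```python
def collect(item, acc=[]):
    # acc is created once, when the function is defined, so results
    # accumulate across separate calls -- the same happens with
    # scraper()'s default lists if it is called more than once.
    acc.append(item)
    return acc

first = collect("a")
second = collect("b")
print(second)  # ['a', 'b'] -- not a fresh list
```

Passing the lists explicitly on every call, or defaulting to None and creating fresh lists inside the function, avoids this.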
Try:
import bs4
import requests
import pandas as pd

base_url = "http://quotes.toscrape.com"

all_quote = []
all_author = []
all_tag = []

url = base_url
while True:
    print(url)
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.content, "lxml")
    for element in soup.select(".quote"):
        quote = element.select_one("span.text").text
        all_quote.append(quote)
        author = element.select_one("small.author").text
        all_author.append(author)
        tag = " , ".join(e.text for e in element.select("a.tag"))
        all_tag.append(tag)
    # stop when there is no "Next" link on the page
    next_page = soup.select_one("li.next>a")
    if next_page:
        url = base_url + next_page["href"]
    else:
        break

df = pd.DataFrame(
    {"all_quote": all_quote, "all_author": all_author, "all_tag": all_tag}
)
print(df)

df.to_csv("data.csv", sep=";", index=None)
This saves data.csv (screenshot from LibreOffice not included here).
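A side note on building the next-page URL: plain concatenation with base_url works here because this site's next links are root-relative (e.g. /page/2/), but urllib.parse.urljoin from the standard library handles relative and absolute hrefs alike; a small sketch:

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/page/1/"
# Root-relative hrefs replace the path; relative hrefs resolve
# against the current page; absolute hrefs are returned unchanged.
print(urljoin(base, "/page/2/"))   # http://quotes.toscrape.com/page/2/
print(urljoin(base, "tag/love/"))  # http://quotes.toscrape.com/page/1/tag/love/
```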