Scrape URL loop with BeautifulSoup
I want to scrape information from several pages of the same site, societe.com, and I have a few questions.
First, here is the code I managed to put together; I admit I'm a bit of a beginner.
I only put in 2 URLs to check that the loop works and pulls some information; once everything works I can add more.
urls = ["https://www.societe.com/societe/decathlon-france-500569405.html","https://www.societe.com/societe/go-sport-312193899.html"]
for url in urls:
response = requests.get(url, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
soup = BeautifulSoup(response.text, "html.parser")
numrcs = soup.find("td", class_="numdisplay")
nomcommercial = soup.find("td", class_="break-word")
print(nomcommercial.text)
print(numrcs.text.strip())
numsiret = soup.select('div[id^=siret_number]')
for div in numsiret:
print(div.text.strip())
formejuri = soup.select('div[id^=catjur-histo-description]')
for div in formejuri:
print(div.text.strip())
infosend = {
'numrcs': numrcs,
'nomcommercial':nomcommercial,
'numsiret':numsiret,
'formejuri':formejuri
}
tableau.append(infosend)
print(tableau)
my_infos = ['Numéro RCS', 'Numéro Siret ','Forme Juridique']
my_columns = [
np.tile(np.array(my_infos), len(nomcommercial))
]
df = pd.DataFrame( tableau,index=nomcommercial, columns=my_columns)
df
When I run the loop I get the correct information, for example:
DECATHLON FRANCE
Lille Metropole B 500569405
50056940503239
SASU Société par actions simplifiée à associé unique
But when I try to put all this information into a single table, I really can't: only the last company shows up and the data doesn't make sense. I tried to follow tutorials but with no success.
I'd be really glad if you could help me.
To get the data about each company you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://www.societe.com/societe/decathlon-france-500569405.html",
    "https://www.societe.com/societe/go-sport-312193899.html",
]

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}

data = []
for url in urls:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )

    # extract the text of each field; ':-soup-contains()' is a soupsieve
    # extension available in recent BeautifulSoup versions
    title = soup.select_one("#identite_deno").get_text(strip=True)
    rcs = soup.select_one('td:-soup-contains("Numéro RCS") + td').get_text(
        strip=True
    )
    siret_number = soup.select_one("#siret_number").get_text(strip=True)
    form = soup.select_one("#catjur-histo-description").get_text(strip=True)

    # one row per company
    data.append([title, url, rcs, siret_number, form])

df = pd.DataFrame(
    data,
    columns=["Title", "URL", "Numéro RCS", "Numéro Siret", "Forme Juridique"],
)
print(df.to_markdown())
Prints:
|    | Title                                                   | URL                                                              | Numéro RCS                  | Numéro Siret   | Forme Juridique                                      |
|---:|:--------------------------------------------------------|:-----------------------------------------------------------------|:----------------------------|:---------------|:-----------------------------------------------------|
|  0 | DECATHLON FRANCE (DECATHLON DIRECTION GENERALE FRANCE)  | https://www.societe.com/societe/decathlon-france-500569405.html  | Lille Metropole B 500569405 | 50056940503239 | SASU Société par actions simplifiée à associé unique |
|  1 | GO SPORT                                                 | https://www.societe.com/societe/go-sport-312193899.html          | Grenoble B 312193899        | 31219389900191 | Société par actions simplifiée                       |
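
As for why your original version only shows the last company: the dictionary you append stores BeautifulSoup Tag objects rather than their text, and the final DataFrame is built with index=nomcommercial, which at that point holds only a single Tag from the last page fetched. A minimal sketch that reuses your own selectors but stores one dict of plain strings per company:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://www.societe.com/societe/decathlon-france-500569405.html",
    "https://www.societe.com/societe/go-sport-312193899.html",
]
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
}

tableau = []
for url in urls:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
    # .get_text(strip=True) converts each Tag to a plain string before storing it
    tableau.append({
        "nomcommercial": soup.find("td", class_="break-word").get_text(strip=True),
        "numrcs": soup.find("td", class_="numdisplay").get_text(strip=True),
        "numsiret": soup.select_one("#siret_number").get_text(strip=True),
        "formejuri": soup.select_one("#catjur-histo-description").get_text(strip=True),
    })

# pandas builds the columns from the dict keys, one row per company
df = pd.DataFrame(tableau).set_index("nomcommercial")
print(df)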
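
If you want to keep the result instead of just printing it, pandas can write the table straight to a file, for example (the filename here is just a placeholder):

df.to_csv("societes.csv", index=False)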