How to write data to new columns in csv when webscraping?
I am scraping the Billboard Hot R&B/Hip-Hop chart. I am able to grab all of the data, but when I write it to a csv the formatting comes out completely wrong.
The Last Week Number, Peak Position and Weeks On Chart data all end up under the first 3 columns of my csv instead of under their corresponding headers.
Here is my current code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'
# Opens web connection and grabs the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# HTML parsing
page_soup = soup(page_html, "html.parser")
# Grabs song title, artist and picture
mainContainer = page_soup.findAll("div", {"class":"chart-row__main-display"})
# CSV filename creation
filename = "Billboard_Hip_Hop_Charts.csv"
f = open(filename, "w")
# Creating Headers
headers = "Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position, Weeks On Chart\n"
f.write(headers)
# Get Billboard Number, Artist Name and Song Title
for container in mainContainer:
    # Gets billboard number
    billboard_number = container.div.span.text
    # Gets artist name
    artist_name_a_tag = container.findAll("", {"class":"chart-row__artist"})
    artist_name = artist_name_a_tag[0].text.strip()
    # Gets song title
    song_title = container.h2.text
    print("Billboard Number: " + billboard_number)
    print("Artist Name: " + artist_name)
    print("Song Title: " + song_title)
    f.write(billboard_number + "," + artist_name + "," + song_title + "\n")
# Grabs side container from main container
secondaryContainer = page_soup.findAll("div", {"class":"chart-row__secondary"})
# Get Last Week Number, Peak Position and Weeks On Chart
for container in secondaryContainer:
    # Gets last week number
    last_week_number_tag = container.findAll("", {"class":"chart-row__value"})
    last_week_number = last_week_number_tag[0].text
    # Gets peak position
    peak_position_tag = container.findAll("", {"class":"chart-row__value"})
    peak_position = peak_position_tag[1].text
    # Gets weeks on chart
    weeks_on_chart_tag = container.findAll("", {"class":"chart-row__value"})
    weeks_on_chart = weeks_on_chart_tag[2].text
    print("Last Week Number: " + last_week_number)
    print("Peak Position: " + peak_position)
    print("Weeks On Chart: " + weeks_on_chart)
    f.write(last_week_number + "," + peak_position + "," + weeks_on_chart + "\n")
f.close()
This is what my csv looks like under the headers Billboard Number, Artist Name, Song Title, Last Week Number, Peak Position and Weeks On Chart:
1 Drake Nice For What
2 Post Malone Featuring Ty Dolla $ign Psycho
3 Drake God's Plan
4 Post Malone Better Now
5 Post Malone Featuring 21 Savage Rockstar
6 BlocBoy JB Featuring Drake Look Alive
7 Post Malone Paranoid
8 Lil Dicky Featuring Chris Brown Freaky Friday
9 Post Malone Rich & Sad
10 Post Malone Featuring Swae Lee Spoil My Night
11 Post Malone Featuring Nicki Minaj Ball For Me
12 Migos Featuring Drake Walk It Talk It
13 Post Malone Featuring G-Eazy & YG Same Bitches
14 Cardi B| Bad Bunny & J Balvin I Like It
15 Post Malone Zack And Codeine
16 Post Malone Over Now
17 Cardi B Be Careful
18 Post Malone Takin' Shots
19 The Weeknd & Kendrick Lamar Pray For Me
20 Rich The Kid Plug Walk
21 The Weeknd Call Out My Name
22 Bruno Mars & Cardi B Finesse
23 Post Malone Candy Paint
24 Ella Mai Boo'd Up
25 Rae Sremmurd & Juicy J Powerglide
26 Post Malone 92 Explorer
27 J. Cole ATM
28 J. Cole KOD
29 Post Malone Otherside
30 Post Malone Blame It On Me
31 J. Cole Kevin's Heart
32 Kendrick Lamar & SZA All The Stars
33 Nicki Minaj Chun-Li
34 Lil Pump Esskeetit
35 Migos Stir Fry
36 Famous Dex Japan
37 Post Malone Sugar Wraith
38 Cardi B Featuring Migos Drip
39 XXXTENTACION Sad!
40 Jay Rock| Kendrick Lamar| Future & James Blake King's Dead
41 Rich The Kid Featuring Kendrick Lamar New Freezer
42 Logic & Marshmello Everyday
43 J. Cole Motiv8
44 YoungBoy Never Broke Again Outside Today
45 Post Malone Jonestown (Interlude)
46 Cardi B Featuring 21 Savage Bartier Cardi
47 YoungBoy Never Broke Again Overdose
48 J. Cole 1985 (Intro To The Fall Off)
49 J. Cole Photograph
50 Khalid| Ty Dolla $ign & 6LACK OTW
1 1 2
2 1 6
3 1 17
4 2 12
5 3 14
10 6 8
...
Any help getting the data into the correct columns would be appreciated!
Your code is unnecessarily cluttered and hard to read. You don't need to create two containers at all; a single container is enough to grab the data you need. Try it the following way and you will find the csv populated accordingly:
import requests, csv
from bs4 import BeautifulSoup

url = 'https://www.billboard.com/charts/r-b-hip-hop-songs'

with open('Billboard_Hip_Hop_Charts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Billboard Number', 'Artist Name', 'Song Title', 'Last Week Number', 'Peak Position', 'Weeks On Chart'])
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    # One pass over each chart row grabs all six fields, so every
    # writerow call emits one complete row
    for container in soup.find_all("article", class_="chart-row"):
        billboard_number = container.find(class_="chart-row__current-week").text
        artist_name = container.find(class_="chart-row__artist").text.strip()
        song_title = container.find(class_="chart-row__song").text
        last_week_number_tag = container.find(class_="chart-row__value")
        last_week_number = last_week_number_tag.text
        peak_position_tag = last_week_number_tag.find_parent().find_next_sibling().find(class_="chart-row__value")
        peak_position = peak_position_tag.text
        weeks_on_chart = peak_position_tag.find_parent().find_next_sibling().find(class_="chart-row__value").text
        print(billboard_number, artist_name, song_title, last_week_number, peak_position, weeks_on_chart)
        writer.writerow([billboard_number, artist_name, song_title, last_week_number, peak_position, weeks_on_chart])
The output looks like this:
1 Childish Gambino This Is America 1 1 2
2 Drake Nice For What 2 1 6
3 Drake God's Plan 3 1 17
4 Post Malone Featuring Ty Dolla $ign Psycho 4 2 12
5 BlocBoy JB Featuring Drake Look Alive 5 3 14
6 Ella Mai Boo'd Up 10 6 8
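For context on why the original layout breaks: the first loop writes 50 three-column rows, then the second loop appends 50 more three-column rows below them, so the stats never line up with their headers. Each chart entry has to be written once, with all six values in one `writerow` call. If you did want to keep two separate container lists, you could pair them up with `zip`. A minimal sketch, using hypothetical sample tuples in place of the scraped containers:

```python
import csv
import io

# Hypothetical stand-ins for the values pulled from the two container lists;
# real code would extract these with BeautifulSoup as above.
main_rows = [("1", "Drake", "Nice For What"),
             ("2", "Post Malone Featuring Ty Dolla $ign", "Psycho")]
secondary_rows = [("2", "1", "6"), ("4", "2", "12")]

out = io.StringIO()
writer = csv.writer(out)  # csv.writer also quotes any field containing a comma
writer.writerow(["Billboard Number", "Artist Name", "Song Title",
                 "Last Week Number", "Peak Position", "Weeks On Chart"])

# zip pairs entry i of each list, so one writerow emits a full six-column row
for main, secondary in zip(main_rows, secondary_rows):
    writer.writerow([*main, *secondary])

print(out.getvalue())
```

This relies on both lists being the same length and in the same page order, which is another reason the single-container approach in the answer is the more robust fix.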