How would I extract username, post, and date posted from discussion board?
How would I continue this web scraping project using bs4 and requests? I'm trying to extract user info from a forum site (myfitnesspal, to be exact: https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1), specifically the username, message, and date posted, and load them into columns in a CSV. So far I have this code, but I'm not sure how to continue:
from bs4 import BeautifulSoup
import csv
import requests
# get page source and create a BS object
print('Reading page...')
page = requests.get('https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1')
src = page.content
soup = BeautifulSoup(src, 'html.parser')
#container = soup.select('#vanilla_discussion_index > div.container')
container = soup.select('#vanilla_discussion_index > div.container > div.row > div.content.column > div.CommentsWrap > div.DataBox.DataBox-Comments > ul')
postdata = soup.select('div.Message')
user = []
date = []
text = []
for post in postdata:
    text.append(BeautifulSoup(str(post), 'html.parser').get_text().encode('utf-8').strip())
print(text) # this stores the text of each comment/post in a list,
# so next I'd want to store this in a csv with columns
# user, date posted, post with this under the post column
# and do the same for user and date
This script will grab all the messages from the page and save them in data.csv:
import csv
import requests
from bs4 import BeautifulSoup
url = 'https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for u, d, m in zip(soup.select('.Username'), soup.select('.DateCreated'), soup.select('.Message')):
    all_data.append([u.text, d.get_text(strip=True), m.get_text(strip=True, separator='\n')])

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)
Screenshot of the resulting data.csv opened in LibreOffice:
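The script above writes data rows only. If you also want a header row in the CSV, a minimal sketch (the placeholder row here stands in for the scraped all_data):

```python
import csv

# Placeholder standing in for the scraped rows; same [user, date, message]
# layout that all_data uses above.
all_data = [['alice', 'January 1', 'Hello!']]

with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['user', 'date', 'post'])  # header row first
    writer.writerows(all_data)                 # then all scraped rows at once
```

writerows saves the explicit inner loop when you already have the rows collected in a list.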
One rule of thumb I like to follow when web scraping is to be as specific as possible without picking up unnecessary information. So, for example, if I want to select a username, I inspect the element containing the information I need:
<a class="Username" href="...">Username</a>
Since I'm trying to collect usernames, it makes the most sense to select by the class "Username":
soup.select("a.Username")
This gives me a list of all the usernames found on the page, which is great. However, if we want to select the data in "packages" (by post, in your example), we need to collect each post individually. To accomplish this, you could do something like:
comments = soup.select("div.comment")
This makes it easier to do the following:
with open('file.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['user', 'date', 'text'])
    for comment in comments:
        username = comment.select_one("a.Username").get_text(strip=True)
        date = comment.select_one("span.BodyDate").get_text(strip=True)
        message = comment.select_one("div.Message").get_text(strip=True)
        writer.writerow([username, date, message])
Doing it this way also helps keep your data in order, even when an element is missing.
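One caveat: select_one returns None when nothing matches, so calling .get_text() on it directly raises AttributeError for a comment that lacks one of the elements. A small sketch of a defensive helper (the inline markup below is a stand-in for one real comment, deliberately missing its date element):

```python
from bs4 import BeautifulSoup

def extract(comment, selector, default=''):
    """Return the stripped text of the first match, or default if the element is absent."""
    el = comment.select_one(selector)
    return el.get_text(strip=True) if el is not None else default

# Stand-in for one scraped comment; note there is no span.BodyDate element.
comment = BeautifulSoup(
    '<li class="Item"><a class="Username">alice</a>'
    '<div class="Message">Hello!</div></li>', 'html.parser')

row = [extract(comment, 'a.Username'),
       extract(comment, 'span.BodyDate'),
       extract(comment, 'div.Message')]
# row == ['alice', '', 'Hello!'] -- the missing date becomes '' instead of crashing
```

The same helper drops into the loop above: writer.writerow([extract(comment, s) for s in ('a.Username', 'span.BodyDate', 'div.Message')]).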
Here you go:
from bs4 import BeautifulSoup
import csv
import requests
page = requests.get('https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1')
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.select('#vanilla_discussion_index > div.container > div.row > div.content.column > div.CommentsWrap > div.DataBox.DataBox-Comments > ul > li')

with open('data.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['user', 'date', 'text'])
    writer.writeheader()
    for comment in container:
        writer.writerow({
            'user': comment.find('a', {'class': 'Username'}).get_text(),
            'date': comment.find('span', {'class': 'BodyDate DateCreated'}).get_text().strip(),
            'text': comment.find('div', {'class': 'Message'}).get_text().strip()
        })
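The URL ends in /p1, which suggests the thread spans several pages. A sketch of generating the per-page URLs to fetch in a loop; the trailing /pN pattern is an assumption based on the question's URL, and last_page would come from inspecting the thread's pager:

```python
def page_urls(discussion_url, last_page):
    """Yield the /p1 .. /pN URLs for a paginated discussion.

    Assumes the trailing /pN pattern seen in the question's URL.
    """
    base = discussion_url.rsplit('/p', 1)[0]
    for n in range(1, last_page + 1):
        yield f'{base}/p{n}'

urls = list(page_urls(
    'https://community.myfitnesspal.com/en/discussion/10703170/what-were-eating/p1', 3))
# urls[0] ends in /p1, urls[2] ends in /p3
```

Each URL can then be fetched with requests.get and run through the same per-comment extraction shown above, appending to one CSV.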