(Web scraping) I've located the proper tags, now how do I extract the text?
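I'm creating my first web scraping application, which collects the titles of the games currently on the "new and trending" tab at https://store.steampowered.com/. Once I figure out how to do this, I want to repeat the process for the prices and export both to separate columns in a spreadsheet.
I have successfully found the tags that contain the text (the titles) I want to extract, but I'm not sure how to pull the titles out once I've located their containers.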
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = 'https://store.steampowered.com/'
uClient = urlopen(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, "html.parser")
containers = page_soup.findAll("div", {"class": "tab_item_name"}, limit=10)
for titles in containers:
    print(titles)
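What I want to do is use a for loop to print, as a vertical list, the names of the 10 games that appear on the Steam home page. What actually happens is that I print the tags that contain the titles: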
<div class="tab_item_name">Destiny 2: Shadowkeep</div>
<div class="tab_item_name">Destiny 2</div>
<div class="tab_item_name">Destiny 2: Forsaken</div>
<div class="tab_item_name">Destiny 2: Shadowkeep Digital Deluxe Edition</div>
<div class="tab_item_name">NGU IDLE</div>
<div class="tab_item_name">Kaede the Eliminator / Eliminator 小枫</div>
<div class="tab_item_name">Spaceland</div>
<div class="tab_item_name">Cube World</div>
<div class="tab_item_name">Aokana - Four Rhythms Across the Blue</div>
<div class="tab_item_name">CODE VEIN</div>
If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.
So just do this:
# Should be `title` IMO, because you are currently handling a single title
for titles in containers:
    print(titles.get_text())
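To make the difference between printing a tag and printing its text concrete, here is a minimal sketch that runs offline. The HTML string is a hypothetical stand-in for the Steam page (it only imitates the tags shown in the question), and the `strip=True` argument is an optional extra that trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the tags printed in the question;
# the second title has extra whitespace on purpose.
html = """
<div class="tab_item_name">Destiny 2: Shadowkeep</div>
<div class="tab_item_name">  Cube World  </div>
<div class="tab_item_name">CODE VEIN</div>
"""

soup = BeautifulSoup(html, "html.parser")
containers = soup.find_all("div", {"class": "tab_item_name"})

# get_text() returns the text beneath each tag; strip=True trims the
# whitespace around each piece of text.
titles = [tag.get_text(strip=True) for tag in containers]

# .text is a property that behaves like get_text() with default arguments,
# so the surrounding whitespace is kept here.
raw_titles = [tag.text for tag in containers]

for title in titles:
    print(title)
```

Collecting the titles in a list instead of printing them straight away should also make the later step of exporting them easier.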
Use titles.text or titles.get_text(), whichever you prefer, to get the title text as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = 'https://store.steampowered.com/'
uClient = urlopen(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, "html.parser")
containers = page_soup.findAll("div", {"class": "tab_item_name"}, limit=11)
for titles in containers:
    print(titles.text)
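One thing worth checking: a search over the whole page returns the first matching tags regardless of which tab they sit in. A possible way to restrict the search to a single tab with BeautifulSoup is to look up the tab's container first. The sketch below assumes the container id `tab_newreleases_content` (the id used in the lxml answer in this thread); the other id and the HTML itself are hypothetical and heavily simplified.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified page with two tabs.
html = """
<div id="tab_topsellers_content">
  <div class="tab_item_name">Destiny 2</div>
</div>
<div id="tab_newreleases_content">
  <div class="tab_item_name">Spaceland</div>
  <div class="tab_item_name">Cube World</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to one tab first; find() returns None if the id is absent.
tab = soup.find(id="tab_newreleases_content")
if tab is None:
    titles = []
else:
    # Search only inside that tab and stop after at most 10 matches.
    titles = [
        div.get_text(strip=True)
        for div in tab.find_all("div", class_="tab_item_name", limit=10)
    ]

print(titles)
```

If the real page uses a different id for the tab, only the `find(id=...)` line should need to change.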
Another very convenient approach is to use lxml:
import requests
import lxml.html

url = 'https://store.steampowered.com/'
# Make the request
response = requests.get(url=url, timeout=5)
# Parse tree
tree = lxml.html.fromstring(response.text)
# Select section corresponding to new games
sub_tree = tree.get_element_by_id('tab_newreleases_content')
# Extract data
games_list = [a.text_content() for a in sub_tree.find_class('tab_item_name')]
# Check
for game in games_list[:11]:
    print(game)
# Destiny 2: Shadowkeep
# Destiny 2
# Destiny 2: Forsaken
# Destiny 2: Shadowkeep Digital Deluxe Edition
# NGU IDLE
# Fernbus Simulator - MAN Lion's Intercity
# Euro Truck Simulator 2 - Pink Ribbon Charity Pack
# Spaceland
# Cube World
# CODE VEIN
# CODE VEIN
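For the follow-up goal in the question (titles and prices in separate spreadsheet columns), one option is the standard library's csv module, since spreadsheet programs can open CSV files. This is only a sketch: `write_games_csv` and the file name `games.csv` are hypothetical, and the price strings are placeholders, because price extraction is not covered in this thread.

```python
import csv

def write_games_csv(path, titles, prices):
    """Write titles and prices as two columns of a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "price"])
        # zip() pairs the n-th title with the n-th price and stops at
        # the end of the shorter list.
        writer.writerows(zip(titles, prices))

titles = ["Spaceland", "Cube World"]
prices = ["<price 1>", "<price 2>"]  # placeholders, not real prices
write_games_csv("games.csv", titles, prices)
```

Because `zip()` silently drops unmatched items, it may be worth checking that both lists have the same length before writing.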