Getting texts from urls is returning empty dataframe
I am trying to get all the paragraphs from several websites using a for loop, but I end up with an empty dataframe.
The logic of the program is:
urls = []
texts = []
for r in my_list:
    try:
        # Get text
        url = urllib.urlopen(r)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls" : urls, "Texts:" : texts})
Examples of urls (my_list) might be: https://www.ford.com.au/performance/mustang/ , https://soperth.com.au/perths-best-fish-and-chips-46154, https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html, https://www.bbc.co.uk/programmes/b07d2wy4
How can I correctly store the links and the texts from those specific pages (so not the whole website!)?
Expected output:
Urls Texts
https://www.ford.com.au/performance/mustang/ Nothing else offers the unique combination of classic style and exhilarating performance quite like the Ford Mustang. Whether it’s the Fastback or Convertible, 5.0L V8 or High Performance 2.3L, the Mustang has a heritage few other cars can match.
https://soperth.com.au/perths-best-fish-and-chips-46154
https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html
https://www.bbc.co.uk/programmes/b07d2wy4
In Texts, for each url I should include the paragraphs from that page (i.e., all <p> elements).
Even dummy code (so not exactly my code) would help me understand where my error is. I think my current error may be at this step: url = urllib.urlopen(r), since I am getting no texts.
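One detail worth checking separately: soup.find('p') returns only the first <p> tag, so even when the request succeeds, only one paragraph is captured. A minimal sketch of the difference between find and find_all, using a made-up HTML string and the built-in html.parser (so lxml is not needed for the demo):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph page for illustration
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
first = soup.find("p").get_text()

# find_all() returns every matching tag
all_text = " ".join(p.get_text() for p in soup.find_all("p"))

print(first)     # First paragraph.
print(all_text)  # First paragraph. Second paragraph.
```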
I tried the following code (Python 3, hence urllib.request) and it works. The user agent was added because urlopen was hanging.
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

urls = []
texts = []
my_list = ["https://www.ford.com.au/performance/mustang/",
           "https://soperth.com.au/perths-best-fish-and-chips-46154",
           "https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html",
           "https://www.bbc.co.uk/programmes/b07d2wy4"]

for r in my_list:
    try:
        # Build the request with a browser User-Agent, since some sites
        # hang on or block the default urllib user agent
        req = urllib.request.Request(
            r,
            data=None,
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        url = urllib.request.urlopen(req)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Concatenate the text of every <p> tag, not just the first one
        page = ''
        for para in soup.find_all('p'):
            page += para.get_text()
        print(page)
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df)
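Because urls.append(r) and texts.append(page) run together inside the try block, the two lists always stay the same length, which is what pd.DataFrame needs to build aligned columns. A toy example with hypothetical values, showing that even an empty text keeps the rows aligned:

```python
import pandas as pd

# Made-up results: one page yielded text, the other yielded nothing
urls = ["https://example.com/a", "https://example.com/b"]
texts = ["Some paragraph text", ""]  # an empty string still occupies a row

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df.shape)  # (2, 2)
```

If only one of the two appends ran on a failure, the lists would diverge in length and the DataFrame constructor would raise a ValueError.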