Getting texts from urls is returning empty dataframe
I am trying to get all the paragraphs from several websites using a for loop, but I end up with an empty dataframe.
The logic of the program is:
urls = []
texts = []
for r in my_list:
    try:
        # Get text
        url = urllib.urlopen(r)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Find all of the text between paragraph tags and strip out the html
        page = soup.find('p').getText()
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls" : urls, "Texts:" : texts})
Examples of urls (my_list) might be: https://www.ford.com.au/performance/mustang/ , https://soperth.com.au/perths-best-fish-and-chips-46154, https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html, https://www.bbc.co.uk/programmes/b07d2wy4
How can I correctly store the links and the texts from those specific pages (so not the whole website!)?
Expected output:
Urls Texts
https://www.ford.com.au/performance/mustang/ Nothing else offers the unique combination of classic style and exhilarating performance quite like the Ford Mustang. Whether it’s the Fastback or Convertible, 5.0L V8 or High Performance 2.3L, the Mustang has a heritage few other cars can match.
https://soperth.com.au/perths-best-fish-and-chips-46154
https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html
https://www.bbc.co.uk/programmes/b07d2wy4
In Texts, for each url I should include the paragraphs from that page (i.e., all <p> elements).
Even dummy code (so not exactly my code) would help me understand where my error is. I think my current error may be at this step: url = urllib.urlopen(r), since I am getting no texts.
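One detail worth checking separately: soup.find('p') returns only the first <p> tag, so even when the request succeeds, only one paragraph is captured. A minimal sketch of the difference between find and find_all, using a made-up HTML string and the built-in html.parser (so lxml is not needed for the demo):

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph page for illustration
html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
first = soup.find("p").get_text()

# find_all() returns every matching tag
all_text = " ".join(p.get_text() for p in soup.find_all("p"))

print(first)     # First paragraph.
print(all_text)  # First paragraph. Second paragraph.
```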
I tried the following code (Python 3, hence urllib.request) and it works. The user agent was added because urlopen was hanging.
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

urls = []
texts = []
my_list = ["https://www.ford.com.au/performance/mustang/",
           "https://soperth.com.au/perths-best-fish-and-chips-46154",
           "https://www.tripadvisor.com.au/Restaurants-g255103-zfd10901-Perth_Greater_Perth_Western_Australia-Fish_and_Chips.html",
           "https://www.bbc.co.uk/programmes/b07d2wy4"]

for r in my_list:
    try:
        # Build the request with a browser User-Agent, since some sites
        # hang on or block the default urllib user agent
        req = urllib.request.Request(
            r,
            data=None,
            headers={
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        url = urllib.request.urlopen(req)
        content = url.read()
        soup = BeautifulSoup(content, 'lxml')
        # Concatenate the text of every <p> tag, not just the first one
        page = ''
        for para in soup.find_all('p'):
            page += para.get_text()
        print(page)
        texts.append(page)
        urls.append(r)
    except Exception as e:
        print(e)
        continue

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df)
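Because urls.append(r) and texts.append(page) run together inside the try block, the two lists always stay the same length, which is what pd.DataFrame needs to build aligned columns. A toy example with hypothetical values, showing that even an empty text keeps the rows aligned:

```python
import pandas as pd

# Made-up results: one page yielded text, the other yielded nothing
urls = ["https://example.com/a", "https://example.com/b"]
texts = ["Some paragraph text", ""]  # an empty string still occupies a row

df = pd.DataFrame({"Urls": urls, "Texts": texts})
print(df.shape)  # (2, 2)
```

If only one of the two appends ran on a failure, the lists would diverge in length and the DataFrame constructor would raise a ValueError.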