Extract data from &quote under title tag using BeautifulSoup?

Question

I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python. Basically, the whole title tag is

 <title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>

I want to extract the data between the &quot; entities, that is, only this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3). I tried

import urllib
import urllib.request

from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List=list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h,"html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
        print(List[0])
except urllib.error.HTTPError as err:
    pass

I also tried

for i in data.find_all("title.&quot"):

for i in data.find_all("title>&quot"):

for i in data.find_all("&quot"):

and

for i in data.find_all("quot"):

but none of them work.

Answer 1

Once you have parsed the HTML:

data = BeautifulSoup(h,"html.parser")

look up the title like this:

title = data.find("title").string  # this is without <title> tag

Now find the two quotation marks (") in the string. There are many ways to do that. I would use a regular expression:

import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))

Searching for &quot; or any other &NAME; sequence will never find anything, because BeautifulSoup converts them into the actual characters they represent.
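To illustrate the point about entities, here is a minimal sketch. The HTML string is a made-up stand-in for the real page, and it assumes the standard "html.parser" backend; after parsing, the title contains plain quotation marks and no &quot; sequence.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
html_doc = '<title>Someone on Twitter: &quot;Hello, world&quot;</title>'

soup = BeautifulSoup(html_doc, "html.parser")
title = soup.title.string

# The entity has already been decoded into a literal double quote.
has_entity = "&quot;" in title
has_quote = '"' in title

print(title)
print(has_entity, has_quote)
```

This is why find_all("&quot") and similar calls return nothing: find_all looks for tags, and the entity is just text inside the title tag.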

Edit:

A regex that does not capture the quotation marks would be:

re.search(r'(?<=").*(?=")', title)
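As a rough comparison of the two patterns, the sketch below runs both against a made-up title whose tweet text itself contains quotation marks. Because .* is greedy, both patterns span from the first quotation mark to the last one; the lookaround version simply leaves the outer pair out of the match.

```python
import re

# Hypothetical title; the tweet text contains its own quotation marks.
title = 'Someone on Twitter: "He said "hi" and left"'

# Matches from the first quote to the last quote, quotes included.
with_quotes = re.search(r'".*"', title)

# Lookbehind/lookahead require the quotes but do not consume them.
without_quotes = re.search(r'(?<=").*(?=")', title)

if with_quotes and without_quotes:
    print(with_quotes.group(0))
    print(without_quotes.group(0))
```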

Answer 2

Here is a simple, complete example that uses a regular expression to extract the text within the quotes:

import urllib.request
import re
from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"

r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))

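What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text within the quotes.

We tell the regex to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then capture the text between it and the closing quote (the second \").

Then we print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).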
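One thing worth knowing about this pattern: the leading ^.* is greedy, so when the title holds more than two quotation marks, the group ends up capturing only the text between the last two. The sketch below shows that on a made-up string, next to a hypothetical helper (extract_quoted is not part of any library) that drops the prefix and returns None instead of raising when nothing matches.

```python
import re

def extract_quoted(title):
    """Return the text between the first and last double quote, or None."""
    match = re.search(r'"(.*)"', title)
    return match.group(1) if match else None

# Hypothetical titles for illustration.
simple = 'Someone on Twitter: "Hello, world"'
nested = 'X: "a "b" c"'

simple_result = extract_quoted(simple)
nested_result = extract_quoted(nested)

# The pattern with the greedy ^.* prefix keeps only the last quoted stretch.
greedy_prefix_result = re.match(r'^.*"(.*)"', nested).group(1)

missing_result = extract_quoted("no quotes here")

print(simple_result)
print(nested_result)
print(repr(greedy_prefix_result))
print(missing_result)
```

For a title with exactly one pair of quotation marks, as in the question, both patterns give the same text.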

There is more on regular expression matching in the Python docs - https://docs.python.org/3/library/re.html#match-objects

Answer 3

Just split the text at the colon:

In [1]:  h = """<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)&quot;</title>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup  = BeautifulSoup(h, "lxml")

In [4]: print(soup.title.text.split(": ", 1)[1])
 "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

Actually, looking at the page, you don't need to split at all; the text is in the p tag inside div.js-tweet-text-container:

In [8]: import requests

In [9]: from bs4 import BeautifulSoup


In [10]: soup  = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")


In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)

In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"

So you can use either approach to get the same result.
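Note that the split result still carries the surrounding quotation marks, as the output above shows. If you want it to match the p tag text exactly, one option is to strip them afterwards. This is only a sketch on a made-up title; it assumes the prefix before the first ": " is the account name, and strip('"') would also remove a quotation mark that the tweet itself starts or ends with.

```python
# Hypothetical title; the tweet text contains a colon of its own.
title = 'Someone on Twitter: "Note: colons inside the tweet survive"'

# maxsplit=1 splits only at the first ": ", so later colons are kept.
after_prefix = title.split(": ", 1)[1]

# Remove the quotation marks that wrap the tweet text.
tweet_text = after_prefix.strip('"')

print(after_prefix)
print(tweet_text)
```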


python

beautifulsoup

css-selectors

html-parser