Extract data from &quot; under title tag using BeautifulSoup?
I want to extract the title of a link after fetching its HTML with the BeautifulSoup library in Python.
Basically, the whole title tag is
<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>
I want to extract only the data between the &quot; entities, i.e. just this: Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
I tried
import urllib
import urllib.request
from bs4 import BeautifulSoup
link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
try:
    List=list()
    r = urllib.request.Request(link, headers={'User-Agent': 'Chrome/51.0.2704.103'})
    h = urllib.request.urlopen(r).read()
    data = BeautifulSoup(h,"html.parser")
    for i in data.find_all("title"):
        List.append(i.text)
    print(List[0])
except urllib.error.HTTPError as err:
    pass
I also tried
for i in data.find_all("title.&quot;"):
for i in data.find_all("title>&quot;"):
for i in data.find_all("&quot;"):
and
for i in data.find_all("quot"):
but none of them work.
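Once you have parsed the HTML:
data = BeautifulSoup(h,"html.parser")
find the title like this:
title = data.find("title").string # this is without <title> tag
Now find the two quotation marks (") in the string. There are many ways to do that; I would use a regular expression: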
import re
match = re.search(r'".*"', title)
if match:
    print(match.group(0))
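You never search for &quot; or any other &NAME; sequence, because BeautifulSoup converts them into the actual characters they represent.
EDIT:
A regex that does not capture the quotes would be:
re.search(r'(?<=").*(?=")', title)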
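To see the entity conversion described above without hitting the network, here is a minimal sketch that parses a hard-coded copy of the title tag and applies both regexes from this answer. The `html` string is an assumption standing in for the fetched page, with the quotes written as &quot; entities as in the question.

```python
import re
from bs4 import BeautifulSoup

# Hard-coded stand-in for the fetched page (assumption: the raw HTML
# carries the quotes as &quot; entities, as described in the question).
html = ('<title>Imaan Z Hazir on Twitter: &quot;Guantanamo and Abu Ghraib, '
        'financial and military support to dictators in Latin America '
        'during the cold war. REALLY, AMERICA? (3)&quot;</title>')

soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

# After parsing, the entity is gone: the string holds a plain double quote.
has_entity = "&quot;" in title

with_quotes = re.search(r'".*"', title)             # keeps the quotes
without_quotes = re.search(r'(?<=").*(?=")', title)  # drops the quotes

print(has_entity)
if with_quotes and without_quotes:
    print(with_quotes.group(0))
    print(without_quotes.group(0))
```

This is why searching for "&quot;" in the parsed text finds nothing: by the time you read `title`, only the `"` character is left to match against.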
Here is a simple, complete example that uses a regular expression to extract the text inside the quotes:
import re
import urllib.request  # a bare "import urllib" does not load the request submodule
from bs4 import BeautifulSoup

link = "https://twitter.com/ImaanZHazir/status/778560899061780481"
r = urllib.request.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
title = soup.title.string
# capture whatever sits between the two double quotes
quote = re.match(r'^.*\"(.*)\"', title)
print(quote.group(1))
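What happens here is that, after fetching the page source and finding the title, we apply a regular expression to the title to extract the text inside the quotes.
We tell the regex to match any number of characters from the start of the string (^.*) up to the opening quote (\"), and then to capture the text between it and the closing quote (the second \").
We then print the captured text by telling Python to print the first captured group (the part between the parentheses in the regex).
There is more on regular expression matching in the Python docs: https://docs.python.org/3/library/re.html#match-objects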
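The group mechanics described above can be tried on a plain string, with no network access. This is only a small sketch: the `title` literal is copied from the question, and the `None` check is an addition, since `re.match` returns `None` when the string contains no quoted text.

```python
import re

title = ('Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial '
         'and military support to dictators in Latin America during the '
         'cold war. REALLY, AMERICA? (3)"')

quote = re.match(r'^.*\"(.*)\"', title)

if quote is not None:
    whole = quote.group(0)     # everything the pattern matched
    captured = quote.group(1)  # only the part inside the parentheses
    print(captured)
else:
    print("no quoted text found")

# With no quotes in the input there is nothing to match.
missing = re.match(r'^.*\"(.*)\"', 'a title without quotes')
print(missing)
```

Calling `.group(1)` directly on the result would raise an `AttributeError` in the no-match case, which is why the check comes first.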
Just split the text at the colon:
In [1]: h = """<title>Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"</title>"""
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(h, "lxml")
In [4]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
Actually, looking at the page, you don't need to split at all: the text is in the p tag inside div.js-tweet-text-container:
In [8]: import requests
In [9]: from bs4 import BeautifulSoup
In [10]: soup = BeautifulSoup(requests.get("https://twitter.com/ImaanZHazir/status/778560899061780481").content, "lxml")
In [11]: print(soup.select_one("div.js-tweet-text-container p").text)
Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)
In [12]: print(soup.title.text.split(": ", 1)[1])
"Guantanamo and Abu Ghraib, financial and military support to dictators in Latin America during the cold war. REALLY, AMERICA? (3)"
So you can use either approach to get the tweet text; note that the split result still includes the surrounding quotes.
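If you go with the title-based route, a small helper can make the split tolerant of titles that lack the ": " separator and can drop the surrounding quotes at the same time. The sketch below is only an illustration: the function name `tweet_text_from_title` is hypothetical, and it assumes the title keeps the `Name on Twitter: "text"` shape shown above.

```python
def tweet_text_from_title(title):
    """Return the part after the first ': ', without surrounding quotes.

    Falls back to the whole title when there is no ': ' separator.
    """
    head, sep, tail = title.partition(": ")
    text = tail if sep else head
    return text.strip('"')


title = ('Imaan Z Hazir on Twitter: "Guantanamo and Abu Ghraib, financial '
         'and military support to dictators in Latin America during the '
         'cold war. REALLY, AMERICA? (3)"')

result = tweet_text_from_title(title)
fallback = tweet_text_from_title("no separator here")
print(result)
print(fallback)
```

Using `str.partition` instead of `split(": ", 1)[1]` avoids an `IndexError` when the separator is missing.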