Python: extract title from URL

Question

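I am using the function below to try to extract titles from a list of web-scraped URLs.

I did look at some SO answers, but noticed that many advise against regex solutions. I would like to fix and build on my existing solution, but I am happy to receive suggestions for other elegant solutions.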

Example url 1: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg

Example url 2: https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

Code (function) that tries to extract the title from the URL:

def titleextract(url):
    #return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
    cleanedtitle1=url[58:]
    title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
    return title

The above has the following effect on the URLs:

Url 1: Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

Url 2: Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg

However, the desired output is:

Url 1: Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son

Url 2: Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh2C_the_Wife_of_the_Artist

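The problem I am struggling with is removing everything from this point on: _-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg in each individual case, and then removing unwanted characters where they exist, such as the % in url 2.

Ideally, I would also like to get rid of the underscores in the title.

Any suggestions that use my existing code, with a proper step-by-step explanation, would be appreciated.

My attempt at removing the beginning worked: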

cleanedtitle1=url[58:]

But I have tried various ways of stripping the characters and removing the end, and none of them worked:

title= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")

Following one suggestion, I also tried:

return unquote(url[58:url.rindex("/",58)-8].replace('_',''))

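...but this did not remove the unwanted text correctly. It only cuts off the last 8 characters, and because the length of the unwanted part varies, that does not work.

I also tried this, again to remove the underscores, with no luck: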

    cleanedtitle1=url[58:]
    cleanedtitle2= cleanedtitle1.strip("-_Google_Art_Project.jpg/220px-")
    title = cleanedtitle2.strip("_")
    return title

My imports so far are:

from flask import Flask, render_template,url_for #importing flask class
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from urllib.parse import unquote

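For learning purposes, I am happy to accept answers that take other approaches, but ideally I would also like to finish what I have started.

For answers that use only BeautifulSoup, here is the complete code (it may also be useful as a reference):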

from flask import Flask, render_template,url_for #importing flask class
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
from urllib.parse import unquote

app = Flask(__name__) #setting app variable to instance of flask class

@app.route('/') #this is what we type into our browser to go to pages. we create these using routes
@app.route('/home')
def home():
    images=imagescrape()
    titles=(titleextract(src) for src in images)
    images_titles=zip(images,titles)
    return render_template('home.html',images=images,images_titles=images_titles)   

def titleextract(url):
    pos1 = url.rindex("/")
    pos2 = url[:pos1].rindex("/")
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title


def imagescrape():
    result_images=[]
    html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {'src':re.compile('.jpg')})
    for image in images:
        result_images.append("https:"+image['src']+'\n') #concatenation!
    return result_images

Answer 1

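There are a few ways to do this:

If you only want to use Python's built-in string functions, you can first split everything on / and then strip off the parts that are common to all of the URLs.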

def titleextract(url):
    cleanedtitle1 = url.split("/")[-1]
    return cleanedtitle1[6:-4].replace('_',' ')
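Note that the slice above removes only the "220px-" prefix and the ".jpg" extension, so "_-_Google_Art_Project" stays in the result. A minimal sketch of a variant that also drops that suffix is shown below. It assumes the last path segment always starts with a size prefix ending in "-" (such as "220px-") and that the suffix is exactly "_-_Google_Art_Project.jpg"; the `.strip()` call is there only because the scraping code in the question appends a newline to each URL.

```python
def titleextract(url):
    # Last path segment, e.g. "220px-Rembrandt_..._-_Google_Art_Project.jpg".
    filename = url.strip().split("/")[-1]
    # Drop the size prefix (assumed to end at the first "-").
    name = filename.split("-", 1)[1]
    # Remove the common suffix only if it is actually present.
    suffix = "_-_Google_Art_Project.jpg"
    if name.endswith(suffix):
        name = name[:-len(suffix)]
    return name.replace("_", " ")


url1 = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
        "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
        "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")
print(titleextract(url1))  # Rembrandt van Rijn - Self-Portrait
```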

Since you are already importing bs4, you could also do it this way:

soup = BeautifulSoup(htmlString, 'html.parser')
title = soup.title.text

Answer 2

This should work; let me know if you run into any problems.

def titleextract(url):
    title = url[58:]
    marker = "_-_Google_Art_Project.jpg"
    if marker in title:
        title = title[:title.index(marker)] # Keep only the text before the marker.

    disallowed_chars = "%" # Edit which chars should go.
    # Python will look at each character in turn. If it is not in the disallowed chars string, 
    # then it will be left. "".join() joins together all chars still allowed. 
    title = "".join(c for c in title if c not in disallowed_chars)

    title = title.replace("_"," ") # Change underscores to spaces.
    return title

Answer 3

Starting from what you have:

cleanedtitle1=url[58:]

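This works, but a hard-coded number is not very robust, so let's instead start from the character after the second-to-last "/".

You could do this with a regular expression, but more simply, it could look like this: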

pos1 = url.rindex("/")  # index of last /
pos2 = url[:pos1].rindex("/")  # index of second-to-last /
cleanedtitle1 = url[pos2 + 1:]

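In fact, though, you are only interested in the part between the second-to-last and the last /, so let's change this to use the pos1 we already found as the end of the slice: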

pos1 = url.rindex("/")  # index of last /
pos2 = url[:pos1].rindex("/")  # index of second-to-last /
cleanedtitle1 = url[pos2 + 1: pos1]

Here, this gives the following value for cleanedtitle1:

'Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg'
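As an aside, the same second-to-last segment can also be obtained with str.rsplit, which splits from the right a limited number of times. This is only an alternative sketch to the two rindex calls, using example url 1 from the question; the variable name `segment` is just for illustration.

```python
url = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
       "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
       "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")

# Split at most twice, starting from the right:
# [everything before, second-to-last segment, last segment]
segment = url.rsplit("/", 2)[1]
print(segment)  # Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg
```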

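Now on to your strip. This will not do quite what you want: strip treats the string you give it as a set of individual characters and removes any of those characters from the start and the end of the string, stopping at the first character that is not in the set. It never looks for the string as a whole.

So instead, let's use replace, and replace the string with an empty string.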

title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
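To make the difference concrete, here is a small comparison of strip and replace on the value shown earlier. Because t is one of the characters in the argument, strip keeps going past the suffix and also eats the final t of "Portrait".

```python
name = "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"

# strip: the argument is a set of characters to trim from both ends.
stripped = name.strip("-_Google_Art_Project.jpg")

# replace: the argument is a substring that must match as a whole.
replaced = name.replace("_-_Google_Art_Project.jpg", "")

print(stripped)  # Rembrandt_van_Rijn_-_Self-Portrai
print(replaced)  # Rembrandt_van_Rijn_-_Self-Portrait
```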

Then we can do something similar for the underscores:

title = title.replace("_", " ")

And then we get:

'Rembrandt van Rijn - Self-Portrait'

Putting it together:

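def titleextract(url):
    pos1 = url.rindex("/")  # index of last /
    pos2 = url[:pos1].rindex("/")  # index of second-to-last /
    cleanedtitle1 = url[pos2 + 1: pos1]  # text between the two slashes
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title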

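Update

I missed the fact that the URL may contain sequences such as %2C that we would want to replace.

These can be handled in the same way with replace, for example: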

url = url.replace("%2C", ",")

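But you would have to do this for every similar sequence that might occur, so it is better to use the unquote function available in urllib. If you put this at the top of your code: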

from urllib.parse import unquote

then you can make these replacements with

url = unquote(url)

before the rest of the processing:

from urllib.parse import unquote

def titleextract(url):
    url = unquote(url)
    pos1 = url.rindex("/")
    pos2 = url[:pos1].rindex("/")
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title
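As a quick check, the function can be run against the two example URLs from the question. The last line applies unquote on its own to part of the percent-encoded file name quoted in the question, to show what happens to %2C; the variable names `url1`, `url2` and `encoded` are just for illustration.

```python
from urllib.parse import unquote


def titleextract(url):
    url = unquote(url)  # decode %XX sequences such as %2C
    pos1 = url.rindex("/")  # index of last /
    pos2 = url[:pos1].rindex("/")  # index of second-to-last /
    cleanedtitle1 = url[pos2 + 1: pos1]
    title = cleanedtitle1.replace("_-_Google_Art_Project.jpg", "")
    title = title.replace("_", " ")
    return title


url1 = ("https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/"
        "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
        "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")
url2 = ("https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/"
        "Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/"
        "220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg")

print(titleextract(url1))  # Rembrandt van Rijn - Self-Portrait
print(titleextract(url2))  # Rembrandt - Rembrandt and Saskia in the Scene of the Prodigal Son

encoded = "Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist"
print(unquote(encoded))  # Saskia_van_Uylenburgh,_the_Wife_of_the_Artist
```

One caveat: because unquote runs before the URL is split, an encoded slash (%2F) inside a file name would be decoded into a real / and shift the segment boundaries. If that could occur in your data, calling unquote on the extracted segment instead of on the whole URL avoids it.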
