How to substring with specific start and end positions where a set of characters appear?

Question

I am trying to clean up data scraped from a set of links. The CSV file I need to clean contains more than 100 links.

This is what a link looks like in the CSV:

"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

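I have found that scraping HTML data from this link directly does not work well, so I need to extract the URL embedded inside it. I want the substring that starts after &url= and ends at &ct, because that is where the real URL is.

I have read related posts, but none of them covered matching an end string as well. I also tried an approach from another post that uses the substring package, but it does not work for markers longer than one character.

How do I do this? Preferably without third-party packages.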

Answer 1

I don't understand the problem.

If you have a string, you can use string functions such as .find() together with slicing [start:end]:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

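# find() returns the index where the marker begins, so add its length to skip past it
start = text.find('url=') + len('url=')
# index of the marker that follows the embedded URL
end   = text.find('&ct=')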

text[start:end]

However, url= and ct= may appear in a different order, so it is safer to search for the first & that comes after url=:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

start = text.find('url=') + len('url=')
# the second argument makes find() start searching at that index
end   = text.find('&', start)

text[start:end]
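One caveat with the snippets above: .find() returns -1 when the marker is missing, which silently produces a wrong slice. A minimal sketch of a guard for that case follows. The helper name extract_between and the shortened example.com link are illustrative assumptions, not part of the original answer.

```python
def extract_between(text, start_marker, end_marker="&"):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present in this string
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # no end marker: the value runs to the end of the string
    return text[start:end]


# Shortened, made-up link in the same shape as the one in the question
link = "https://www.google.com/url?rct=j&sa=t&url=https://example.com/news/1&ct=ga"
print(extract_between(link, "&url="))
```

Note that searching for "&url=" would miss a link where url is the first parameter (written as ?url=), so this is only a sketch for links shaped like the one in the question.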

EDIT:

There is also the standard module urllib.parse for working with URLs, for example to split or join them.

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

url, query = urllib.parse.splitquery(text)
data       = urllib.parse.parse_qs(query)

data['url'][0]

data is now a dictionary:

{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
 'ct': ['ga'],
 'rct': ['j'],
 'sa': ['t'],
 'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
 'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}

EDIT:

Python shows a warning that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead:

text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"

import urllib.parse

parts = urllib.parse.urlparse(text)
data  = urllib.parse.parse_qs(parts.query)

data['url'][0]
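Since the question mentions a CSV with more than 100 links, here is a rough sketch of how the urlparse() approach might be applied to every row. It assumes the link sits in the first column. The function names extract_target and clean_links, the file name links.csv, and the example.com links are all hypothetical.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs


def extract_target(link):
    """Return the first value of the 'url' query parameter, or None if absent."""
    query = urlparse(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else None


def clean_links(rows, column=0):
    """Map each CSV row to its embedded URL; rows without one keep the original link."""
    cleaned = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        link = row[column]
        cleaned.append(extract_target(link) or link)
    return cleaned


# In-memory stand-in for the CSV file. With a real file, pass
# csv.reader(open("links.csv", newline="")) instead (hypothetical file name).
sample = io.StringIO(
    '"https://www.google.com/url?rct=j&sa=t&url=https://example.com/news/1&ct=ga"\n'
    '"https://example.com/already-clean"\n'
)
print(clean_links(csv.reader(sample)))
```

Unlike the .find() approach, parse_qs() also decodes percent-encoded values, so an embedded URL written as https%3A%2F%2F... comes back in readable form.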

