Is there a way to scrape the page URL (or a part of it) located in the address bar using Selenium in Python?

Question

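I am working with a huge movie dataset and am trying to get the IMDb ID of each movie from the IMDb website. I am using Selenium in Python. I inspected the movie page but could not find the IMDb ID anywhere in it. The ID is contained in the page's link, in the address bar, and I do not know how to scrape it. Is there a way to do this?

Here is an example page:

I need to get the underlined part of the URL.

Does anyone know how to do this?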

Answer 1

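Try driver.current_url, which returns the URL of the page the driver is currently on.

Reference: https://selenium-python.readthedocs.io/api.html

Also, it is worth noting that IMDb has an official API. You can take a look at https://aws.amazon.com/marketplace/pp/prodview-bj74roaptgdpi?sr=0-1&ref_=beagle&applicationId=AWSMPContessa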
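Once driver.current_url has returned the address-bar URL as a plain string, pulling out the ID is ordinary string processing. Here is a minimal sketch using a regular expression. The helper name extract_imdb_id is hypothetical, and the pattern assumes that IDs look like "tt" followed by digits and appear after /title/, as in the example ID tt1877830 mentioned in this thread.

```python
import re

# Assumption: a title ID is "tt" followed by digits and sits right after
# "/title/" in the URL, as in the example ID tt1877830.
_IMDB_ID_RE = re.compile(r"/title/(tt\d+)")


def extract_imdb_id(url):
    """Return the IMDb title ID found in url, or None if there is none."""
    match = _IMDB_ID_RE.search(url)
    return match.group(1) if match else None


# With Selenium, this would be called as extract_imdb_id(driver.current_url).
print(extract_imdb_id("https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1"))
```

Returning None instead of raising makes it easy to skip pages that are not title pages when looping over a large dataset.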

Answer 2

If you want the title ID from the movie URL, first get current_url, then use Python's split() function on "/" and take the second-to-last element.

currenturl=driver.current_url.split("/")[-2]
print(currenturl)

This returns tt1877830
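One caveat: index [-2] relies on the URL having a trailing slash after the ID. If the page URL were https://www.imdb.com/title/tt1877830 (no trailing slash), the same expression would give "title" instead. A sketch that does not depend on the position from the end parses the URL path and takes the segment that follows "title". The helper name imdb_id_from_path is hypothetical, and the /title/<id> layout is assumed from the example URL in this thread.

```python
from urllib.parse import urlsplit


def imdb_id_from_path(url):
    """Return the path segment right after "title", or None if absent."""
    # urlsplit drops the query string, so "?ref_=..." never ends up in a segment.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if "title" in segments:
        i = segments.index("title")
        if i + 1 < len(segments):
            return segments[i + 1]
    return None


# Works with or without the trailing slash and query string.
print(imdb_id_from_path("https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1"))
print(imdb_id_from_path("https://www.imdb.com/title/tt1877830"))
```

With Selenium, the argument would be driver.current_url.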

Answer 3

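To extract the page URL (or a part of it, i.e. the underlined part), for example tt1877830, you can read the current URL and split it on the / character, using either of the following solutions:

Using a positive index: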

driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
WebDriverWait(driver, 20).until(EC.url_contains("title"))
print(driver.current_url.split("/")[4])

Console output:
```
tt1877830
```

Using a negative index:

driver.get('https://www.imdb.com/title/tt1877830/?ref_=fn_al_tt_1')
WebDriverWait(driver, 20).until(EC.url_contains("title"))
print(driver.current_url.split("/")[-2])

Console output:
```
tt1877830
```

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Tags: python, selenium, imdb, web-scraping