Parse url in pandas df column and grab value of specific index

Question

I have a pandas df with a url column. The data looks like this:

row               url
1      'https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese'
2      'https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/'
3      'https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/'

I only need to grab the value at the second index, i.e. cooking, holiday-recipes, etc.
Desired output:

row               url
1               cooking
2               holiday-recipes
3               kitchen-tools

I tried to parse the url into separate columns and then drop the columns I don't need. Here is the code:

df['protocol'],df['domain'],df['path']=zip(*df['url'].map(urlparse(df['url']).urlsplit))

The error message is: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Is there a better way to solve this? How do I grab a specific index?

Answer 1

Is this what you are looking for?

df['url'] = df['url'].str.split('/').str[3]
print(df)

   row              url
0    1          cooking
1    2  holiday-recipes
2    3    kitchen-tools
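
For reference, here is a self-contained sketch of the same split, next to a variant built on urllib.parse.urlsplit, which is what the question was attempting. The likely cause of the ValueError is that urlsplit expects one string rather than a whole Series, so the variant applies it to each URL in turn. The sample frame is rebuilt from the question's data; the helper name first_segment and the variables by_split and by_urlsplit are illustrative assumptions, not part of the original post.

```python
from urllib.parse import urlsplit

import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "row": [1, 2, 3],
    "url": [
        "https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese",
        "https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/",
        "https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/",
    ],
})

# Splitting on "/" gives ['https:', '', 'www.delish.com', 'cooking', ...],
# so index 3 is the first path segment.
by_split = df["url"].str.split("/").str[3]


def first_segment(url):
    """Return the first path segment of a single URL, or None if the path is empty."""
    # urlsplit takes one string at a time, so it is applied per element via map.
    path = urlsplit(url).path              # e.g. '/cooking/recipe-ideas/...'
    parts = path.strip("/").split("/")
    return parts[0] if parts[0] else None


by_urlsplit = df["url"].map(first_segment)
```

The urlsplit variant does not depend on the URL starting with a scheme and two slashes in exactly that shape, so it may hold up better if the column mixes differently formed URLs.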

Answer 2

Another way is to match letters and the - character immediately after com/

df['url']=df['url'].str.extract(r'((?<=com\/)[a-z-]+)')

               url
0          cooking
1  holiday-recipes
2    kitchen-tools
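
A minimal, self-contained sketch of the regex approach, assuming the same three URLs as in the question. The variable names urls and section are illustrative only. Passing expand=False is a choice made here so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample URLs rebuilt from the question.
urls = pd.Series([
    "https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese",
    "https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/",
    "https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/",
])

# (?<=com/) is a lookbehind: the match must start right after "com/".
# [a-z-]+ then consumes lowercase letters and hyphens, stopping at the next "/".
# expand=False returns a Series because the pattern has a single capture group.
section = urls.str.extract(r"(?<=com/)([a-z-]+)", expand=False)
```

Note that this pattern is tied to a .com domain and to path segments made only of lowercase letters and hyphens. A segment containing digits or uppercase letters would be cut short at the first such character.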
