Urlparse applied to a column for extracting length and TLD info

I am trying to extract the length and the suffix (TLD) from a list of websites in a pandas DataFrame.

Website.             Label
18egh.com            1
fish.co.uk           0
www.description.com  1
http://world.com     1

My desired output is:

Website              Label  Length  Tld
18egh.com            1      5       com
fish.co.uk           0      4       co.uk
www.description.com  1      11      com
http://world.com     1      5       com

I first tried to get the length as follows:

def get_domain(df):  
    my_list=[]
    for x in df['Website'].tolist():
          domain = urlparse(x).netloc
          my_list.append(domain)
          df['Domain']  = my_list
          df['Length']=df['Domain'].str.len()
    return df

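But when I check the list, it is empty. I understand that urlparse might be enough to extract the domain and TLD information, but if I am wrong, I would appreciate being pointed in the right direction.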

Update:

To extract the domain and its related parts, try tldextract; it does the job.

Example:

import pandas as pd
import tldextract # pip install tldextract | # conda install -c conda-forge tldextract

df = pd.DataFrame({'Website.': {0: '18egh.com',
  1: 'fish.co.uk',
  2: 'www.description.com',
  3: 'http://world.com',
  4: 'http://forums.news.cnn.com/'},
 'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})

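# Read the named fields explicitly rather than relying on the result unpacking into exactly three values
parts = df['Website.'].map(tldextract.extract)
df['subdomain'] = parts.map(lambda r: r.subdomain)
df['domain'] = parts.map(lambda r: r.domain)
df['suffix'] = parts.map(lambda r: r.suffix)

print(df)

                          Website.  Label    subdomain       domain suffix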
    0                    18egh.com      1                     18egh    com
    1                   fish.co.uk      0                      fish  co.uk
    2          www.description.com      1          www  description    com
    3             http://world.com      1                     world    com
    4  http://forums.news.cnn.com/      0  forums.news          cnn    com
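
As for why the original attempt produced empty values: urlparse only fills in netloc when the string has a scheme or starts with //, so a bare name such as 18egh.com ends up entirely in path. Below is a minimal sketch of that behaviour and one common workaround, prefixing // before parsing. The helper name hostname_of is hypothetical, not part of any library.

```python
from urllib.parse import urlparse


def hostname_of(url):
    """Return the host part of url, tolerating a missing scheme (hypothetical helper)."""
    # Without a scheme or a leading "//", urlparse treats the whole string as a path,
    # so add "//" to make it parse as a network location.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlparse(url).netloc


# A bare name has an empty netloc; the text lands in .path instead.
print(repr(urlparse("18egh.com").netloc))
print(urlparse("18egh.com").path)

# With the workaround the host is recovered either way.
print(hostname_of("18egh.com"))
print(hostname_of("http://world.com"))
```

This still returns the full host (for example www.description.com), so separating a public suffix such as co.uk from the registered name remains a job for tldextract.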

The original answer follows.

Try:

import pandas as pd

df = pd.DataFrame({'Website.': {0: '18egh.com',
  1: 'fish.co.uk',
  2: 'www.description.com',
  3: 'http://world.com'},
 'Label': {0: 1, 1: 0, 2: 1, 3: 1}})

# Optional scheme, then optional "www.", then capture everything up to the first dot
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.'

df['Domain'] = df['Website.'].str.extract(pattern)
df['Domain_Len'] = df['Domain'].str.len()

print(df)

    Website.             Label  Domain          Domain_Len
0   18egh.com            1      18egh           5
1   fish.co.uk           0      fish            4
2   www.description.com  1      description     11
3   http://world.com     1      world           5

Or:

# Optional scheme, then optional "www."; group 1 is the text before the first dot, group 2 is the rest
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$'

df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
df['TLD_Len'] = df['TLD'].str.len()
df['Domain_Len'] = df['Domain'].str.len()

print(df)

    Website.             Label  TLD     TLD_Len     Domain       Domain_Len
0   18egh.com            1      com     3           18egh        5
1   fish.co.uk           0      co.uk   5           fish         4
2   www.description.com  1      com     3           description  11
3   http://world.com     1      com     3           world        5
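
The regular-expression approach only works when the host has no subdomain other than www and no trailing path. Here is a quick check with the standard re module on plain strings. The pattern is an assumed variant that makes the scheme and www. independently optional, and split_host is a hypothetical helper name.

```python
import re

# Assumed variant of the pattern: optional scheme, optional "www.",
# then the text before the first dot and everything after it.
PATTERN = re.compile(r'(?:https?://)?(?:www\.)?(.*?)\.(.*)$')


def split_host(url):
    """Return (domain, rest) for url, or (None, None) if there is no dot to split on."""
    m = PATTERN.match(url)
    return m.groups() if m else (None, None)


for url in ["18egh.com", "fish.co.uk", "http://www.world.com",
            "http://forums.news.cnn.com/"]:
    print(url, split_host(url))
```

The last URL shows the limitation: the first group is forums rather than cnn, and the second group keeps news.cnn.com/ including the slash. For inputs like that, tldextract is the safer choice.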