Urlparse applied to a column for extracting length and TLD info
I am trying to extract the length and the suffix (TLD) from a list of websites in a pandas DataFrame.
Website. Label
18egh.com 1
fish.co.uk 0
www.description.com 1
http://world.com 1
My desired output is:
Website Label Length Tld
18egh.com 1 5 com
fish.co.uk 0 4 co.uk
www.description.com 1 11 com
http://world.com 1 5 com
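I first tried to get the length as follows:
from urllib.parse import urlparse

def get_domain(df):
    my_list = []
    for x in df['Website'].tolist():
        domain = urlparse(x).netloc
        my_list.append(domain)
    df['Domain'] = my_list
    df['Length'] = df['Domain'].str.len()
    return df
But when I check the list, it is empty. I understand that urlparse may be enough to extract the domain and TLD information, but if I am wrong, I would appreciate being pointed in the right direction.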
Update:
To extract the domain and the other parts of the host name, try tldextract, which does this job.
Example:
import pandas as pd
import tldextract  # pip install tldextract  |  conda install -c conda-forge tldextract

df = pd.DataFrame({'Website.': {0: '18egh.com',
                                1: 'fish.co.uk',
                                2: 'www.description.com',
                                3: 'http://world.com',
                                4: 'http://forums.news.cnn.com/'},
                   'Label': {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}})

def split_host(url):
    # Read the parts by attribute name rather than unpacking the result,
    # so the number of fields on the returned object does not matter.
    ext = tldextract.extract(url)
    return pd.Series([ext.subdomain, ext.domain, ext.suffix])

df[['subdomain', 'domain', 'suffix']] = df['Website.'].apply(split_host)
print(df)
                      Website.  Label    subdomain       domain suffix
0                    18egh.com      1                     18egh    com
1                   fish.co.uk      0                      fish  co.uk
2          www.description.com      1          www  description    com
3             http://world.com      1                     world    com
4  http://forums.news.cnn.com/      0  forums.news          cnn    com
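As for why the urlparse attempt in the question came back empty: urlparse only fills netloc when the string contains "//", so a bare host such as 18egh.com ends up in the path instead. The sketch below shows this using only the standard library. The helper name host_of is hypothetical, and prefixing "//" is just one possible workaround.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of url, tolerating a missing scheme (hypothetical helper)."""
    # Without "//", urlparse treats the whole string as a path and netloc is empty.
    if "//" not in url:
        url = "//" + url
    return urlparse(url).netloc


sites = ["18egh.com", "fish.co.uk", "www.description.com", "http://world.com"]

print([urlparse(s).netloc for s in sites])
# ['', '', '', 'world.com']
print([host_of(s) for s in sites])
# ['18egh.com', 'fish.co.uk', 'www.description.com', 'world.com']
```

On a DataFrame, the same helper could be applied with df['Website.'].map(host_of). That only gives the full host name, so splitting it into domain and suffix still needs tldextract or a pattern.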
The original answer follows.
Try:
import pandas as pd

df = pd.DataFrame({'Website.': {0: '18egh.com',
                                1: 'fish.co.uk',
                                2: 'www.description.com',
                                3: 'http://world.com'},
                   'Label': {0: 1, 1: 0, 2: 1, 3: 1}})

# Skip an optional scheme and an optional "www.", then capture up to the next dot.
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.'
# expand=False returns a Series for a single capture group.
df['Domain'] = df['Website.'].str.extract(pattern, expand=False)
df['Domain_Len'] = df['Domain'].str.len()
print(df)
Website. Label Domain Domain_Len
0 18egh.com 1 18egh 5
1 fish.co.uk 0 fish 4
2 www.description.com 1 description 11
3 http://world.com 1 world 5
Or:
# Scheme and "www." are separate optional groups, so "http://www." is skipped too.
pattern = r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$'
df[['Domain', 'TLD']] = df['Website.'].str.extract(pattern, expand=True)
df['TLD_Len'] = df['TLD'].str.len()
df['Domain_Len'] = df['Domain'].str.len()
print(df)
Website. Label TLD TLD_Len Domain Domain_Len
0 18egh.com 1 com 3 18egh 5
1 fish.co.uk 0 co.uk 5 fish 4
2 www.description.com 1 com 3 description 11
3 http://world.com 1 com 3 world 5
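One caveat with the regex approach: it splits at the first dot after the optional prefix, so it only matches the desired output when there is no extra subdomain or trailing path. The sketch below shows this with the standard re module on the longer URL from the tldextract example. The helper name split_first_dot is hypothetical, and the pattern has the same shape as the one above.

```python
import re

# Optional scheme, optional "www.", then split at the first remaining dot.
PATTERN = re.compile(r'(?:https?://)?(?:www\.)?(.*?)\.(.*?)$')


def split_first_dot(url):
    """Return (domain, rest) split at the first dot (hypothetical helper)."""
    m = PATTERN.search(url)
    return m.groups() if m else (None, None)


print(split_first_dot('fish.co.uk'))
# ('fish', 'co.uk')
print(split_first_dot('http://forums.news.cnn.com/'))
# ('forums', 'news.cnn.com/')
```

For inputs like the second one, tldextract is the safer choice, since it works from a list of known suffixes rather than the position of the first dot.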