How to bin string values according to list of strings?
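Is there a way to bin the strings of a pandas column into custom groups with custom names? Something like the cut function, but for strings.
For example, a list of lists could define what the groups are: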
grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
['Australia'], ['USA']]
with the corresponding names ['Asia', 'Europe', 'Australia', 'Other'].
Anything that does not appear in the lists should be labeled 'Other' or something similar.
Example (input, followed by the desired output):
my_id country_name
0 100 Pakistan
1 200 Germany
2 140 Australia
3 400 Germany
4 225 China
5 125 Pakistan
6 600 Poland
7 0 Austria
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
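To make the example reproducible, the input frame can be built roughly like this. It is only a sketch: the variable name df is an assumption, chosen because the answers refer to the frame by that name.

```python
import pandas as pd

# Input frame copied from the example table in the question.
df = pd.DataFrame({
    "my_id": [100, 200, 140, 400, 225, 125, 600, 0],
    "country_name": ["Pakistan", "Germany", "Australia", "Germany",
                     "China", "Pakistan", "Poland", "Austria"],
})
```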
You can turn your grouping lists into a dictionary and then use pandas.Series.map instead of binning:
country_map = {
'Pakistan': 'Asia', 'China': 'Asia',
'Iran': 'Asia', 'Germany': 'Europe',
'UK': 'Europe', 'Poland': 'Europe',
'Australia': 'Australia', 'USA': 'Other'
}
# fill only the new column, so NaN values in other columns are left untouched
df.assign(Groups=df.country_name.map(country_map).fillna('Other'))
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
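If the lists from the question already exist, the dictionary does not have to be typed out. A minimal sketch, assuming the names and the lists are in the same order; the three-row frame is illustrative only and not the question's data:

```python
import pandas as pd

grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
                  ['Australia'], ['USA']]
names = ['Asia', 'Europe', 'Australia', 'Other']

# Pair each group name with its list, then flatten to country -> group.
country_map = {country: name
               for name, countries in zip(names, grouping_lists)
               for country in countries}

# Illustrative rows only (not the full example from the question).
df = pd.DataFrame({'my_id': [100, 0, 50],
                   'country_name': ['Pakistan', 'Austria', 'USA']})

# map() returns NaN for countries missing from the dictionary;
# fillna() on the Series turns those into 'Other'.
df['Groups'] = df['country_name'].map(country_map).fillna('Other')
```

Here 'Austria' ends up in 'Other' because it is missing from the dictionary, while 'USA' ends up there because the lists map it explicitly.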
Here is an approach that does not require creating the mapping dictionary by hand (in case it is large):
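import pandas as pd

grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
                  ['Australia'], ['USA']]
names = ['Asia', 'Europe', 'Australia', 'Other']
# create a df with mapping information: one row per (group, country) pair
maps = (pd.DataFrame({'Groups': names, 'country_name': grouping_lists})
        .explode('country_name')
        .reset_index(drop=True))
# left-join the mapping; countries without a match get NaN, filled with 'Other'
# (fillna is limited to the Groups column so other columns are not altered)
df = df.merge(maps, on='country_name', how='left').fillna({'Groups': 'Other'})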
my_id country_name Groups
0 100 Pakistan Asia
1 200 Germany Europe
2 140 Australia Australia
3 400 Germany Europe
4 225 China Asia
5 125 Pakistan Asia
6 600 Poland Europe
7 0 Austria Other
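One thing to watch with the merge approach: if a country were listed in two groups, the left join would duplicate its rows. As a hedged sketch, the validate argument of DataFrame.merge can guard against that; the three-row frame is illustrative only:

```python
import pandas as pd

grouping_lists = [['Pakistan', 'China', 'Iran'], ['Germany', 'UK', 'Poland'],
                  ['Australia'], ['USA']]
names = ['Asia', 'Europe', 'Australia', 'Other']

maps = (pd.DataFrame({'Groups': names, 'country_name': grouping_lists})
        .explode('country_name')
        .reset_index(drop=True))

# Illustrative rows only (not the full example from the question).
df = pd.DataFrame({'my_id': [100, 200, 0],
                   'country_name': ['Pakistan', 'Germany', 'Austria']})

# validate='many_to_one' makes merge raise a MergeError if any country
# appears more than once in maps, instead of silently duplicating rows.
merged = df.merge(maps, on='country_name', how='left', validate='many_to_one')
merged['Groups'] = merged['Groups'].fillna('Other')
```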
If you are not worried about speed, you can use a lambda.
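groups = {
    "Asia": ["Pakistan", "China", "Iran"],
    "Europe": ["Germany", "UK", "Poland"],
    "Australia": ["Australia"],
}
df["Groups"] = (
    df["country_name"]  # column is country_name, as in the question
    # list of matching group names per row (empty if no group matches)
    .apply(lambda x: [k for k in groups.keys() if x in groups[k]])
    .str[0]  # first match, or NaN for an empty list
    .fillna("Other")
)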
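A vectorized variant of the same idea, closer in spirit to cut, could combine Series.isin with numpy.select. This is a sketch rather than part of the answer above; it reuses the groups dictionary, and the four-row frame is illustrative only:

```python
import numpy as np
import pandas as pd

groups = {
    "Asia": ["Pakistan", "China", "Iran"],
    "Europe": ["Germany", "UK", "Poland"],
    "Australia": ["Australia"],
}

# Illustrative rows only (not the full example from the question).
df = pd.DataFrame({"country_name": ["Pakistan", "Poland", "Australia", "Austria"]})

# One boolean mask per group, in the same order as the group names.
conditions = [df["country_name"].isin(members).to_numpy()
              for members in groups.values()]

# For each row, np.select takes the name of the first matching group,
# or the default when no mask is True.
df["Groups"] = np.select(conditions, list(groups.keys()), default="Other")
```

If a country were listed under several groups, the first group in dictionary order would win, which matches the .str[0] behavior of the lambda version.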