Get country name from dataframe column by comparing with a list
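How can I get the country name from a dataframe column by comparing it with a list of strings that contains country names?
For example:
import pandas as pd
list = ["pakistan","united kingdom","uk","usa","united states","uae"]
# create a dataframe whose job_location column holds each employee's job location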
df = pd.DataFrame({
'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
})
df
                          job_location
0  birmingham, england, united kingdom
1            new jersey, united states
2           gilgit-baltistan, pakistan
3                                  uae
4                        united states
5                             pakistan
6   31-c2, gulberg 3, lahore, pakistan
I need to add a new column named country to the dataframe, containing the country name found in the job_location column.
Using clist as the list name, you can build a regular expression and use str.extract:
reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)
Output:
                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
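The regex approach can be made a little more defensive. The sketch below is one possible variant, not part of the original answer: it assumes you want whole-word matches only, so it escapes each name with re.escape, wraps the alternation in \b word boundaries, and tries longer names first. The variable name ordered is introduced here purely for illustration.

```python
import re

import pandas as pd

# Same sample data as in the question; clist is the renamed country list.
clist = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]
df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "gilgit-baltistan, pakistan",
        "uae",
        "united states",
        "pakistan",
        "31-c2, gulberg 3, lahore, pakistan",
    ],
})

# Longest names first, so a longer name is tried before a shorter one
# that happens to be its prefix.
ordered = sorted(clist, key=len, reverse=True)

# re.escape guards against names containing regex metacharacters;
# \b keeps "uk" from matching inside a longer word such as "ukraine".
reg = r"\b(%s)\b" % "|".join(re.escape(name) for name in ordered)

# expand=False returns a Series; rows without a match become NaN.
df["country"] = df["job_location"].str.extract(reg, expand=False)
print(df)
```

With the sample data this should give the same country column as the simpler pattern. The difference only shows up on inputs such as "kiev, ukraine", where the unanchored pattern would pick up "uk".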
But honestly, if job_location always ends with the country, it might be easier to split on commas and keep the last field.
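As a rough sketch of that idea using pandas string methods: split on commas, take the last field, strip the space left after the comma, and keep the value only if it appears in the list. The check against country_list is an extra assumption added here, so that rows ending in something other than a known country come out as NaN instead of a wrong value.

```python
import pandas as pd

country_list = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]
df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "gilgit-baltistan, pakistan",
        "uae",
        "united states",
        "pakistan",
        "31-c2, gulberg 3, lahore, pakistan",
    ],
})

# Split on commas, keep the last field, and trim the space left after the comma.
last_field = df["job_location"].str.split(",").str[-1].str.strip()

# Keep the value only when it is a known country; other rows become NaN.
df["country"] = last_field.where(last_field.isin(country_list))
print(df)
```

This only works when the country really is the final comma-separated field. If it can appear anywhere in the string, a search over the whole text is the safer choice.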
Without assuming that the country always comes last, here is something that should work:
import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe whose job_location column holds each employee's job location
df = pd.DataFrame({
    'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
})

matching_countries = []
for text in df['job_location']:
    for country in country_list:
        if country in text:
            matching_countries.append(country)
            break  # keep only the first match so every row gets exactly one value
    else:
        matching_countries.append(None)  # no listed country appears in this row

df['country'] = matching_countries
print(df)
Output:
                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
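First, change the name of your list, since list shadows Python's built-in type. I did this with a list comprehension:
# strip() removes the space that follows the last comma
df['country'] = [x.split(",")[-1].strip() for x in df['job_location']]
Output: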
 | job_location | country |
---|---|---|
0 | birmingham, england, united kingdom | united kingdom |
1 | new jersey, united states | united states |
2 | gilgit-baltistan, pakistan | pakistan |
3 | uae | uae |
4 | united states | united states |
5 | pakistan | pakistan |
6 | 31-c2, gulberg 3, lahore, pakistan | pakistan |