Get country name from dataframe column by comparing with a list

How can I get the country name from a dataframe column by comparing it with a list of strings that contains country names?

For example:

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe with a job_location column for each employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

I need to add a new column named country to the dataframe, containing the country name found in the job_location column.

With clist as the name of your list, you can build a regular expression and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
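A slightly more defensive variant of the pattern above, as a sketch only. It assumes the names in clist should be matched literally and as whole words, so each name is escaped with re.escape, the alternation is wrapped in \b word boundaries, and longer names are tried first. The helper name build_country_pattern and the "fukuoka, japan" row are made up for this illustration.

```python
import re

import pandas as pd

clist = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]


def build_country_pattern(names):
    # Hypothetical helper: escape each name and try longer names first,
    # so a short name never wins over a longer one starting at the same place.
    ordered = sorted(names, key=len, reverse=True)
    return r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b"


df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "uae",
        "fukuoka, japan",  # contains "uk" only inside a word, so it should not match
    ],
})

# expand=False returns a Series; rows without a match become NaN
df["country"] = df["job_location"].str.extract(build_country_pattern(clist), expand=False)
print(df)
```

Without the word boundaries, the plain alternation would pick up the "uk" inside "fukuoka".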

But honestly, if job_location always ends with the country, it would probably be easier to split on commas and keep the last field.
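The split-on-commas idea can be sketched with pandas string methods. This assumes every job_location really does end with the country; the sample rows are copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "job_location": [
        "birmingham, england, united kingdom",
        "new jersey, united states",
        "uae",
        "31-c2, gulberg 3, lahore, pakistan",
    ],
})

# Split once from the right, keep the last piece, and trim the space after the comma
df["country"] = df["job_location"].str.rsplit(",", n=1).str[-1].str.strip()
print(df)
```

Rows without any comma, such as "uae", are returned unchanged because the split yields a single piece.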

Without assuming that the country always comes last, here is something that should work:

import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe with a job_location column for each employee
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })

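matching_countries = []

for text in df['job_location']:
    # keep the first country found in the text, or None if nothing matches,
    # so the result always has exactly one entry per row
    match = next((country for country in country_list if country in text), None)
    matching_countries.append(match)

df['country'] = matching_countries

print(df)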

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

First, change the name of your list, since list shadows the Python built-in. I did it with a list comprehension:

df['country'] = [x.split(",")[-1].strip() for x in df['job_location']]

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
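Since the question asks for a comparison with the list, the last field can also be checked against it. A minimal sketch, assuming rows whose last field is not in the list should be left empty; the name country_list and the extra "remote" row are illustrative, not from the question.

```python
import pandas as pd

country_list = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]

df = pd.DataFrame({
    "job_location": [
        "gilgit-baltistan, pakistan",
        "united states",
        "remote",  # made-up row with no country, to show the fallback
    ],
})

# Take the last comma-separated field, stripped of surrounding whitespace
last_field = pd.Series(
    [x.split(",")[-1].strip() for x in df["job_location"]],
    index=df.index,
)

# Keep the field only when it appears in country_list; other rows become NaN
df["country"] = last_field.where(last_field.isin(country_list))
print(df)
```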