Get country name from dataframe column by comparing with a list

Question

How can I get the country name from a dataframe column by comparing it with a list of strings that contains country names?

For example:

import pandas as pd

list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe with one column, job_location (the employee's job location)
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })
df 
job_location
0   birmingham, england, united kingdom
1   new jersey, united states
2   gilgit-baltistan, pakistan
3   uae
4   united states
5   pakistan
6   31-c2, gulberg 3, lahore, pakistan

I need to add a new column named country to the dataframe, containing the country name taken from the job_location column.

Answer 1

With clist as the list name, you can build a regular expression and use str.extract:

reg = '(%s)' % '|'.join(clist)
df['country'] = df['job_location'].str.extract(reg)

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
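One caveat with a plain alternation: a short entry such as uk can match inside a longer word, and any regex metacharacters in a name would be read as pattern syntax. A minimal sketch of a stricter pattern, assuming the same clist as above (the 'kyiv, ukraine' row is made up for illustration and is not in the question's data):

```python
import re
import pandas as pd

clist = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]
df = pd.DataFrame({
    'job_location': [
        'birmingham, england, united kingdom',
        'new jersey, united states',
        'kyiv, ukraine',  # hypothetical row: contains "uk" only as part of a word
    ],
})

# escape each name and try longer names first so multi-word names are preferred
names = sorted(clist, key=len, reverse=True)
reg = r'\b(%s)\b' % '|'.join(re.escape(name) for name in names)

# expand=False returns a Series; rows without a match become NaN
df['country'] = df['job_location'].str.extract(reg, expand=False)
print(df)
```

The word boundaries (\b) keep uk from matching inside ukraine, so that row is left as missing instead of being mislabeled.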

But honestly, if job_location always ends with the country, it might be easier to split on commas and keep the last field.
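A rough sketch of that idea, assuming the country is always the last comma-separated field (the sample rows are taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'job_location': [
        'birmingham, england, united kingdom',
        'uae',
        '31-c2, gulberg 3, lahore, pakistan',
    ],
})

# split once from the right, take the last piece, and trim the space after the comma
df['country'] = df['job_location'].str.rsplit(',', n=1).str[-1].str.strip()
print(df)
```

Rows without a comma, such as uae, are returned unchanged because the split yields a single piece.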

Answer 2

Without assuming that the country always comes last, here is something that should work:

import pandas as pd

country_list = ["pakistan","united kingdom","uk","usa","united states","uae"]

# create a dataframe with one column, job_location (the employee's job location)
df = pd.DataFrame({
        'job_location' : ['birmingham, england, united kingdom','new jersey, united states','gilgit-baltistan, pakistan','uae','united states','pakistan','31-c2, gulberg 3, lahore, pakistan'],
    })

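matching_countries = []

# check each location against the list and keep the first country found;
# append None when nothing matches so the result has exactly one entry per row
for text in df['job_location']:
    match = next((country for country in country_list if country in text), None)
    matching_countries.append(match)

df['country'] = matching_countries

print(df)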

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan

Answer 3

First, rename your list, since list shadows the Python built-in. I did this with a list comprehension:

# take the last comma-separated field and strip the space left after the comma
df['country'] = [x.split(",")[-1].strip() for x in df['job_location']]

Output:

                          job_location         country
0  birmingham, england, united kingdom  united kingdom
1            new jersey, united states   united states
2           gilgit-baltistan, pakistan        pakistan
3                                  uae             uae
4                        united states   united states
5                             pakistan        pakistan
6   31-c2, gulberg 3, lahore, pakistan        pakistan
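Note that taking the last field never consults the country list, so any trailing text is accepted as a country. One possible follow-up check, sketched under the assumption that values not in the list should become missing (the name country_list and the 'remote' row are illustrative):

```python
import pandas as pd

country_list = ["pakistan", "united kingdom", "uk", "usa", "united states", "uae"]
df = pd.DataFrame({
    'job_location': ['new jersey, united states', 'gilgit-baltistan, pakistan', 'remote'],
})

# last comma-separated field, with surrounding whitespace removed
last_field = df['job_location'].str.split(',').str[-1].str.strip()

# keep the value only when it appears in the list; other rows become NaN
df['country'] = last_field.where(last_field.isin(country_list))
print(df)
```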
