Extract country name from text in column to create another column

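I have tried different combinations to extract the country name from a column and create a new column containing only the country. I can do it for a selected row, e.g. df.address[9998], but not for the whole column.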

import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)

Any idea what is going wrong here?

EDIT:

address is an object column in df.

df.address[:10] looks like this:

       Address
0    Turin, Italy        
1    NaN                 
2    Zurich, Switzerland 
3    NaN                 
4    Glyfada, Greece     
5    Frosinone, Italy    
6    Dublin, Ireland     
7    NaN                 
8    Turin, Italy        
8   ...                  
9    Kristiansand, Norway
Name: address, Length: 10, dtype: object

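Following Petar's response, when I run an individual query I correctly get the country, but when I try to build a column with all the countries (or with a range such as df.address[:5]), I get an empty Cntr.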

    import pycountry
    Cntr = []
    for country in pycountry.countries:
        if country.name in df['address'][1]:
            Cntr.append(country.name)
    Cntr

returns [Italy], and df.address[2] returns [ ], etc.

I have also run df['address'] = df['address'].astype('str') to make sure there are no floats or integers in the column.

You were really close. We can't loop like this: for country.name in df.address. Instead:

import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)

If this doesn't work, please provide more information, as I'm not sure what df.address looks like.
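One thing worth checking: on a pandas Series, the in operator tests the index labels, not the values, so country.name in df.address does not look inside the address strings, which may be why Cntr stays empty. Below is a minimal sketch of a row-by-row check. The names COUNTRY_NAMES and find_country are hypothetical, introduced only for illustration, and the short hard-coded list stands in for [c.name for c in pycountry.countries] so the example runs without pycountry installed.

```python
import numpy as np
import pandas as pd

# Stand-in for [c.name for c in pycountry.countries]; a short hypothetical
# list is used here so the sketch runs without pycountry installed.
COUNTRY_NAMES = ["Greece", "Ireland", "Italy", "Norway", "Switzerland"]


def find_country(address, names=COUNTRY_NAMES):
    """Return the first country name found in address, or NaN if none."""
    if not isinstance(address, str):
        return np.nan  # NaN and other non-string cells
    for name in names:
        if name in address:  # plain, case-sensitive substring test
            return name
    return np.nan


df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]}
)

# Series.apply calls find_country once per row
df["country"] = df["address"].apply(find_country)
print(df)
```

Because this is a substring test, a short name can match inside a longer one (for example "Niger" inside "Nigeria"), and the match is case-sensitive, so treat it as a starting point rather than a complete solution.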

Sample dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})

df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)

               address     city       country
0         Turin, Italy    Turin         Italy
1                  NaN      NaN           NaN
2  Zurich, Switzerland   Zurich   Switzerland
3                  NaN      NaN           NaN
4      Glyfada, greece  Glyfada        greece
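Two caveats about the split above: the country piece keeps the space that followed the comma (' Italy'), and an address with more than one comma would split into three pieces, which no longer fits two target columns. A minimal sketch of a variant that keeps only the text after the last comma, assuming the country always comes last, might look like this. The "Via Roma 1, Turin, Italy" row is made-up data added to show the multi-comma case.

```python
import numpy as np
import pandas as pd

# "Via Roma 1, Turin, Italy" is a hypothetical row with two commas
df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Via Roma 1, Turin, Italy", "Glyfada, greece"]}
)

# Split once from the right, keep the last piece, and trim whitespace.
# NaN rows stay NaN because the .str accessor propagates missing values.
df["country"] = df["address"].str.rsplit(",", n=1).str[-1].str.strip()
print(df)
```

This does not validate that the last piece is a real country name; it only relies on its position in the string.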

You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.

import numpy as np
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece