Extract country name from text in column to create another column
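I have tried different combinations to extract the country names from a column and create a new column containing only the countries. I can do it for selected rows, e.g. df.address[9998], but not for the whole column.
import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any idea what is going wrong here?
EDIT:
address is an object in df
df.address[:10] looks like this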
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
8 ...
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
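Following Petar's response, when I run an individual query I correctly get the country, but when I try to create a column with all the countries (or for a range such as df.address[:5]) I get an empty Cntr.
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr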
Returns
[Italy]
and df.address[2] returns [ ]
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure there are no floats or ints in the column.
You are really close. We can't loop like this: for country.name in df.address. Instead:
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)
If this doesn't work, please provide more information, since I'm not sure what df.address looks like.
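One likely reason the loop above still yields an empty list: with a pandas Series, `in` tests the index labels rather than the values, so `country.name in df.address` never looks at the address text. A minimal sketch of a row-wise alternative follows. It is not the answerer's code: `find_country` and the short `COUNTRY_NAMES` list are hypothetical stand-ins (the list replaces `[c.name for c in pycountry.countries]` so the sketch runs without pycountry), and a plain Python list stands in for the column.

```python
# Hypothetical stand-in for [c.name for c in pycountry.countries].
COUNTRY_NAMES = ["Greece", "Ireland", "Italy", "Norway", "Switzerland"]


def find_country(address, names=COUNTRY_NAMES):
    """Return the first country name found in address, or None."""
    # Missing values (NaN) are floats, not strings, so skip them.
    if not isinstance(address, str):
        return None
    for name in names:
        if name in address:
            return name
    return None


# A plain list standing in for df.address.
addresses = ["Turin, Italy", float("nan"), "Zurich, Switzerland",
             float("nan"), "Glyfada, Greece"]

# One result per row, so the output lines up with the input column.
countries = [find_country(a) for a in addresses]
print(countries)
```

With a real DataFrame the same function could be applied per row, e.g. `df['country'] = df['address'].apply(find_country)`. Note that after `astype('str')` missing values become the string 'nan', which simply matches no country here.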
Sample dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=2)
               address     city       country
0         Turin, Italy    Turin         Italy
1                  NaN      NaN           NaN
2  Zurich, Switzerland   Zurich   Switzerland
3                  NaN      NaN           NaN
4      Glyfada, greece  Glyfada        greece
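Splitting on ',' alone leaves the space after the comma in the country part (' Italy'), and an address with more than one comma would produce extra pieces. Below is a minimal sketch of a more defensive split, assuming the country is always the text after the last comma. `split_city_country` is a hypothetical helper, and plain Python strings stand in for the column.

```python
def split_city_country(address):
    """Split 'City, Country' on the last comma; return (city, country)."""
    # Missing values (NaN) are floats, not strings.
    if not isinstance(address, str):
        return (None, None)
    city, sep, country = address.rpartition(",")
    if not sep:
        # No comma at all: treat the whole string as the country (assumption).
        return (None, address.strip())
    # Strip the whitespace left around the comma.
    return (city.strip(), country.strip())


rows = ["Turin, Italy", float("nan"), "Via Roma 1, Turin, Italy", "Glyfada, greece"]
pairs = [split_city_country(r) for r in rows]
print(pairs)
```

In pandas, a comparable effect might be had with `df['address'].str.rsplit(',', n=1, expand=True)` followed by `.str.strip()` on each resulting column.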
You can use the function clean_country() from the DataPrep library. Install it with pip install dataprep.
import numpy as np
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece