How to print just selected substrings (contained in a dataframe column) setting conditions with pandas

Question

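Hi everyone. I have a huge dataset containing several countries represented by their ISO codes. However, some countries appear under their official names instead of an ISO code. I want to find those and replace them with their respective ISO codes.

This is a sample of my df: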

| TERRITORY               |
| ----------------------- |
| IT, GB, USA, France     |
| ES, Russia, Germany, PT |
| EG, LY, DZ              |

Expected output:

'The nations that were not converted are:' France, Russia, Germany

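The main problem is that the countries sit in the same cell and are treated as a single value. I wanted the program to print only the substrings longer than two characters, but after many attempts I got nowhere.

Can anyone help me?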

Answer 1

IIUC, you can split and explode the column, then check each entry against a list of known codes (pycountry is used here):

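import pycountry

# Set of all ISO 3166-1 alpha-2 codes known to pycountry
codes = {c.alpha_2 for c in pycountry.countries}
# or manually set
# codes = {'IT', 'GB', 'USA', 'FR'...}

# Split each cell on ', ', put every entry on its own row, and drop repeats
s = df['TERRITORY'].str.split(', ').explode().drop_duplicates()
# Keep only the entries that are not in the set of known codes
print(f'The nations that were not converted are: {", ".join(s[~s.isin(codes)])}')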

Output:

The nations that were not converted are: USA, France, Russia, Germany
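If installing pycountry is not an option, the length-based idea from the question can be sketched with pandas alone. This is only a minimal sketch: the sample frame is rebuilt from the question, and it assumes that every valid code is exactly two characters long, so a three-letter entry such as USA is flagged as well.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "TERRITORY": [
        "IT, GB, USA, France",
        "ES, Russia, Germany, PT",
        "EG, LY, DZ",
    ]
})

# Split each cell on commas, put every entry on its own row, trim whitespace
tokens = df["TERRITORY"].str.split(",").explode().str.strip()

# Assumption: anything longer than two characters is not a converted code
not_converted = tokens[tokens.str.len() > 2].drop_duplicates().tolist()

print(f"The nations that were not converted are: {', '.join(not_converted)}")
```

A length check cannot catch a two-letter string that is not a real code, which is where checking against a known set of codes is more robust.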

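For the replacement step mentioned in the question, one possible sketch is to map each entry through a lookup table and join the cell back together. The `name_to_code` dictionary and the `convert_cell` helper below are hypothetical names introduced for illustration; the mapping is filled in by hand and would need to cover every name present in the real data.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "TERRITORY": [
        "IT, GB, USA, France",
        "ES, Russia, Germany, PT",
        "EG, LY, DZ",
    ]
})

# Hypothetical hand-written mapping from names to ISO alpha-2 codes
name_to_code = {"USA": "US", "France": "FR", "Russia": "RU", "Germany": "DE"}

def convert_cell(cell: str) -> str:
    """Replace every known name in a comma-separated cell with its code."""
    parts = [p.strip() for p in cell.split(",")]
    # Entries without a mapping (already codes, or unknown names) are kept as-is
    return ", ".join(name_to_code.get(p, p) for p in parts)

df["TERRITORY"] = df["TERRITORY"].map(convert_cell)
print(df)
```

Entries that are missing from the mapping pass through unchanged, so the check from the answer can be rerun afterwards to see what is still unconverted.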
Tags: python, substring, dataframe, pandas