How can I use Python to convert country codes to full names and infer the country from the city name in an Excel file?

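I am a beginner in Python.

My Excel file has 2 columns: a country column and a city column.

In the country column, most values are country codes, some are full country names, some are U.S. state codes, and fewer than 1% are blank.

The city column contains full city names (not city codes), and nearly 20% of its values are blank.

How can I use Python to create a new column that shows the full country name for each country code, keeps the name unchanged where the country column already contains a full country name, and shows the United States where the value is a U.S. state code?

The tricky part is that a code in the country column can be ambiguous. Take CO as an example: it can stand for Colombia or Colorado. From the code alone I cannot tell whether it is a country or a state, but the corresponding city name settles it (for example, Longmont is in Colorado and Bogota is in Colombia). How can I handle this and infer the full country name in the new column from the corresponding city name?

Thanks for your help!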

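One suggestion is to create dictionaries (i.e. dic = {'CO':'Colombia',...} and dic_state = {'CO':'Colorado',...}). Then use an if statement to check whether the country is the USA; if it is, use dic_state. Finally, create the new column with the appropriate command (which depends on the package/module you are using).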
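The two-dictionary idea could be sketched roughly as follows. The tables are tiny excerpts, and the `is_us` flag is an assumption: the suggestion does not say how to decide that a row belongs to the USA, so here the caller has to supply that decision.

```python
# Deliberately tiny excerpts; real tables would list every code.
dic = {"CO": "Colombia", "DZ": "Algeria", "US": "United States"}
dic_state = {"CO": "Colorado", "AL": "Alabama"}

def full_country_name(value, is_us):
    """Return the country for a code, treating it as a state code when is_us is true.

    is_us is a hypothetical flag supplied by the caller; how to derive it
    (for example from the city) is left open here.
    """
    code = value.strip().upper()
    if is_us and code in dic_state:
        # Every US state code maps to the same country.
        return "United States"
    # Unknown values (e.g. an already spelled-out name) are returned unchanged.
    return dic.get(code, value)

print(full_country_name("co", is_us=True))
print(full_country_name("co", is_us=False))
print(full_country_name("France", is_us=False))
```

With pandas, a function like this could be applied row by row, for example through `df.apply(..., axis=1)`, once there is some way to set `is_us` for each row.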

Good luck!

Well, you could keep a JSON file of the form {key (state): values (cities belonging to that state)}, read it with Python, and use it to match each city to its state.
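A minimal sketch of that JSON idea, assuming a hypothetical mapping of state names to city lists. The JSON is inlined as a string here; in practice it would be read from a file with `json.load`.

```python
import json

# Hypothetical content of the state -> cities JSON file.
STATE_CITIES_JSON = '{"Colorado": ["Longmont", "Denver"], "Alabama": ["Huntsville"]}'

def build_city_index(state_cities):
    """Invert {state: [cities]} into {lowercased city: state}."""
    index = {}
    for state, cities in state_cities.items():
        for city in cities:
            index[city.lower()] = state
    return index

city_to_state = build_city_index(json.loads(STATE_CITIES_JSON))

def state_for_city(city):
    """Return the US state for a city, or None when the city is not listed."""
    return city_to_state.get(city.strip().lower())

print(state_for_city("Longmont"))
print(state_for_city("Bogota"))
```

If a city is found in the index, the country is the United States; otherwise the code can be treated as a country code. Note that city names shared by several states would overwrite each other in this simple index.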

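Explanation

The task is coded with the following logic.

  1. Handle simple abbreviations such as U.S.
  2. Country length greater than 3
    1. Have both country and city
      • Find the closest city-country pair in the list of cities
    2. Country only
      • Find the closest country match in the list of countries keyed by two-letter country code
  3. Country length equals 3
    • Look up the country by its three-letter country code
  4. Country length equals 2 (can be a country code or a state code)
    1. Code is not in the list of states
      • Must be a country code, so look up the country by two-letter country code
    2. Code is not in the list of countries
      • Must be a U.S. state code, so the country is the United States
    3. Can be a country code or a state code
      • Check the city with this as a state code
      • Check the city with this as a country code
      • Take the better match of the two possibilities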
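The two-letter branch of this logic can be illustrated with a small sketch. The tables below are short excerpts for illustration, and `classify_code` is a hypothetical helper name, not part of the answer's code.

```python
# Small excerpts of the state and country tables, for illustration only.
state_codes = {"CO": "Colorado", "AL": "Alabama", "TX": "Texas"}
country_codes = {"CO": "Colombia", "AL": "Albania", "DZ": "Algeria"}

def classify_code(code):
    """Label a two-letter code as 'country', 'state', 'ambiguous', or 'unknown'."""
    code = code.upper()
    in_states = code in state_codes
    in_countries = code in country_codes
    if in_states and in_countries:
        return "ambiguous"  # the city is needed to decide
    if in_states:
        return "state"      # the country must be the United States
    if in_countries:
        return "country"
    return "unknown"

# Codes that appear in both tables are the ones that need the city.
ambiguous = sorted(state_codes.keys() & country_codes.keys())
print(ambiguous)
```

Only the codes in the intersection require the more expensive city comparison; everything else resolves with a plain dictionary lookup.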

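Note: string matching uses fuzzy matching to tolerate spelling variations in names. The rapidfuzz library is used instead of fuzzywuzzy because it is an order of magnitude faster.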
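To show what fuzzy matching buys here without installing anything, the sketch below uses the standard library's `difflib` as a stand-in. It is not rapidfuzz and scores differently from `fuzz.partial_ratio`; `closest_name` and the candidate list are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical candidate list; the answer matches against the full country table.
candidates = ["Colombia", "Colorado", "Cambodia"]

def closest_name(name, names, cutoff=0.6):
    """Return the best fuzzy match for name, or None if nothing reaches the cutoff."""
    matches = get_close_matches(name, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A common misspelling still resolves to the intended country.
print(closest_name("Columbia", candidates))
```

The comparison is case-sensitive, so normalizing case on both sides before matching may be worthwhile.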

Code

import re   # used below to split the three-letter code table
import pandas as pd
from rapidfuzz import fuzz

def find_closest_country(country):
    ' Country with the closest name in list of countries in country code '
    ratios = [fuzz.partial_ratio(country, x) for x in alpha2.values()]
    rated_countries = [(info, r) for info, r in zip(alpha2.values(), ratios)]
    
    # Best match with shortest name
    return sorted(rated_countries, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
    
def check_city_country(city, country):
    ' City, Country pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(country, x['country']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0]['name'])), reverse = True)[0]
    
def check_city_subregion(city, subregion):
    ' City, subregion pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(subregion, x['subcountry']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0]['name'])), reverse = True)[0]
    
def lookup(country, city):
    '''
        Finds country based upon country and city
        country - country name or country code
        city - name of city
    '''
    if country.lower() == 'u.s.':
        # Picks up common US acronym
        country = "US"
   
    if len(country) > 3:
        # Must be country since too long for abbreviation
        if city:
            # Find closest city country pair in list of cities
            city_info = check_city_country(city, country)
            if city_info:
                return city_info[0]['country']
       
        # No city, so find closest country in list of countries (2 code abbreviations reverse lookup)
        countries = find_closest_country(country)
        if countries:
            return countries[0]
        
        return None
    elif len(country) == 3:
        # 3 letter abbreviation
        country = country.upper()
        return alpha3.get(country, None)
    
    elif len(country) == 2:
        # Two letter country abbreviation
        country = country.upper()
        if country not in states:
            # Not a state code, so look up country from code
            return alpha2.get(country, None)
        
        if country not in alpha2:
            # Not a country code, so must be state code for US
            return "United States of America"
        
        # Could be a country code or a state code
        
        if city:
            # Have 2-letter code (could be country or state)
            pos_country = alpha2[country]  # possible country
            pos_state = states[country]    # possible state
            
            # check closest country with this city
            pos_countries = check_city_country(city, pos_country)
            
            # If state code, country would be United States
            pos_us = check_city_country(city, "United States")
            
            if pos_countries[1] > pos_us[1]:
                # Provided better match as country code
                return pos_countries[0]['country']
            else:
                # Provided better match as state code (i.e. "United States")
                return pos_us[0]['country']
        else:
            return alpha2[country]
             
    else:
        return None
   

Data

# State Codes
# https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97
states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts","MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}

# two letter country codes
# https://gist.github.com/carlopires/1261951/d13ca7320a6abcd4b0aa800d351a31b54cefdff4
alpha2 = {
    'AD': 'Andorra',
    'AE': 'United Arab Emirates',
    'AF': 'Afghanistan',
    'AG': 'Antigua & Barbuda',
    'AI': 'Anguilla',
    'AL': 'Albania',
    'AM': 'Armenia',
    'AN': 'Netherlands Antilles',
    'AO': 'Angola',
    'AQ': 'Antarctica',
    'AR': 'Argentina',
    'AS': 'American Samoa',
    'AT': 'Austria',
    'AU': 'Australia',
    'AW': 'Aruba',
    'AZ': 'Azerbaijan',
    'BA': 'Bosnia and Herzegovina',
    'BB': 'Barbados',
    'BD': 'Bangladesh',
    'BE': 'Belgium',
    'BF': 'Burkina Faso',
    'BG': 'Bulgaria',
    'BH': 'Bahrain',
    'BI': 'Burundi',
    'BJ': 'Benin',
    'BM': 'Bermuda',
    'BN': 'Brunei Darussalam',
    'BO': 'Bolivia',
    'BR': 'Brazil',
    'BS': 'Bahamas',
    'BT': 'Bhutan',
    'BU': 'Burma (no longer exists)',
    'BV': 'Bouvet Island',
    'BW': 'Botswana',
    'BY': 'Belarus',
    'BZ': 'Belize',
    'CA': 'Canada',
    'CC': 'Cocos (Keeling) Islands',
    'CF': 'Central African Republic',
    'CG': 'Congo',
    'CH': 'Switzerland',
    'CI': 'Côte D\'ivoire (Ivory Coast)',
    'CK': 'Cook Islands',
    'CL': 'Chile',
    'CM': 'Cameroon',
    'CN': 'China',
    'CO': 'Colombia',
    'CR': 'Costa Rica',
    'CS': 'Czechoslovakia (no longer exists)',
    'CU': 'Cuba',
    'CV': 'Cape Verde',
    'CX': 'Christmas Island',
    'CY': 'Cyprus',
    'CZ': 'Czech Republic',
    'DD': 'German Democratic Republic (no longer exists)',
    'DE': 'Germany',
    'DJ': 'Djibouti',
    'DK': 'Denmark',
    'DM': 'Dominica',
    'DO': 'Dominican Republic',
    'DZ': 'Algeria',
    'EC': 'Ecuador',
    'EE': 'Estonia',
    'EG': 'Egypt',
    'EH': 'Western Sahara',
    'ER': 'Eritrea',
    'ES': 'Spain',
    'ET': 'Ethiopia',
    'FI': 'Finland',
    'FJ': 'Fiji',
    'FK': 'Falkland Islands (Malvinas)',
    'FM': 'Micronesia',
    'FO': 'Faroe Islands',
    'FR': 'France',
    'FX': 'France, Metropolitan',
    'GA': 'Gabon',
    'GB': 'United Kingdom (Great Britain)',
    'GD': 'Grenada',
    'GE': 'Georgia',
    'GF': 'French Guiana',
    'GH': 'Ghana',
    'GI': 'Gibraltar',
    'GL': 'Greenland',
    'GM': 'Gambia',
    'GN': 'Guinea',
    'GP': 'Guadeloupe',
    'GQ': 'Equatorial Guinea',
    'GR': 'Greece',
    'GS': 'South Georgia and the South Sandwich Islands',
    'GT': 'Guatemala',
    'GU': 'Guam',
    'GW': 'Guinea-Bissau',
    'GY': 'Guyana',
    'HK': 'Hong Kong',
    'HM': 'Heard & McDonald Islands',
    'HN': 'Honduras',
    'HR': 'Croatia',
    'HT': 'Haiti',
    'HU': 'Hungary',
    'ID': 'Indonesia',
    'IE': 'Ireland',
    'IL': 'Israel',
    'IN': 'India',
    'IO': 'British Indian Ocean Territory',
    'IQ': 'Iraq',
    'IR': 'Islamic Republic of Iran',
    'IS': 'Iceland',
    'IT': 'Italy',
    'JM': 'Jamaica',
    'JO': 'Jordan',
    'JP': 'Japan',
    'KE': 'Kenya',
    'KG': 'Kyrgyzstan',
    'KH': 'Cambodia',
    'KI': 'Kiribati',
    'KM': 'Comoros',
    'KN': 'St. Kitts and Nevis',
    'KP': 'Korea, Democratic People\'s Republic of',
    'KR': 'Korea, Republic of',
    'KW': 'Kuwait',
    'KY': 'Cayman Islands',
    'KZ': 'Kazakhstan',
    'LA': 'Lao People\'s Democratic Republic',
    'LB': 'Lebanon',
    'LC': 'Saint Lucia',
    'LI': 'Liechtenstein',
    'LK': 'Sri Lanka',
    'LR': 'Liberia',
    'LS': 'Lesotho',
    'LT': 'Lithuania',
    'LU': 'Luxembourg',
    'LV': 'Latvia',
    'LY': 'Libyan Arab Jamahiriya',
    'MA': 'Morocco',
    'MC': 'Monaco',
    'MD': 'Moldova, Republic of',
    'MG': 'Madagascar',
    'MH': 'Marshall Islands',
    'ML': 'Mali',
    'MN': 'Mongolia',
    'MM': 'Myanmar',
    'MO': 'Macau',
    'MP': 'Northern Mariana Islands',
    'MQ': 'Martinique',
    'MR': 'Mauritania',
    'MS': 'Montserrat',
    'MT': 'Malta',
    'MU': 'Mauritius',
    'MV': 'Maldives',
    'MW': 'Malawi',
    'MX': 'Mexico',
    'MY': 'Malaysia',
    'MZ': 'Mozambique',
    'NA': 'Namibia',
    'NC': 'New Caledonia',
    'NE': 'Niger',
    'NF': 'Norfolk Island',
    'NG': 'Nigeria',
    'NI': 'Nicaragua',
    'NL': 'Netherlands',
    'NO': 'Norway',
    'NP': 'Nepal',
    'NR': 'Nauru',
    'NT': 'Neutral Zone (no longer exists)',
    'NU': 'Niue',
    'NZ': 'New Zealand',
    'OM': 'Oman',
    'PA': 'Panama',
    'PE': 'Peru',
    'PF': 'French Polynesia',
    'PG': 'Papua New Guinea',
    'PH': 'Philippines',
    'PK': 'Pakistan',
    'PL': 'Poland',
    'PM': 'St. Pierre & Miquelon',
    'PN': 'Pitcairn',
    'PR': 'Puerto Rico',
    'PT': 'Portugal',
    'PW': 'Palau',
    'PY': 'Paraguay',
    'QA': 'Qatar',
    'RE': 'Réunion',
    'RO': 'Romania',
    'RU': 'Russian Federation',
    'RW': 'Rwanda',
    'SA': 'Saudi Arabia',
    'SB': 'Solomon Islands',
    'SC': 'Seychelles',
    'SD': 'Sudan',
    'SE': 'Sweden',
    'SG': 'Singapore',
    'SH': 'St. Helena',
    'SI': 'Slovenia',
    'SJ': 'Svalbard & Jan Mayen Islands',
    'SK': 'Slovakia',
    'SL': 'Sierra Leone',
    'SM': 'San Marino',
    'SN': 'Senegal',
    'SO': 'Somalia',
    'SR': 'Suriname',
    'ST': 'Sao Tome & Principe',
    'SU': 'Union of Soviet Socialist Republics (no longer exists)',
    'SV': 'El Salvador',
    'SY': 'Syrian Arab Republic',
    'SZ': 'Swaziland',
    'TC': 'Turks & Caicos Islands',
    'TD': 'Chad',
    'TF': 'French Southern Territories',
    'TG': 'Togo',
    'TH': 'Thailand',
    'TJ': 'Tajikistan',
    'TK': 'Tokelau',
    'TM': 'Turkmenistan',
    'TN': 'Tunisia',
    'TO': 'Tonga',
    'TP': 'East Timor',
    'TR': 'Turkey',
    'TT': 'Trinidad & Tobago',
    'TV': 'Tuvalu',
    'TW': 'Taiwan, Province of China',
    'TZ': 'Tanzania, United Republic of',
    'UA': 'Ukraine',
    'UG': 'Uganda',
    'UM': 'United States Minor Outlying Islands',
    'US': 'United States of America',
    'UY': 'Uruguay',
    'UZ': 'Uzbekistan',
    'VA': 'Vatican City State (Holy See)',
    'VC': 'St. Vincent & the Grenadines',
    'VE': 'Venezuela',
    'VG': 'British Virgin Islands',
    'VI': 'United States Virgin Islands',
    'VN': 'Viet Nam',
    'VU': 'Vanuatu',
    'WF': 'Wallis & Futuna Islands',
    'WS': 'Samoa',
    'YD': 'Democratic Yemen (no longer exists)',
    'YE': 'Yemen',
    'YT': 'Mayotte',
    'YU': 'Yugoslavia',
    'ZA': 'South Africa',
    'ZM': 'Zambia',
    'ZR': 'Zaire',
    'ZW': 'Zimbabwe',
    'ZZ': 'Unknown or unspecified country',
}

# Three letter codes
#https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Uses_and_applications
alpha3 = """ABW  Aruba
AFG  Afghanistan
AGO  Angola
AIA  Anguilla
ALA  Åland Islands
ALB  Albania
AND  Andorra
ARE  United Arab Emirates
ARG  Argentina
ARM  Armenia
ASM  American Samoa
ATA  Antarctica
ATF  French Southern Territories
ATG  Antigua and Barbuda
AUS  Australia
AUT  Austria
AZE  Azerbaijan
BDI  Burundi
BEL  Belgium
BEN  Benin
BES  Bonaire, Sint Eustatius and Saba
BFA  Burkina Faso
BGD  Bangladesh
BGR  Bulgaria
BHR  Bahrain
BHS  Bahamas
BIH  Bosnia and Herzegovina
BLM  Saint Barthélemy
BLR  Belarus
BLZ  Belize
BMU  Bermuda
BOL  Bolivia (Plurinational State of)
BRA  Brazil
BRB  Barbados
BRN  Brunei Darussalam
BTN  Bhutan
BVT  Bouvet Island
BWA  Botswana
CAF  Central African Republic
CAN  Canada
CCK  Cocos (Keeling) Islands
CHE  Switzerland
CHL  Chile
CHN  China
CIV  Côte d'Ivoire
CMR  Cameroon
COD  Congo, Democratic Republic of the
COG  Congo
COK  Cook Islands
COL  Colombia
COM  Comoros
CPV  Cabo Verde
CRI  Costa Rica
CUB  Cuba
CUW  Curaçao
CXR  Christmas Island
CYM  Cayman Islands
CYP  Cyprus
CZE  Czechia
DEU  Germany
DJI  Djibouti
DMA  Dominica
DNK  Denmark
DOM  Dominican Republic
DZA  Algeria
ECU  Ecuador
EGY  Egypt
ERI  Eritrea
ESH  Western Sahara
ESP  Spain
EST  Estonia
ETH  Ethiopia
FIN  Finland
FJI  Fiji
FLK  Falkland Islands (Malvinas)
FRA  France
FRO  Faroe Islands
FSM  Micronesia (Federated States of)
GAB  Gabon
GBR  United Kingdom of Great Britain and Northern Ireland
GEO  Georgia
GGY  Guernsey
GHA  Ghana
GIB  Gibraltar
GIN  Guinea
GLP  Guadeloupe
GMB  Gambia
GNB  Guinea-Bissau
GNQ  Equatorial Guinea
GRC  Greece
GRD  Grenada
GRL  Greenland
GTM  Guatemala
GUF  French Guiana
GUM  Guam
GUY  Guyana
HKG  Hong Kong
HMD  Heard Island and McDonald Islands
HND  Honduras
HRV  Croatia
HTI  Haiti
HUN  Hungary
IDN  Indonesia
IMN  Isle of Man
IND  India
IOT  British Indian Ocean Territory
IRL  Ireland
IRN  Iran (Islamic Republic of)
IRQ  Iraq
ISL  Iceland
ISR  Israel
ITA  Italy
JAM  Jamaica
JEY  Jersey
JOR  Jordan
JPN  Japan
KAZ  Kazakhstan
KEN  Kenya
KGZ  Kyrgyzstan
KHM  Cambodia
KIR  Kiribati
KNA  Saint Kitts and Nevis
KOR  Korea, Republic of
KWT  Kuwait
LAO  Lao People's Democratic Republic
LBN  Lebanon
LBR  Liberia
LBY  Libya
LCA  Saint Lucia
LIE  Liechtenstein
LKA  Sri Lanka
LSO  Lesotho
LTU  Lithuania
LUX  Luxembourg
LVA  Latvia
MAC  Macao
MAF  Saint Martin (French part)
MAR  Morocco
MCO  Monaco
MDA  Moldova, Republic of
MDG  Madagascar
MDV  Maldives
MEX  Mexico
MHL  Marshall Islands
MKD  North Macedonia
MLI  Mali
MLT  Malta
MMR  Myanmar
MNE  Montenegro
MNG  Mongolia
MNP  Northern Mariana Islands
MOZ  Mozambique
MRT  Mauritania
MSR  Montserrat
MTQ  Martinique
MUS  Mauritius
MWI  Malawi
MYS  Malaysia
MYT  Mayotte
NAM  Namibia
NCL  New Caledonia
NER  Niger
NFK  Norfolk Island
NGA  Nigeria
NIC  Nicaragua
NIU  Niue
NLD  Netherlands
NOR  Norway
NPL  Nepal
NRU  Nauru
NZL  New Zealand
OMN  Oman
PAK  Pakistan
PAN  Panama
PCN  Pitcairn
PER  Peru
PHL  Philippines
PLW  Palau
PNG  Papua New Guinea
POL  Poland
PRI  Puerto Rico
PRK  Korea (Democratic People's Republic of)
PRT  Portugal
PRY  Paraguay
PSE  Palestine, State of
PYF  French Polynesia
QAT  Qatar
REU  Réunion
ROU  Romania
RUS  Russian Federation
RWA  Rwanda
SAU  Saudi Arabia
SDN  Sudan
SEN  Senegal
SGP  Singapore
SGS  South Georgia and the South Sandwich Islands
SHN  Saint Helena, Ascension and Tristan da Cunha
SJM  Svalbard and Jan Mayen
SLB  Solomon Islands
SLE  Sierra Leone
SLV  El Salvador
SMR  San Marino
SOM  Somalia
SPM  Saint Pierre and Miquelon
SRB  Serbia
SSD  South Sudan
STP  Sao Tome and Principe
SUR  Suriname
SVK  Slovakia
SVN  Slovenia
SWE  Sweden
SWZ  Eswatini
SXM  Sint Maarten (Dutch part)
SYC  Seychelles
SYR  Syrian Arab Republic
TCA  Turks and Caicos Islands
TCD  Chad
TGO  Togo
THA  Thailand
TJK  Tajikistan
TKL  Tokelau
TKM  Turkmenistan
TLS  Timor-Leste
TON  Tonga
TTO  Trinidad and Tobago
TUN  Tunisia
TUR  Turkey
TUV  Tuvalu
TWN  Taiwan, Province of China
TZA  Tanzania, United Republic of
UGA  Uganda
UKR  Ukraine
UMI  United States Minor Outlying Islands
URY  Uruguay
USA  United States of America
UZB  Uzbekistan
VAT  Holy See
VCT  Saint Vincent and the Grenadines
VEN  Venezuela (Bolivarian Republic of)
VGB  Virgin Islands (British)
VIR  Virgin Islands (U.S.)
VNM  Viet Nam
VUT  Vanuatu
WLF  Wallis and Futuna
WSM  Samoa
YEM  Yemen
ZAF  South Africa
ZMB  Zambia
ZWE  Zimbabwe"""

# Convert to dictionary
alpha3 = dict(tuple(re.split(r" {2,}", s)) for s in alpha3.split('\n'))

# List of World Cities & Country
# cities https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv
# Online CSV File

import csv
import urllib.request
import io

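def csv_import(url):
    ' Download a CSV file and return its rows as a list of dicts '
    url_open = urllib.request.urlopen(url)
    csvfile = csv.DictReader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',')
    # Materialize the rows: a DictReader can only be iterated once, but cities is scanned on every lookup
    return list(csvfile)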

url = 'https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv'

cities = csv_import(url)
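The matching functions iterate over `cities` more than once, so it helps to know that a `csv.DictReader` can only be traversed a single time. A minimal sketch with a hypothetical two-row sample (not the real download) shows the difference between iterating the reader directly and materializing it with `list()`:

```python
import csv
import io

# Hypothetical sample using the column names the lookup code relies on.
SAMPLE = (
    "name,country,subcountry\n"
    "Longmont,United States,Colorado\n"
    "Bogota,Colombia,Bogota D.C.\n"
)

# Iterating the reader directly: the second pass sees nothing.
reader = csv.DictReader(io.StringIO(SAMPLE))
first_pass = [row["name"] for row in reader]
second_pass = [row["name"] for row in reader]  # reader is already exhausted

# Materializing once: the rows can be scanned as often as needed.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
names_a = [row["name"] for row in rows]
names_b = [row["name"] for row in rows]

print(first_pass, second_pass)
print(names_a, names_b)
```

Keeping the rows in a list costs memory proportional to the file size, which is usually acceptable for a city table that is scanned on every lookup.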

Test

Excel file (input)

country city
u.s.    
DZ  
AS  
co  Longmont
co  Bogota
AL  
AL  Huntsville
usa 
AFG 
BLR Minsk
AUS 
united states   
Korea   seoul
Korea   Pyongyang

Test code

df = pd.read_excel('country_test.xlsx') # Load Excel File
df.fillna('', inplace=True)

# Get name of country based upon country and city
df['country_'] = df.apply(lambda row: lookup(row['country'], row['city']), axis = 1)

Resulting DataFrame

       country        city                  country_
0            u.s.              United States of America
1              DZ                               Algeria
2              AS                        American Samoa
3              co    Longmont             United States
4              co      Bogota                  Colombia
5              AL                               Albania
6              AL  Huntsville             United States
7             usa              United States of America
8             AFG                           Afghanistan
9             BLR       Minsk                   Belarus
10            AUS                             Australia
11  united states              United States of America
12          Korea       seoul               South Korea
13          Korea   Pyongyang               North Korea