Python Pandas: compare two dataframes to assign a country to a phone number

Question

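I have two dataframes that I read in from CSV files. The first one consists of a phone number and some additional data. The second one contains country codes and country names.

I want to take each phone number from the first dataset and compare it with the country codes in the second dataset. A country code can be 1 to 4 digits long, so I check from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.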

Input longlist:

phonenumber, add_info    
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5

Input country_list:

country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2

The output should be:

phonenumber, add_info, country, order_info    
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4

I once solved it like this:

#!/usr/bin/python
import csv

with open('longlist.csv', 'r') as lookuplist:
    with open('country_list.csv', 'r') as inputlist:
        with open('Outputfile.csv', 'w') as outputlist:
            reader = csv.reader(lookuplist, delimiter=',')
            reader2 = csv.reader(inputlist, delimiter=';')
            writer = csv.writer(outputlist, dialect='excel')

            # For every country row, scan all phone numbers for a matching prefix
            for i in reader2:
                for xl in reader:
                    if xl[0].startswith(i[1]):
                        zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                        writer.writerow(zeile)
                # Rewind the phone list before checking the next country
                lookuplist.seek(0)

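But I want to solve this problem using pandas. What I have working so far:
- Reading the CSV files
- Removing duplicates from "longlist"
- Sorting the list of countries / country codes

This is what I am already doing: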

import pandas as pd, numpy as np
longlist = pd.read_csv('path/to/longlist.csv', 
                                 usecols=[2,3], names=['PHONENUMBER','ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv', 
                           sep=';', names=['COUNTRY','COUNTRY_CODE','ORDER_INFO'], skiprows=[0])

# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')

# Sort country list, from high to low value and make country code an index
country_list = country_list.sort_values(by='COUNTRY_CODE', ascending=False)
country_list=country_list.set_index('COUNTRY_CODE')

(...)

longlist.to_csv('path/to/output.csv')

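But every attempt to do the same thing with the dataframes has failed. I cannot apply startswith (I can neither iterate over the object nor apply it to the object). I would sincerely appreciate your help.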

Answer 1

I would do it this way:

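import pandas as pd

# Read codes and phone numbers as strings so prefixes and leading zeros survive
cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str})
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber':str})

# Series indexed by country code: lookup.get(prefix) returns the code or None
lookup = cl['country_code']
lookup.index = cl['country_code']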

ll['country_code'] = (
    ll['phonenumber']
    .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                lookup.get(x[:2]), lookup.get(x[:1])]))
    .apply(lambda x: x.get(x.first_valid_index()), axis=1)
)

# remove `how='left'` parameter if you don't need "unmatched" phone-numbers    
result = ll.merge(cl, on='country_code', how='left')

Output:

In [195]: result
Out[195]:
    phonenumber add_info country_code      country  order_info
0   34123425209    info1           34        Spain         1.0
1   92654321762    info2           92     Pakistan         4.0
2   12018883637    info3            1          USA         2.0
3   12428883637   info31         1242      Bahamas         3.0
4    6323450001    info4           63  Philippines         3.0
5  496789521134    info5           49      Germany         4.0
6   00000000000      BAD         None          NaN         NaN
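
The unmatched row above is also why `order_info` is displayed as floats: the left merge introduces NaN, which forces a float column. As a small sketch (not part of the original answer, and using tiny inline frames as stand-ins for the CSV data), `indicator=True` can be used to list the numbers that found no country, and the nullable `Int64` dtype keeps `order_info` integral:

```python
import pandas as pd

# Inline stand-ins for the frames built above (assumed, shortened sample data)
ll = pd.DataFrame({"phonenumber": ["34123425209", "00000000000"],
                   "country_code": ["34", None]})
cl = pd.DataFrame({"country": ["Spain"],
                   "country_code": ["34"],
                   "order_info": [1]})

# indicator=True adds a "_merge" column telling where each row came from
result = ll.merge(cl, on="country_code", how="left", indicator=True)

# Phone numbers for which no country code matched
unmatched = result.loc[result["_merge"] == "left_only", "phonenumber"]

# Nullable integer dtype: missing values become <NA> instead of forcing floats
result["order_info"] = result["order_info"].astype("Int64")

print(unmatched.tolist())
print(result)
```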

Explanation:

In [216]: (ll['phonenumber']
   .....:   .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
   .....:                               lookup.get(x[:2]), lookup.get(x[:1])]))
   .....: )
Out[216]:
      0     1     2     3
0  None  None    34  None
1  None  None    92  None
2  None  None  None     1
3  1242  None  None     1
4  None  None    63  None
5  None  None    49  None
6  None  None  None  None
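
The table shows the idea: one column per prefix length, and the first non-empty column from the left wins. The same longest-prefix rule can also be sketched without the row-wise `apply`, by looping over the prefix lengths 4 to 1 and filling only the rows that are still unmatched. This is an alternative sketch, not the method used above; the inline frames stand in for the CSV files, and the Bahamas row in the country list is an assumption taken from the output shown earlier.

```python
import pandas as pd

# Inline stand-ins for country_list.csv and phones.csv (codes kept as strings)
cl = pd.DataFrame({
    "country": ["Spain", "USA", "Bahamas", "Germany"],
    "country_code": ["34", "1", "1242", "49"],
    "order_info": [1, 2, 3, 4],
})
ll = pd.DataFrame({
    "phonenumber": ["34123425209", "12018883637", "12428883637", "00000000000"],
    "add_info": ["info1", "info3", "info31", "BAD"],
})

known_codes = set(cl["country_code"])

# Start with no match anywhere, then try the longest prefix first
code = pd.Series(index=ll.index, dtype="object")
for n in (4, 3, 2, 1):
    prefix = ll["phonenumber"].str[:n]
    # Keep the prefix only where it is a known country code
    candidate = prefix.where(prefix.isin(known_codes))
    # Fill only the rows that no longer prefix has matched yet
    code = code.where(code.notna(), candidate)

ll["country_code"] = code
result = ll.merge(cl, on="country_code", how="left")
print(result)
```

Because longer prefixes are tried first, the Bahamas number is assigned 1242 rather than 1, matching the behaviour of the `first_valid_index` step above.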

phones.csv (I deliberately added a Bahamas number (1242...) and an invalid number (00000000000)):

phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
12428883637, info31
6323450001, info4
496789521134, info5
00000000000, BAD
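
Note that this file has a space after every comma and one number made only of zeros, which is why `read_csv` is called with `skipinitialspace=True` and `dtype={'phonenumber': str}` above. A short sketch of the difference, using `io.StringIO` with a shortened copy of the data in place of the real file:

```python
import io
import pandas as pd

# Shortened copy of phones.csv, kept in memory instead of on disk
csv_text = """phonenumber, add_info
34123425209, info1
00000000000, BAD
"""

# Default parsing: the space stays in the column name, numbers become integers
plain = pd.read_csv(io.StringIO(csv_text))

# Options used in the answer: strip the space, keep phone numbers as strings
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True,
                    dtype={"phonenumber": str})

print(list(plain.columns))            # second name keeps its leading space
print(list(clean.columns))
print(clean["phonenumber"].tolist())  # leading zeros are preserved
```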
