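Python Pandas compare two dataframes to assign country to phone number
I have two dataframes that are read in from csv files. The first dataframe consists of a phone number and some additional data. The second dataframe contains country codes and country names.
I want to take the phone number from the first dataset and compare it with the country codes in the second. Country codes can be 1 to 4 digits long, so I go from the longest country code to the shortest. If there is a match, I want to assign the country name to the phone number.
Input longlist: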
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
6323450001, info4
496789521134, info5
Input country_list:
country;country_code;order_info
Spain;34;1
Pakistan;92;4
USA;1;2
Philippines;63;3
Germany;49;4
Poland;48;1
Norway;47;2
The output should be:
phonenumber, add_info, country, order_info
34123425209, info1, Spain, 1
92654321762, info2, Pakistan, 4
12018883637, info3, USA, 2
6323450001, info4, Philippines, 3
496789521134, info5, Germany, 4
I once solved it like this:
#! /usr/bin/python
import csv
import pandas

with open('longlist.csv', 'r') as lookuplist:
    with open('country_list.csv', 'r') as inputlist:
        with open('Outputfile.csv', 'w') as outputlist:
            reader = csv.reader(lookuplist, delimiter=',')
            reader2 = csv.reader(inputlist, delimiter=';')
            writer = csv.writer(outputlist, dialect='excel')
            for i in reader2:
                for xl in reader:
                    if xl[0].startswith(i[1]):
                        zeile = [xl[0], xl[1], i[0], i[1], i[2]]
                        writer.writerow(zeile)
                # rewind the phone list before checking the next country
                lookuplist.seek(0)
But I would like to solve this problem using pandas. What I got to work:
- reading in the csv files
- removing duplicates from "longlist"
- sorting the list of countries / country codes
This is what I have working already:
import pandas as pd, numpy as np

longlist = pd.read_csv('path/to/longlist.csv',
                       usecols=[2,3], names=['PHONENUMBER','ADD_INFO'])
country_list = pd.read_csv('path/to/country_list.csv',
                           sep=';', names=['COUNTRY','COUNTRY_CODE','ORDER_INFO'], skiprows=[0])
# remove duplicates and make phone number an index
longlist = longlist.drop_duplicates('PHONENUMBER')
longlist = longlist.set_index('PHONENUMBER')
# Sort country list, from high to low value and make country code an index
country_list = country_list.sort_values(by='COUNTRY_CODE', ascending=False)
country_list = country_list.set_index('COUNTRY_CODE')
(...)
longlist.to_csv('path/to/output.csv')
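But whatever I try in order to do the same with the datasets does not work. I cannot apply startswith (I can neither iterate over the objects nor apply it to them). I would really appreciate your help.
I would do it this way:
import pandas as pd

# read both key columns as strings so that leading zeros survive and slicing works
cl = pd.read_csv('country_list.csv', sep=';', dtype={'country_code':str})
ll = pd.read_csv('phones.csv', skipinitialspace=True, dtype={'phonenumber':str})

# lookup maps every known country code to itself,
# so lookup.get(prefix) returns the code on a hit and None otherwise
lookup = pd.Series(cl['country_code'].values, index=cl['country_code'].values)

ll['country_code'] = (
    ll['phonenumber']
    # candidate codes for the 4-, 3-, 2- and 1-digit prefixes, longest first
    .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
                                lookup.get(x[:2]), lookup.get(x[:1])]))
    # keep the first non-null candidate in each row, i.e. the longest match
    .apply(lambda x: x.get(x.first_valid_index()), axis=1)
)

# remove `how='left'` parameter if you don't need "unmatched" phone-numbers
result = ll.merge(cl, on='country_code', how='left')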
Output:
In [195]: result
Out[195]:
    phonenumber  add_info  country_code      country  order_info
0   34123425209     info1            34        Spain         1.0
1   92654321762     info2            92     Pakistan         4.0
2   12018883637     info3             1          USA         2.0
3   12428883637    info31          1242      Bahamas         3.0
4    6323450001     info4            63  Philippines         3.0
5  496789521134     info5            49      Germany         4.0
6   00000000000       BAD          None          NaN         NaN
Explanation:
In [216]: (ll['phonenumber']
   .....:    .apply(lambda x: pd.Series([lookup.get(x[:4]), lookup.get(x[:3]),
   .....:                                lookup.get(x[:2]), lookup.get(x[:1])]))
   .....: )
Out[216]:
      0     1     2     3
0  None  None    34  None
1  None  None    92  None
2  None  None  None     1
3  1242  None  None     1
4  None  None    63  None
5  None  None    49  None
6  None  None  None  None
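To make the second `.apply(...)` step easier to follow, here is a minimal sketch of what it does to a single row of the table above. The variable names (`row`, `empty`, and so on) are only illustrative and not part of the answer's code; the row values are taken from index 3 and index 6 of the intermediate output.

```python
import pandas as pd

# One row of the intermediate table: candidates for the 4-, 3-, 2- and
# 1-digit prefixes of "12428883637", in that order.
row = pd.Series(["1242", None, None, "1"], dtype=object)

# first_valid_index() returns the label of the first non-null entry.
# Because the columns are ordered longest prefix first, this is the
# longest matching country code.
first = row.first_valid_index()
longest_match = row.get(first)

# A row with no match at all (the "00000000000" case):
# first_valid_index() returns None, so no code is assigned.
empty = pd.Series([None, None, None, None], dtype=object)
no_match = empty.first_valid_index()

print(first, longest_match, no_match)
```

This is why the Bahamas number ends up with `1242` rather than `1`, even though both prefixes are found in the lookup.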
phones.csv, to which I deliberately added a Bahamas number (1242...) and an invalid number (00000000000):
phonenumber, add_info
34123425209, info1
92654321762, info2
12018883637, info3
12428883637, info31
6323450001, info4
496789521134, info5
00000000000, BAD
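As a possible variation (not part of the answer above), the same longest-prefix idea can be sketched with vectorized string slicing instead of a row-wise `apply`. This is a sketch under assumptions: the two frames are built inline from a few of the sample rows rather than read from csv, the Bahamas entry is taken from the output shown earlier, and the names `codes`, `matched` and `candidate` are made up for illustration.

```python
import pandas as pd

# Sample data built inline (assumption: mirrors a subset of the csv files above).
cl = pd.DataFrame({
    "country": ["Spain", "Pakistan", "USA", "Bahamas", "Philippines", "Germany"],
    "country_code": ["34", "92", "1", "1242", "63", "49"],
    "order_info": [1, 4, 2, 3, 3, 4],
})
ll = pd.DataFrame({
    "phonenumber": ["34123425209", "12018883637", "12428883637", "00000000000"],
    "add_info": ["info1", "info3", "info31", "BAD"],
})

codes = set(cl["country_code"])

# Start with "no match" everywhere, then try prefixes from longest to shortest.
matched = pd.Series([None] * len(ll), index=ll.index, dtype=object)
for n in (4, 3, 2, 1):
    prefix = ll["phonenumber"].str[:n]
    # keep the prefix only where it is a known country code
    candidate = prefix.where(prefix.isin(codes))
    # fill only rows that have not been matched by a longer prefix
    matched = matched.where(matched.notna(), candidate)

# how='left' keeps phone numbers without a matching country code
result = ll.assign(country_code=matched).merge(cl, on="country_code", how="left")
print(result)
```

The loop runs four times regardless of how many phone numbers there are, which may be preferable for large inputs; for small files the `apply`-based version is just as workable.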