Merging two dataframes - conditional rows
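I'm currently working with a CSV file from the Indian Food Prices dataset on Kaggle, which I've loaded into a dataframe. One of its columns, 'Market', holds categorical location data. For an ML problem I need latitude and longitude, which I managed to get through the OpenStreetMap API. Because the original dataframe has over 131,000 rows, I created a separate dataframe containing just one instance of each location; this new dataframe has 165 rows.
I now need to merge the two dataframes, but I'm struggling to come up with a loop that fills every row of the original 131,000+ row dataframe with the latitude and longitude from the smaller 165-row dataframe, matched on the location in the 'Market' column.
Any suggestions would be much appreciated.
Here is my attempt at this: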
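import numpy as np
def using_where(ndf):
    # Hard-codes the coordinates for one market; every other row gets None.
    ndf['Lat-Long'] = np.where(ndf['Market'] == 'Delhi', '28.6517178, 77.2219388', None)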
Here is the head of my large dataframe, 'ndf':
<bound method NDFrame.head of         Unnamed: 0        Date     Market               Category  \
0                1  1994-01-15      Delhi     cereals and tubers
1                2  1994-01-15      Delhi     cereals and tubers
2                3  1994-01-15      Delhi     miscellaneous food
3                4  1994-01-15      Delhi           oil and fats
4                5  1994-01-15  Ahmedabad     cereals and tubers
...            ...         ...        ...                    ...
139529      139530  2021-09-15  Kharagpur        pulses and nuts
139530      139531  2021-09-15  Kharagpur        pulses and nuts
139531      139532  2021-09-15  Kharagpur        pulses and nuts
139532      139533  2021-09-15  Kharagpur  vegetables and fruits
139533      139534  2021-09-15  Kharagpur  vegetables and fruits

              Commodity  Unit  PriceFlag  PriceType  Currency  Price  USD_Price
0                  Rice    KG     actual     Retail       INR    8.0     0.2545
1                 Wheat    KG     actual     Retail       INR    5.0     0.1590
2                 Sugar    KG     actual     Retail       INR   13.5     0.4294
3         Oil (mustard)    KG     actual     Retail       INR   31.0     0.9860
4                  Rice    KG     actual     Retail       INR    6.8     0.2163
...                 ...   ...        ...        ...       ...    ...        ...
139529  Lentils (masur)    KG     actual     Retail       INR  110.0     1.4972
139530  Lentils (moong)    KG     actual     Retail       INR  120.0     1.6333
139531   Lentils (urad)    KG     actual     Retail       INR  115.0     1.5653
139532           Onions    KG     actual     Retail       INR   30.0     0.4083
139533         Tomatoes    KG     actual     Retail       INR   40.0     0.5444
Here is the head of my small dataframe, 'df':
<bound method NDFrame.head of             Market                         geocoded
0            Delhi         (28.6517178, 77.2219388)
4        Ahmedabad         (23.0216238, 72.5797068)
8           Shimla         (31.1041526, 77.1709729)
11       Bengaluru          (12.9767936, 77.590082)
14          Bhopal          (23.2584857, 77.401989)
...            ...                              ...
136823   Dantewada  (18.8640648, 81.38339468738648)
136970     Selamba                               -1
137053      Bodeli         (22.2748105, 73.7166363)
137326     Dhanbad         (23.7952809, 86.4309638)
137389  Jamshedpur         (22.8015194, 86.2029579)

[165 rows x 2 columns]>
I think you can just use merge(), unless I'm missing something:
ndf = pd.merge(ndf, df, how='inner', on='Market')
Here is a complete code example with a test case:
import pandas as pd
ndf = pd.DataFrame({'Date':['1994-01-15']*5 + ['2021-09-15']*5, 'Market':'Delhi,Delhi,Delhi,Delhi,Ahmedabad,Kharagpur,Kharagpur,Kharagpur,Kharagpur,Kharagpur'.split(','),
'Category':'cereals and tubers,cereals and tubers,miscellaneous food,oil and fats,cereals and tubers,pulses and nuts,pulses and nuts,pulses and nuts,vegetables and fruits,vegetables and fruits'.split(',')})
df = pd.DataFrame({'Market':'Delhi,Ahmedabad,Shimla,Bengaluru,Bhopal,Kharagpur'.split(','),
'geocoded':[(28.6517178, 77.2219388),(23.0216238, 72.5797068),(31.1041526, 77.1709729),(12.9767936, 77.590082),(23.2584857, 77.401989),(22.22, 73.73)]})
ndf = pd.merge(ndf, df, how='inner', on='Market')
print(ndf)
Output:
         Date     Market               Category                  geocoded
0  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
1  1994-01-15      Delhi     cereals and tubers  (28.6517178, 77.2219388)
2  1994-01-15      Delhi     miscellaneous food  (28.6517178, 77.2219388)
3  1994-01-15      Delhi           oil and fats  (28.6517178, 77.2219388)
4  1994-01-15  Ahmedabad     cereals and tubers  (23.0216238, 72.5797068)
5  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
6  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
7  2021-09-15  Kharagpur        pulses and nuts            (22.22, 73.73)
8  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)
9  2021-09-15  Kharagpur  vegetables and fruits            (22.22, 73.73)
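One thing to watch with how='inner': any row whose 'Market' has no entry in the lookup frame is silently dropped. If you would rather keep all rows and see which markets failed to match, a left merge is one option. The sketch below is only an illustration on toy data; the 'Latitude' and 'Longitude' column names and the split_coords helper are my own assumptions, and it treats the -1 shown for Selamba as "no coordinates found".

```python
import pandas as pd

# Toy stand-ins for the two frames in the question (values copied from the post).
ndf = pd.DataFrame({
    'Market': ['Delhi', 'Delhi', 'Ahmedabad', 'Selamba', 'Kharagpur'],
    'Commodity': ['Rice', 'Wheat', 'Rice', 'Onions', 'Tomatoes'],
})
df = pd.DataFrame({
    'Market': ['Delhi', 'Ahmedabad', 'Selamba'],
    # -1 is assumed to mean the geocoder found nothing, as for Selamba in the question.
    'geocoded': [(28.6517178, 77.2219388), (23.0216238, 72.5797068), -1],
})

# A left merge keeps every row of ndf. validate raises MergeError if a Market
# appears more than once in df; indicator adds a '_merge' column per row.
merged = ndf.merge(df, how='left', on='Market',
                   validate='many_to_one', indicator=True)

# Markets that appear in ndf but have no row in the lookup frame.
missing = merged.loc[merged['_merge'] == 'left_only', 'Market'].unique().tolist()


def split_coords(value):
    """Hypothetical helper: (lat, lon) for a 2-tuple, (NaN, NaN) otherwise."""
    if isinstance(value, tuple) and len(value) == 2:
        return float(value[0]), float(value[1])
    return float('nan'), float('nan')


# Split the tuple column into two numeric columns for the ML step.
lat_lon = [split_coords(v) for v in merged['geocoded']]
merged['Latitude'] = [lat for lat, _ in lat_lon]
merged['Longitude'] = [lon for _, lon in lat_lon]
merged = merged.drop(columns=['geocoded', '_merge'])

print(merged)
print('Markets without a lookup row:', missing)
```

With this approach the row count of the merged frame equals that of ndf, and unmatched or ungeocoded markets show up as NaN in the two coordinate columns instead of disappearing.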