Left join pandas if column value is within a certain range?
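I would like to know whether two datasets can be merged when their values are within a certain range of each other.
For example, if I join on zip code, an exact match should not be required: the join should happen whenever the left table's zip code is within a certain range of the right table's zip code.
Here is some sample data and a diagram to illustrate this.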
import pandas as pd

# Create sample dataframes
data1 = [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'], [90011, 'CA'], [91331, 'CA'], [90650, 'CA']]
data2 = [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900], [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA',2300]]
df_left = pd.DataFrame(data1, columns = ['Zip', 'State'])
df_right = pd.DataFrame(data2, columns = ['Zip', 'State', 'Average_Rent'])
print(df_left.head())
print(df_right.head())
# Merge
df_merge = df_left.merge(df_right, left_on='Zip', right_on = 'Zip', how='left')
print(df_merge)
# Want to merge if within a 5 zipcode radius. If two zips nearby, then choose the first observation.
nan = float('nan')  # real missing values instead of the string 'NaN'
data3 = [[10001, 'NY', 10003, 1200], [10007, 'NY', 10008, 1460], [10007, 'NY', 10010, 1900], [10013, 'NY', nan, nan],
         [90011, 'CA', 90011, 850], [91331, 'CA', nan, nan], [90650, 'CA', 90645, 2300]]
df_want = pd.DataFrame(data3, columns = ['Zip_left', 'State', 'Zip_right', 'Rent'])
df_want.head(6)
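The results look like this:
A conventional left join gives me the top result, whereas the result I want is shown at the bottom (I don't care which columns are output; I just typed them in on the spot).
The main rule I want to enforce is the interval merge. For tie-breaking, this example picks the first match, but it is not a problem if zip code 10010 in the right table is matched twice: once to 10007 and once to 10013. Frankly, as long as at least one merge happens, the tie-breaking rule doesn't matter to me.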
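Since pandas 1.2.0 you can do a cross merge (how='cross'), which creates the Cartesian product of the two DataFrames. So cross merge, then keep only the rows where the states match. Next, compute the absolute difference between the zip codes and use it to identify the closest row for each "Zip_left". Finally, mask the rows whose difference is greater than 15 (even if they are the closest match), so that they are filled with NaN: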
merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
merged = merged[merged['State_left']==merged['State_right']]
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']]
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff']>15)
out = merged.drop(columns=['State_right','Diff']).rename(columns={'State_left':'State'}).reset_index(drop=True)
Output:
Zip_left State Zip_right Average_Rent
0 10001 NY 10003.0 1200.0
1 10007 NY 10008.0 1460.0
2 10013 NY 10010.0 1900.0
3 90011 CA 90011.0 850.0
4 91331 CA NaN NaN
5 90650 CA 90645.0 2300.0
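If a nearest match within a tolerance is all that is needed, pandas' own merge_asof may be worth a look as well, since it does not build the Cartesian product. The following is only a minimal sketch and not part of the answer above; it assumes integer zip codes and the same tolerance of 15, and the names left_sorted, right_sorted and nearest are made up for illustration.

```python
import pandas as pd

df_left = pd.DataFrame(
    [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'],
     [90011, 'CA'], [91331, 'CA'], [90650, 'CA']],
    columns=['Zip', 'State'])
df_right = pd.DataFrame(
    [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900],
     [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA', 2300]],
    columns=['Zip', 'State', 'Average_Rent'])

# merge_asof requires both frames to be sorted by their join key
left_sorted = df_left.sort_values('Zip')
right_sorted = (df_right
                .rename(columns={'Zip': 'Zip_right'})
                .sort_values('Zip_right'))

nearest = pd.merge_asof(
    left_sorted, right_sorted,
    left_on='Zip', right_on='Zip_right',
    by='State',             # only match rows within the same state
    direction='nearest',    # closest zip on either side
    tolerance=15,           # no match if the closest zip is further away
)
print(nearest)
```

Note that the result is ordered by Zip rather than by the original row order of df_left, and each left row gets at most one match, so no separate tie-breaking step is needed.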
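One option is conditional_join from pyjanitor, which tries to avoid a Cartesian join and so uses less memory.
I also borrowed from enke's solution (who understood your question better than I did):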
# install latest from dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
Create start and end bounds for df_right:
df_right = (df_right
.assign(
start = df_right.Zip - 15,
end = df_right.Zip + 15)
.rename(columns = {'Zip':'Zip_right',
'State':'State_right'})
)
out = (df_left
.conditional_join(
df_right,
('State', 'State_right', '=='),
('Zip', 'start', '>='),
('Zip', 'end', '<='),
how = 'left')
.assign(diff_ = lambda df: df.Zip.sub(df.Zip_right).abs())
)
Get the row with the smallest diff_ for each Zip:
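(
    out
    # you can get better performance
    # if you skip the sort
    # and just grab the first match
    .sort_values(['Zip', 'diff_'])
    # keep the first (closest) row per Zip; unlike groupby('Zip').nth(0),
    # this keeps 'Zip' as a regular column on pandas 2.x as well
    .drop_duplicates('Zip')
    .loc[:, ['Zip', 'State', 'Zip_right', 'Average_Rent']]
    .reset_index(drop=True)
)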
Zip State Zip_right Average_Rent
0 10001 NY 10003.0 1200.0
1 10007 NY 10008.0 1460.0
2 10013 NY 10010.0 1900.0
3 90011 CA 90011.0 850.0
4 90650 CA 90645.0 2300.0
5 91331 CA NaN NaN
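To see why the final "smallest diff_" step is needed, it may help to look at which pairs the three join conditions actually select. The sketch below emulates the left conditional join in plain pandas with a cross merge and a filter, so it does build the Cartesian product that conditional_join tries to avoid. It is only an illustration, not pyjanitor's implementation, and the names pairs, matched and emulated are made up for this example.

```python
import pandas as pd

df_left = pd.DataFrame(
    [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'],
     [90011, 'CA'], [91331, 'CA'], [90650, 'CA']],
    columns=['Zip', 'State'])
df_right = pd.DataFrame(
    [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900],
     [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA', 2300]],
    columns=['Zip_right', 'State_right', 'Average_Rent'])

# same +/- 15 bounds as in the answer
df_right = df_right.assign(start=df_right.Zip_right - 15,
                           end=df_right.Zip_right + 15)

# every left row paired with every right row
pairs = df_left.merge(df_right, how='cross')

# the three conditions: same state and start <= Zip <= end
matched = pairs[(pairs.State == pairs.State_right)
                & (pairs.Zip >= pairs.start)
                & (pairs.Zip <= pairs.end)]

# left-join semantics: left rows without any match come back with NaN
emulated = df_left.merge(matched, on=['Zip', 'State'], how='left')
print(emulated.groupby('Zip').size())
```

With a tolerance of 15, each of the three NY zip codes falls inside the interval of all three NY rows on the right, so they appear three times each before the closest match is picked, while 91331 matches nothing and is kept as a single row of NaN.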