Left join pandas if column value is within a certain range?

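I would like to know whether two datasets can be merged when their key values fall within a certain range of each other.

For example, if I join on zip code, an exact match should not be required: the join should happen whenever the left table's zip code is within some range of the right table's zip code.

Here is some sample data and a diagram to illustrate this.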

# Create sample dataframe
import pandas as pd

data1 = [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'], [90011, 'CA'], [91331, 'CA'], [90650, 'CA']]
data2 = [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900], [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA',2300]]

df_left = pd.DataFrame(data1, columns = ['Zip', 'State'])
df_right = pd.DataFrame(data2, columns = ['Zip', 'State', 'Average_Rent'])

print(df_left.head())
print(df_right.head())

# Merge
df_merge = df_left.merge(df_right, left_on='Zip', right_on = 'Zip', how='left')
print(df_merge)

# Want to merge if within a 5 zipcode radius. If two zips nearby, then choose the first observation.
nan = float('nan')  # real missing values instead of the string 'NaN'
data3 = [[10001, 'NY', 10003, 1200], [10007, 'NY', 10008, 1460], [10007, 'NY', 10010, 1900], [10013, 'NY', nan, nan],
         [90011, 'CA', 90011, 850], [91331, 'CA', nan, nan], [90650, 'CA', 90645, 2300]]

df_want = pd.DataFrame(data3, columns = ['Zip_left', 'State', 'Zip_right', 'Rent'])
df_want.head(6)

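The results look like this:

A traditional left join gives me the result at the top, whereas the result I want is shown at the bottom (I don't care which columns are output; I just typed these up on the spot).

The main rule I want to enforce is the interval merge. For tie-breaking, this example picks the first match, but it is not a problem if the right table's zip code 10010 is matched twice, once to 10007 and once to 10013. Frankly, as long as at least one merge happens, the tie-breaking rule doesn't bother me.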

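Since pandas 1.2.0 you can do a cross merge, which creates the Cartesian product of the two DataFrames. So cross merge, then keep only the rows where the states match. Next, compute the absolute difference between the zip codes and use it to identify the closest row for each "Zip_left". Finally, mask the rows whose difference is greater than 15 (even though they are the closest), so that their right-hand values are filled with NaN: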

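# Cartesian product of both frames (needs pandas >= 1.2.0)
merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
# Keep only same-state pairs; .copy() so the assignments below act on an independent frame
merged = merged[merged['State_left']==merged['State_right']].copy()
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()
# Keep the closest right-hand row(s) for each left zip code
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']].copy()
# Blank out the right-hand columns where even the closest zip is more than 15 away
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff']>15)
out = merged.drop(columns=['State_right','Diff']).rename(columns={'State_left':'State'}).reset_index(drop=True)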

Output:

   Zip_left State  Zip_right  Average_Rent
0     10001    NY    10003.0        1200.0
1     10007    NY    10008.0        1460.0
2     10013    NY    10010.0        1900.0
3     90011    CA    90011.0         850.0
4     91331    CA        NaN           NaN
5     90650    CA    90645.0        2300.0
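
One thing to be aware of: the `transform('min')` filter keeps every right-hand row that ties for the smallest difference, so a left zip code can show up more than once. If you want exactly one row per left zip code (the question suggests taking the first match), a minimal sketch is to pick rows with `idxmin` instead. The helper name `closest_match`, its `max_diff` parameter, and the small tie-forcing data below are assumptions made up for illustration; they are not part of the answer above.

```python
import pandas as pd

def closest_match(df_left, df_right, max_diff=15):
    """Hypothetical helper: left-join each left zip to its single closest
    right zip in the same state, or NaN if none is within max_diff."""
    merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
    merged = merged[merged['State_left'] == merged['State_right']].copy()
    merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()
    # idxmin returns the label of the FIRST minimum per group,
    # so ties resolve to the earliest right-hand row
    best = merged.loc[merged.groupby(['State_left', 'Zip_left'])['Diff'].idxmin()]
    best = best.loc[best['Diff'] <= max_diff,
                    ['State_left', 'Zip_left', 'Zip_right', 'Average_Rent']]
    best = best.rename(columns={'State_left': 'State', 'Zip_left': 'Zip'})
    # Join back to the left frame so unmatched zip codes are kept with NaN
    return df_left.merge(best, how='left', on=['State', 'Zip'])

# Made-up data: 10005 is equally close to 10003 and 10007
left = pd.DataFrame([[10005, 'NY'], [91331, 'CA']], columns=['Zip', 'State'])
right = pd.DataFrame([[10003, 'NY', 1200], [10007, 'NY', 1460], [91315, 'CA', 1700]],
                     columns=['Zip', 'State', 'Average_Rent'])

result = closest_match(left, right)
print(result)
```

Here 10005 should be paired with 10003 only (the first of the two tied rows), and 91331 stays unmatched because its closest candidate is 16 away.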

One option is conditional_join from pyjanitor, which tries to avoid a Cartesian join and so uses memory more efficiently.
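
To get a feel for why avoiding the Cartesian product matters, note that a cross merge materializes `len(df_left) * len(df_right)` rows before any filtering happens. A small sketch that counts the intermediate rows on the sample frames from the question (the variable names `cross`, `same_state` and `close` are just illustrative):

```python
import pandas as pd

df_left = pd.DataFrame(
    [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'],
     [90011, 'CA'], [91331, 'CA'], [90650, 'CA']],
    columns=['Zip', 'State'])
df_right = pd.DataFrame(
    [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900],
     [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA', 2300]],
    columns=['Zip', 'State', 'Average_Rent'])

# Every left row paired with every right row
cross = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
# Pairs that survive the state filter
same_state = cross[cross['State_left'] == cross['State_right']]
# Pairs that also satisfy the distance rule
close = same_state[same_state['Zip_left'].sub(same_state['Zip_right']).abs() <= 15]

print(len(cross))       # 36 rows built up front (6 x 6)
print(len(same_state))  # 18 rows left after the state filter
print(len(close))       # 11 rows actually within 15 of each other
```

On six rows per side this is harmless, but the intermediate frame grows with the product of the two lengths, which is the cost a non-equi join tries to sidestep.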

I have also borrowed from enke's solution (who understood your question better than I did):

# install latest from dev
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

Create start and end boundaries for df_right:

df_right = (df_right
            .assign(
                start = df_right.Zip - 15, 
                end = df_right.Zip + 15)
            .rename(columns = {'Zip':'Zip_right', 
                               'State':'State_right'})
            )
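
The `start` and `end` columns turn the distance rule into an interval test: `start <= Zip <= end` is the same condition as `|Zip - Zip_right| <= 15`. A quick plain-pandas check of that equivalence, as a sketch on a few made-up pairs (the `pairs` frame is illustrative only and does not use pyjanitor):

```python
import pandas as pd

# Made-up (left zip, right zip) pairs on both sides of the cutoff,
# including one pair that is exactly 15 apart
pairs = pd.DataFrame({'Zip': [10001, 10013, 91331, 90650, 10025],
                      'Zip_right': [10003, 10010, 91315, 90645, 10010]})
pairs = pairs.assign(start=pairs['Zip_right'] - 15,
                     end=pairs['Zip_right'] + 15)

# Interval form: the shape the two inequality join conditions take
in_interval = pairs['Zip'].between(pairs['start'], pairs['end'])
# Distance form: the shape used in the cross-merge answer
within_15 = pairs['Zip'].sub(pairs['Zip_right']).abs() <= 15

print(in_interval.equals(within_15))
```

Both forms are inclusive at the boundary, so a pair exactly 15 apart counts as a match either way.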

Compute the conditional_join:

out = (df_left
      .conditional_join(
          df_right,
          ('State', 'State_right', '=='),
          ('Zip', 'start', '>='),
          ('Zip', 'end', '<='),
          how = 'left')
      .assign(diff_ = lambda df: df.Zip.sub(df.Zip_right).abs())
       )

Get the row with the smallest diff_ for each Zip:

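(
out
# you can get better performance
# if you skip the sort
# and just grab the first match
.sort_values(['Zip', 'diff_'])
# keep the first (closest) row per Zip; unlike groupby(...).nth(0),
# this behaves the same before and after pandas 2.0
.drop_duplicates('Zip')
.loc[:, ['Zip', 'State', 'Zip_right', 'Average_Rent']]
.reset_index(drop=True)
)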

     Zip State  Zip_right  Average_Rent
0  10001    NY    10003.0        1200.0
1  10007    NY    10008.0        1460.0
2  10013    NY    10010.0        1900.0
3  90011    CA    90011.0         850.0
4  90650    CA    90645.0        2300.0
5  91331    CA        NaN           NaN
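
For this particular rule (nearest zip code within a fixed distance, matched inside the same state), pandas' built-in `pd.merge_asof` may also be worth a look. To be clear, this is a different technique from both answers above, offered only as a sketch: it needs both frames sorted on the join key and returns at most one right-hand row per left row. The `tolerance=15` value simply mirrors the cutoff used above.

```python
import pandas as pd

df_left = pd.DataFrame(
    [[10001, 'NY'], [10007, 'NY'], [10013, 'NY'],
     [90011, 'CA'], [91331, 'CA'], [90650, 'CA']],
    columns=['Zip', 'State'])
df_right = pd.DataFrame(
    [[10003, 'NY', 1200], [10008, 'NY', 1460], [10010, 'NY', 1900],
     [90011, 'CA', 850], [91315, 'CA', 1700], [90645, 'CA', 2300]],
    columns=['Zip', 'State', 'Average_Rent'])

out = pd.merge_asof(
    df_left.sort_values('Zip'),                     # merge_asof needs sorted keys
    df_right.rename(columns={'Zip': 'Zip_right'}).sort_values('Zip_right'),
    left_on='Zip',
    right_on='Zip_right',
    by='State',             # only match rows within the same state
    direction='nearest',    # closest key on either side
    tolerance=15,           # leave NaN when the closest key is further away
)
print(out)
```

On the sample data this should reproduce the matches shown above, with 91331 left unmatched. The result comes back sorted by `Zip` rather than in the original row order.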