How can I make sure that a change in one dataframe is reflected by a similar change in another dataframe?

I am working with two dataframes similar to these:

import numpy as np
import pandas as pd

records_A = pd.DataFrame({
    'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006],
    'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
    'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})

records_B = pd.DataFrame({
    'id': [1001,1001,1002,1003,1004,1005],
    'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
    'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})

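I need to run a check to make sure that location changes in records_A correspond to plan changes in records_B, but I can't quite figure out how to do this in a Pythonic way.

For example, id 1001 moves from NJ to OH on 4/1, as shown in records_A. This corresponds to a plan start date of 4/1 in records_B, so this is valid.

Compare that with id 1003. This person moves from NJ to NC on 6/1, but records_B has no plan change corresponding to this move. I need to flag this.

I've started tackling this by analyzing the location changes in records_A. After getting some help here, I tried creating a location-change column: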

records_A['location_change?']=np.where(records_A.groupby('id').location.apply(lambda x: x!=x.iloc[0]),'Changed','Unchanged')

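This helps identify changes in records_A, but I don't know how to compare these results against records_B. Also, for people with more than two records, it compares each location against the person's first location, so row 3 is marked 'Changed' even though the location did not change there (1001 did not move on 6/1).

Any ideas?

Thanks!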

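Not sure I fully understand what you want, but you could do an outer join of the two dataframes on the columns id and date for records_A and id and plan start for records_B. That way, changes that are not present in records_B are automatically "flagged" with NaN values: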

import pandas as pd

records_A = pd.DataFrame({
    'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006], 
    'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
    'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})
records_B = pd.DataFrame({
    'id': [1001,1001,1002,1003,1004,1005], 
    'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
    'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})

merged_df = records_A.merge(
    records_B,
    left_on=['id', 'date'],
    right_on=['id', 'plan start'],
    how='outer',
)
print(merged_df)

# output:
#       id location date plan start plan end
# 0   1001       NJ  1/1        1/1     3/31
# 1   1001       OH  4/1        4/1    12/31
# 2   1001       OH  6/1        NaN      NaN
# 3   1002       CA  1/1        1/1    12/31
# 4   1002       CA  4/1        NaN      NaN
# 5   1002       CA  6/1        NaN      NaN
# 6   1003       NJ  1/1        1/1    12/31
# 7   1003       NJ  4/1        NaN      NaN
# 8   1003       NC  6/1        NaN      NaN
# 9   1004       PA  1/1        1/1    12/31
# 10  1005       UT  1/1        1/1    12/31
# 11  1006       AZ  1/1        NaN      NaN
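
As a possible variation on the join above, here is a minimal sketch that uses a left join with pandas' `indicator=True` option, so the match status becomes an explicit column instead of being read off the NaN values. The column name `has_plan_start` is made up for this illustration and is not part of the original data.

```python
import pandas as pd

records_A = pd.DataFrame({
    'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006],
    'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
    'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})
records_B = pd.DataFrame({
    'id': [1001,1001,1002,1003,1004,1005],
    'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
    'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})

# A left join keeps every row of records_A in its original order.
# indicator=True adds a `_merge` column that is 'both' when a matching
# (id, plan start) row exists in records_B and 'left_only' otherwise.
merged = records_A.merge(
    records_B,
    left_on=['id', 'date'],
    right_on=['id', 'plan start'],
    how='left',
    indicator=True,
)

# Hypothetical helper column: True when a plan starts on the record's date.
merged['has_plan_start'] = merged['_merge'] == 'both'
print(merged[['id', 'location', 'date', 'has_plan_start']])
```

A left join is used here on the assumption that only the rows of records_A need checking; plans in records_B with no matching record would be dropped, which the outer join would keep.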

You don't need to use apply to flag location changes. Here is a complete working solution.

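import pandas as pd

# Rename the plan columns so they match the names used below
records_B = records_B.rename(columns={'plan start': 'plan_start', 'plan end': 'plan_end'})

# Flag location changes in records_A by comparing each row with the previous row of the same id.
# bfill() fills the missing first-row value of each id from the next row, so the first record of a
# multi-record id is not counted as a change; ids with a single record (1004-1006 here) end up
# marked as changed.
records_A['location_changed'] = (
    records_A.groupby('id')['location']
    .shift()
    .bfill()
    .ne(records_A['location'])
    .astype(int)
)

# Merge the records
merged_records = records_A.merge(
    records_B,
    left_on=['id', 'date'],
    right_on=['id', 'plan_start'],
    how='outer',
)

# Flag potential errors: the location changed but no plan starts on that date
merged_records['flag'] = 0
merged_records.loc[
    (merged_records['location_changed'] == 1) & merged_records['plan_start'].isnull(),
    'flag',
] = 1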

This outputs the following, where the flag column indicates errors.

      id location date  location_changed plan_start plan_end  flag
0   1001       NJ  1/1                 0        1/1     3/31     0
1   1001       OH  4/1                 1        4/1    12/31     0
2   1001       OH  6/1                 0        NaN      NaN     0
3   1002       CA  1/1                 0        1/1    12/31     0
4   1002       CA  4/1                 0        NaN      NaN     0
5   1002       CA  6/1                 0        NaN      NaN     0
6   1003       NJ  1/1                 0        1/1    12/31     0
7   1003       NJ  4/1                 0        NaN      NaN     0
8   1003       NC  6/1                 1        NaN      NaN     1
9   1004       PA  1/1                 1        1/1    12/31     0
10  1005       UT  1/1                 1        1/1    12/31     0
11  1006       AZ  1/1                 1        NaN      NaN     1
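
In the output above, ids with a single record (1004 to 1006) are marked as changed. If the first record of an id should not count as a move, one possible variant is sketched below. It assumes that the rows of records_A are already in chronological order within each id (the dates are plain strings here), and the name `checked` is made up for this illustration.

```python
import pandas as pd

records_A = pd.DataFrame({
    'id': [1001,1001,1001,1002,1002,1002,1003,1003,1003,1004,1005,1006],
    'location': ['NJ','OH','OH','CA','CA','CA','NJ','NJ','NC','PA','UT','AZ'],
    'date': ['1/1','4/1','6/1','1/1','4/1','6/1','1/1','4/1','6/1','1/1','1/1','1/1'],
})
records_B = pd.DataFrame({
    'id': [1001,1001,1002,1003,1004,1005],
    'plan start': ['1/1','4/1','1/1','1/1','1/1','1/1'],
    'plan end': ['3/31','12/31','12/31','12/31','12/31','12/31'],
})

# Previous location within the same id. The first row of each id has no
# predecessor, so shift() leaves NaN there.
prev_location = records_A.groupby('id')['location'].shift()

# Assumption: a "change" needs a predecessor that differs from the current
# location, so the first record of an id is never counted as a move.
records_A['location_changed'] = prev_location.notna() & prev_location.ne(records_A['location'])

# Left join on (id, date) == (id, plan start) keeps the row order of records_A;
# 'plan start' is NaN wherever no plan starts on the record's date.
checked = records_A.merge(
    records_B[['id', 'plan start']],
    left_on=['id', 'date'],
    right_on=['id', 'plan start'],
    how='left',
)

# Flag moves that have no plan starting on the same date.
checked['flag'] = checked['location_changed'] & checked['plan start'].isna()
print(checked[checked['flag']])
```

With the sample data this flags only the 6/1 move of id 1003. An id with no plan at all, such as 1006, is not flagged by this variant and would need a separate check, for example testing whether the id appears in records_B.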