How to merge different datasets with datetime index?
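I have two datasets (Lots and Measurements). Both have a datetime index, but they differ in length and columns. The first dataset (Lots) is structured as follows:

| Datetime Index | Lot Group | Lot No | Booking Level |
|---|---|---|---|
| 2013-08-03 10:00:00 | 1 | 261291.0 | PROB1H |
| 2013-08-03 12:00:00 | 1 | 261228.0 | PROB1H |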
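The other one (Measurements) is structured as follows:

| Datetime Index | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
|---|---|---|---|---|---|
| 2013-08-28 10:00:00 | 12345 | True | 46.908 | 3.89 | 29.056 |
| 2013-08-03 12:00:00 | 78262 | True | 89.457 | 6.88 | 34.918 |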
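What I want to do is merge the two dataframes on the datetime index and get all columns from both. Where the datetime index matches, the MID, Passed?, and Measurement columns should be added to the Lots row. Duplicates (if any) should be kept, and missing values should be kept as NaN. For example:

Suppose the datetime 2013-08-28 10:00:00 does not exist in the Lots dataframe but does exist in the Measurements dataframe. The result would be:

| Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
|---|---|---|---|---|---|---|---|---|
| 2013-08-28 10:00:00 | NaN | NaN | NaN | 12345 | True | 46.908 | 3.89 | 29.056 |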
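If there is a match at datetime 2013-08-03 12:00:00, the result would be:

| Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
|---|---|---|---|---|---|---|---|---|
| 2013-08-03 12:00:00 | 1 | 261228.0 | PROB1H | 78262 | True | 89.457 | 6.88 | 34.918 |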
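The datetime index of the Lots dataframe contains only unique values, but the Measurements dataframe has duplicate entries, so when a match hits duplicate entries I want to get all of the duplicated rows. For example:

Suppose the datetime 2021-04-15 22:00:00 exists in both dataframes but appears more than once in the Measurements dataframe. The result would be:

| Datetime Index | Lot Group | Lot No | Booking Level | MID | Passed? | Measurement1 | Measurement2 | Measurement3 |
|---|---|---|---|---|---|---|---|---|
| 2021-04-15 22:00:00 | 2 | 311000.0 | PROB2H | 34903 | True | 39 | 67 | 50 |
| 2021-04-15 22:00:00 | 2 | 311000.0 | PROB2H | 34904 | True | 88 | 40.90 | 54.38 |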
I have tried different merges but could not get the result I want. I tried:

    test = lots.merge(measurement, how="right", left_index=True, right_index=True)
    test2 = lots.merge(measurement, how="outer", left_index=True, right_index=True)

What would you suggest I do? Thanks in advance.
You can use `join` as well as `merge`:

    # Dataset Lots
    >>> dfL
                         Lot Group    Lot No  Booking Level
    Datetime Index
    2013-08-03 10:00:00          1  261291.0         PROB1H
    2013-08-03 12:00:00          1  261228.0         PROB1H
    2021-04-15 22:00:00          2  311000.0         PROB2H

    # Dataset Measurements
    >>> dfM
                           MID  Passed?  Measurement1  Measurement2  Measurement3
    Datetime Index
    2013-08-28 10:00:00  12345     True        46.908          3.89        29.056
    2013-08-03 12:00:00  78262     True        89.457          6.88        34.918
    2021-04-15 22:00:00  34903     True        39.000         67.00        50.000
    2021-04-15 22:00:00  34904     True        88.000         40.90        54.380

    # Join version
    >>> dfL.join(dfM, how='outer')
                         Lot Group    Lot No  Booking Level      MID  Passed?  Measurement1  Measurement2  Measurement3
    Datetime Index
    2013-08-03 10:00:00        1.0  261291.0         PROB1H      NaN      NaN           NaN           NaN           NaN
    2013-08-03 12:00:00        1.0  261228.0         PROB1H  78262.0     True        89.457          6.88        34.918
    2013-08-28 10:00:00        NaN       NaN            NaN  12345.0     True        46.908          3.89        29.056
    2021-04-15 22:00:00        2.0  311000.0         PROB2H  34903.0     True        39.000         67.00        50.000
    2021-04-15 22:00:00        2.0  311000.0         PROB2H  34904.0     True        88.000         40.90        54.380

    # Merge version
    >>> dfL.merge(dfM, how='outer', left_index=True, right_index=True)
                         Lot Group    Lot No  Booking Level      MID  Passed?  Measurement1  Measurement2  Measurement3
    Datetime Index
    2013-08-03 10:00:00        1.0  261291.0         PROB1H      NaN      NaN           NaN           NaN           NaN
    2013-08-03 12:00:00        1.0  261228.0         PROB1H  78262.0     True        89.457          6.88        34.918
    2013-08-28 10:00:00        NaN       NaN            NaN  12345.0     True        46.908          3.89        29.056
    2021-04-15 22:00:00        2.0  311000.0         PROB2H  34903.0     True        39.000         67.00        50.000
    2021-04-15 22:00:00        2.0  311000.0         PROB2H  34904.0     True        88.000         40.90        54.380
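
For anyone who wants to reproduce the output above, here is a minimal sketch that rebuilds the two sample frames and runs both variants. The names `dfL` and `dfM` follow the answer. The way the frames are constructed is an assumption, since the original post only shows their printed form.

```python
import pandas as pd

# Sample data copied from the question; the construction itself is an assumption.
dfL = pd.DataFrame(
    {
        "Lot Group": [1, 1, 2],
        "Lot No": [261291.0, 261228.0, 311000.0],
        "Booking Level": ["PROB1H", "PROB1H", "PROB2H"],
    },
    index=pd.DatetimeIndex(
        ["2013-08-03 10:00:00", "2013-08-03 12:00:00", "2021-04-15 22:00:00"],
        name="Datetime Index",
    ),
)

dfM = pd.DataFrame(
    {
        "MID": [12345, 78262, 34903, 34904],
        "Passed?": [True, True, True, True],
        "Measurement1": [46.908, 89.457, 39.0, 88.0],
        "Measurement2": [3.89, 6.88, 67.0, 40.90],
        "Measurement3": [29.056, 34.918, 50.0, 54.38],
    },
    index=pd.DatetimeIndex(
        [
            "2013-08-28 10:00:00",
            "2013-08-03 12:00:00",
            "2021-04-15 22:00:00",  # this timestamp appears twice on purpose
            "2021-04-15 22:00:00",
        ],
        name="Datetime Index",
    ),
)

# Outer join on the index: keeps every timestamp from both frames.
joined = dfL.join(dfM, how="outer")

# Equivalent call spelled with merge.
merged = dfL.merge(dfM, how="outer", left_index=True, right_index=True)

print(joined)
```

Both calls return five rows: the duplicated measurement timestamp yields two rows that repeat the same Lots values, and timestamps present on only one side are filled with NaN. Integer columns such as `Lot Group` and `MID` are shown as floats in the result because NaN cannot be stored in a plain integer column.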

Your `merge` attempt was close.

    test2 = lots.merge(measurements, how='right', on='Datetime Index')
    print(test2)

            Datetime Index  Lot Group    Lot No  Booking Level    MID  Passed?  Measurement1  Measurement2  Measurement3
    0  2013-08-28 10:00:00        NaN       NaN            NaN  12345     True        46.908          3.89        29.056
    1  2013-08-03 12:00:00        1.0  261228.0         PROB1H  78262     True        89.457          6.88        34.918

If you omit `on='Datetime Index'`, this example still works, but it is better to keep it to make the intent explicit. From the `DataFrame.merge` documentation:

> If `on` is `None` and not merging on indexes then [merge] defaults to the intersection of the columns in both DataFrames.
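
The behavior described in that quote can be checked with a small sketch. It assumes that `Datetime Index` is an ordinary column in both frames, as the integer row labels in the printed result suggest. If it is the index instead, `reset_index()` moves it into a column first. The frames below are hypothetical stand-ins built from the question's sample values.

```python
import pandas as pd

# Hypothetical frames with "Datetime Index" as a regular column (assumption).
lots = pd.DataFrame(
    {
        "Datetime Index": pd.to_datetime(
            ["2013-08-03 10:00:00", "2013-08-03 12:00:00"]
        ),
        "Lot Group": [1, 1],
        "Lot No": [261291.0, 261228.0],
        "Booking Level": ["PROB1H", "PROB1H"],
    }
)
measurements = pd.DataFrame(
    {
        "Datetime Index": pd.to_datetime(
            ["2013-08-28 10:00:00", "2013-08-03 12:00:00"]
        ),
        "MID": [12345, 78262],
        "Passed?": [True, True],
        "Measurement1": [46.908, 89.457],
        "Measurement2": [3.89, 6.88],
        "Measurement3": [29.056, 34.918],
    }
)

# Explicit key: states the intent directly.
explicit = lots.merge(measurements, how="right", on="Datetime Index")

# No key given: merge falls back to the columns the two frames share,
# which here is only "Datetime Index".
implicit = lots.merge(measurements, how="right")

# how="right" keeps only the measurement timestamps; how="outer" also keeps
# lots that have no measurement. indicator=True adds a "_merge" column that
# records which side each row came from.
outer = lots.merge(measurements, how="outer", on="Datetime Index", indicator=True)

print(explicit.equals(implicit))
print(outer[["Datetime Index", "_merge"]])
```

Note that `how='right'` drops the 2013-08-03 10:00:00 lot because it has no measurement. To keep unmatched rows from both sides, as the question asks, `how='outer'` is the variant to use.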