How to select rows in a DataFrame based on rows in another DataFrame using Python
I have two DataFrames. df1 looks like this:
id year CalendarWeek DayName interval counts
1 2014 1 sun 10:30 3
1 2014 1 sun 11:30 4
1 2014 2 wed 12:00 5
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 3
2 2014 1 sun 14:30 1
2 2014 1 mon 10:30 2
2 2014 2 wed 14:00 3
2 2014 2 fri 15:00 5
3 2014 1 thu 16:30 2
3 2014 1 thu 17:00 1
3 2014 2 sat 12:00 2
3 2014 2 sat 13:30 3
df2 looks like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:00 2
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 10:30 6
1 2014 2 wed 12:00 3
1 2014 2 fri 8:30 1
1 2014 2 fri 9:00 2
2 2014 1 sun 12:30 3
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 9:00 35
2 2014 1 mon 10:30 1
2 2014 2 wed 12:30 23
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 14:30 1
3 2014 1 thu 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
3 2014 2 sat 14:00 2
I want to get all rows in df2 whose id, year, CalendarWeek, DayName and interval values match a row in df1.
The result I want should look like this:
id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 12:00 3
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 10:30 1
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
In Python, how can I select these specific rows from one DataFrame based on the columns of another DataFrame?
Thanks!
Perform a merge and pass the list of columns to the on parameter. The default merge type is 'inner', which keeps only the rows whose key values exist in both DataFrames:
In [2]:
df1.merge(df2, on=['id','year','CalendarWeek','DayName','interval'])
Out[2]:
id year CalendarWeek DayName interval counts NewCounts
0 1 2014 1 sun 10:30 3 4
1 1 2014 1 sun 11:30 4 5
2 1 2014 2 wed 12:00 5 3
3 1 2014 2 fri 9:00 2 2
4 2 2014 1 sun 13:00 3 4
5 2 2014 1 sun 14:30 1 4
6 2 2014 1 mon 10:30 2 1
7 2 2014 2 wed 14:00 3 4
8 2 2014 2 fri 15:00 5 3
9 3 2014 1 thu 16:30 2 34
10 3 2014 1 thu 17:00 1 5
11 3 2014 2 sat 12:00 2 3
12 3 2014 2 sat 13:30 3 4
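The merged output above also carries the counts column, while the question asks only for the rows of df2. A minimal sketch of one way to get exactly that, assuming the two frames are rebuilt from the tables in the question (the names keys and result are illustrative, not from the original answer): merge df2 against only the key columns of df1.

```python
import io
import pandas as pd

keys = ['id', 'year', 'CalendarWeek', 'DayName', 'interval']

# Rebuild the question's frames from whitespace-separated text.
df1 = pd.read_csv(io.StringIO("""id year CalendarWeek DayName interval counts
1 2014 1 sun 10:30 3
1 2014 1 sun 11:30 4
1 2014 2 wed 12:00 5
1 2014 2 fri 9:00 2
2 2014 1 sun 13:00 3
2 2014 1 sun 14:30 1
2 2014 1 mon 10:30 2
2 2014 2 wed 14:00 3
2 2014 2 fri 15:00 5
3 2014 1 thu 16:30 2
3 2014 1 thu 17:00 1
3 2014 2 sat 12:00 2
3 2014 2 sat 13:30 3
"""), sep=r"\s+")

df2 = pd.read_csv(io.StringIO("""id year CalendarWeek DayName interval NewCounts
1 2014 1 sun 10:00 2
1 2014 1 sun 10:30 4
1 2014 1 sun 11:30 5
1 2014 2 wed 10:30 6
1 2014 2 wed 12:00 3
1 2014 2 fri 8:30 1
1 2014 2 fri 9:00 2
2 2014 1 sun 12:30 3
2 2014 1 sun 13:00 4
2 2014 1 sun 14:30 4
2 2014 1 mon 9:00 35
2 2014 1 mon 10:30 1
2 2014 2 wed 12:30 23
2 2014 2 wed 14:00 4
2 2014 2 fri 15:00 3
3 2014 1 thu 14:30 1
3 2014 1 thu 15:00 3
3 2014 1 thu 16:30 34
3 2014 1 thu 17:00 5
3 2014 2 sat 12:00 3
3 2014 2 sat 13:30 4
3 2014 2 sat 14:00 2
"""), sep=r"\s+")

# Use only df1's key columns so that counts does not appear in the result;
# drop_duplicates guards against repeated keys multiplying df2's rows.
result = df2.merge(df1[keys].drop_duplicates(), on=keys)
print(result)
```

Putting df2 on the left keeps its row order and its NewCounts column, so the result has the same shape as the desired output in the question.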
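If 'id' is your index, you must reset the index on both DataFrames so that id becomes a regular column. This is because if you pass the on column list together with left_index=True and right_index=True, the rows end up matched on the index alone, so every df1 row is paired with every df2 row that shares its id (this is the behaviour of the pandas version used here; newer versions may instead reject this combination of arguments with a MergeError):
In [4]:
df1.merge(df2, on=['year','CalendarWeek','DayName','interval'], left_index=True, right_index=True)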
Out[4]:
year CalendarWeek DayName interval counts NewCounts
id
1 2014 1 sun 10:30 3 2
1 2014 1 sun 10:30 3 4
1 2014 1 sun 10:30 3 5
1 2014 1 sun 10:30 3 6
1 2014 1 sun 10:30 3 3
1 2014 1 sun 10:30 3 1
1 2014 1 sun 10:30 3 2
1 2014 1 sun 11:30 4 2
1 2014 1 sun 11:30 4 4
1 2014 1 sun 11:30 4 5
1 2014 1 sun 11:30 4 6
1 2014 1 sun 11:30 4 3
1 2014 1 sun 11:30 4 1
1 2014 1 sun 11:30 4 2
1 2014 2 wed 12:00 5 2
1 2014 2 wed 12:00 5 4
1 2014 2 wed 12:00 5 5
1 2014 2 wed 12:00 5 6
1 2014 2 wed 12:00 5 3
1 2014 2 wed 12:00 5 1
1 2014 2 wed 12:00 5 2
1 2014 2 fri 9:00 2 2
1 2014 2 fri 9:00 2 4
1 2014 2 fri 9:00 2 5
1 2014 2 fri 9:00 2 6
1 2014 2 fri 9:00 2 3
1 2014 2 fri 9:00 2 1
1 2014 2 fri 9:00 2 2
2 2014 1 sun 13:00 3 3
2 2014 1 sun 13:00 3 4
.. ... ... ... ... ... ...
2 2014 2 fri 15:00 5 4
2 2014 2 fri 15:00 5 3
3 2014 1 thu 16:30 2 1
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 34
3 2014 1 thu 16:30 2 5
3 2014 1 thu 16:30 2 3
3 2014 1 thu 16:30 2 4
3 2014 1 thu 16:30 2 2
3 2014 1 thu 17:00 1 1
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 34
3 2014 1 thu 17:00 1 5
3 2014 1 thu 17:00 1 3
3 2014 1 thu 17:00 1 4
3 2014 1 thu 17:00 1 2
3 2014 2 sat 12:00 2 1
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 34
3 2014 2 sat 12:00 2 5
3 2014 2 sat 12:00 2 3
3 2014 2 sat 12:00 2 4
3 2014 2 sat 12:00 2 2
3 2014 2 sat 13:30 3 1
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 34
3 2014 2 sat 13:30 3 5
3 2014 2 sat 13:30 3 3
3 2014 2 sat 13:30 3 4
3 2014 2 sat 13:30 3 2
[96 rows x 6 columns]
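The inflated row count is what a join on the index alone produces: each id contributes (its rows on the left) x (its rows on the right). A small sketch with made-up frames (left, right and their contents are hypothetical, chosen only to keep the example short) shows the same effect without passing on at all:

```python
import pandas as pd

# Hypothetical miniature frames, both indexed by id.
left = pd.DataFrame({
    'id': [1, 1, 2],
    'interval': ['10:30', '11:30', '10:30'],
    'counts': [3, 4, 2],
}).set_index('id')

right = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'interval': ['10:00', '10:30', '11:30', '9:00', '10:30'],
    'NewCounts': [2, 4, 5, 35, 1],
}).set_index('id')

# Joining on the index only pairs every left row with every right row
# that has the same id: 2*3 rows for id 1 plus 1*2 rows for id 2.
by_index = left.merge(right, left_index=True, right_index=True)
print(len(by_index))

# Only the pairs whose intervals agree are the rows actually wanted.
matching = by_index[by_index['interval_x'] == by_index['interval_y']]
print(matching)
```

Because interval exists in both frames and is not a join key here, pandas adds its default _x and _y suffixes, which makes the mismatched pairs easy to see.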
So to reset the index, just run df1 = df1.reset_index() and do the same for df2. After merging you can set the index back to id:
merged = df1.merge(df2, on=['id','year','CalendarWeek','DayName','interval'])
merged = merged.set_index('id')
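A minimal sketch of the whole round trip described above, assuming both frames start out indexed by id. The frames left and right and their values are hypothetical stand-ins for the real data: reset the index on both sides, merge on all five key columns, then restore id as the index with set_index.

```python
import pandas as pd

keys = ['id', 'year', 'CalendarWeek', 'DayName', 'interval']

# Hypothetical miniature frames that start out indexed by id.
left = pd.DataFrame({
    'id': [1, 1, 2],
    'year': [2014, 2014, 2014],
    'CalendarWeek': [1, 1, 1],
    'DayName': ['sun', 'sun', 'mon'],
    'interval': ['10:30', '11:30', '10:30'],
    'counts': [3, 4, 2],
}).set_index('id')

right = pd.DataFrame({
    'id': [1, 1, 1, 2, 2],
    'year': [2014] * 5,
    'CalendarWeek': [1] * 5,
    'DayName': ['sun', 'sun', 'sun', 'mon', 'mon'],
    'interval': ['10:00', '10:30', '11:30', '9:00', '10:30'],
    'NewCounts': [2, 4, 5, 35, 1],
}).set_index('id')

# Turn id back into an ordinary column on both sides, merge on all keys,
# then make id the index of the result again.
merged = (
    left.reset_index()
        .merge(right.reset_index(), on=keys)
        .set_index('id')
)
print(merged)
```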