Selecting rows in a DataFrame that are not in a Series
I have a DataFrame called trips that contains the following information:
route_id service_id shape_id trip_id
0 BX12 GH_B6-Weekday BX120805 GH_B6-Weekday-004000_BX12_1
1 BX12 GH_B6-Weekday BX120809 GH_B6-Weekday-009000_BX12_1
2 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-013000_BX12_1
3 BX12 GH_B6-Weekday BX120809 GH_B6-Weekday-017000_BX12_1
4 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-021000_BX12_1
...
I also have a Series called invalidTrips that contains the following information:
trip_id
11760139-BPPB6-BP_B6-Weekday-10 16
11760139-BPPB6-BP_B6-Weekday-10-SDon 16
11760140-BPPB6-BP_B6-Weekday-10 19
11760140-BPPB6-BP_B6-Weekday-10-SDon 19
11760141-BPPB6-BP_B6-Weekday-10 16
...
How do I select all rows in trips whose trip_id does not match any trip_id in invalidTrips?
Edit: So now I have this code:
# Grab the number of trips made outside min and max hour.
tooEarly = stopTimes['arrival_time'] < base_mintime
tooLate = stopTimes['departure_time'] > base_maxtime
invalidTrips = stopTimes[(tooEarly | tooLate)].groupby('trip_id').size()
# Filter out the invalid trips.
print(invalidTrips.size)
print(trips.size)
in_validTrips = ~trips.trip_id.isin(invalidTrips)
validTrips = trips[in_validTrips][['route_id', 'service_id', 'shape_id']]
print(validTrips.size)
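For whatever reason, although invalidTrips.size changes depending on base_mintime and base_maxtime, validTrips.size stays the same, even though I expected it to shrink as invalidTrips.size grows. Why is this?
(For more context, all of this is derived from GTFS data.)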
UPDATE:
Try the isin() function together with the ~ operator.
Following @EdChum's correction in the comments, if invalidTrips is a Series:
trips[~trips.trip_id.isin(invalidTrips.index)]
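The index matters here because groupby('trip_id').size() returns a Series whose index holds the trip IDs and whose values hold the per-trip counts, while isin() on a Series compares against values. The following is only a minimal sketch of that difference; the IDs t1, t2, t3 and x9 and the counts are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical miniature version of the trips table.
trips = pd.DataFrame({
    "route_id": ["BX12"] * 3,
    "trip_id": ["t1", "t2", "t3"],
})

# Shaped like the result of groupby('trip_id').size():
# the index holds the trip IDs, the values hold the counts.
invalidTrips = pd.Series(
    [16, 11], index=pd.Index(["x9", "t2"], name="trip_id")
)

# Passing the Series compares trip_id strings against the counts (16, 11),
# so nothing ever matches and no row is filtered out.
by_values = trips[~trips.trip_id.isin(invalidTrips)]

# Passing the index compares trip_id strings against trip IDs.
by_index = trips[~trips.trip_id.isin(invalidTrips.index)]

print(len(by_values))  # 3
print(len(by_index))   # 2
```

This would explain why the filtered result in the question never changed: the mask was always all True, whatever invalidTrips contained.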
Test:
In [39]: invalidTrips
Out[39]:
trip_id
11760139-BPPB6-BP_B6-Weekday-10 16
11760139-BPPB6-BP_B6-Weekday-10-SDon 16
11760140-BPPB6-BP_B6-Weekday-10 19
11760140-BPPB6-BP_B6-Weekday-10-SDon 19
11760141-BPPB6-BP_B6-Weekday-10 16
GH_B6-Weekday-017000_BX12_1 11 # <-- i've added it intentionally
Name: val, dtype: int64
In [40]: trips
Out[40]:
route_id service_id shape_id trip_id
0 BX12 GH_B6-Weekday BX120805 GH_B6-Weekday-004000_BX12_1
1 BX12 GH_B6-Weekday BX120809 GH_B6-Weekday-009000_BX12_1
2 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-013000_BX12_1
3 BX12 GH_B6-Weekday BX120809 GH_B6-Weekday-017000_BX12_1 # <-- exclude this row
4 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-021000_BX12_1
In [41]: trips[~trips.trip_id.isin(invalidTrips.index)]
Out[41]:
route_id service_id shape_id trip_id
0 BX12 GH_B6-Weekday BX120805 GH_B6-Weekday-004000_BX12_1
1 BX12 GH_B6-Weekday BX120809 GH_B6-Weekday-009000_BX12_1
2 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-013000_BX12_1
4 BX12 GH_B6-Weekday BX120792 GH_B6-Weekday-021000_BX12_1
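To check the behaviour the question expected, the whole pipeline can be run with two different time windows. This is a sketch under assumptions: the stopTimes rows, the short trip IDs, the representation of times as seconds since midnight, and the helper name valid_trips are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the GTFS tables; times are seconds since midnight.
stopTimes = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2", "t3"],
    "arrival_time": [3600, 7200, 1800, 90000],
    "departure_time": [3660, 7260, 1860, 90060],
})
trips = pd.DataFrame({
    "route_id": ["BX12", "BX12", "BX12"],
    "trip_id": ["t1", "t2", "t3"],
})

def valid_trips(base_mintime, base_maxtime):
    """Return the rows of trips that have no stop outside the time window."""
    tooEarly = stopTimes["arrival_time"] < base_mintime
    tooLate = stopTimes["departure_time"] > base_maxtime
    invalidTrips = stopTimes[tooEarly | tooLate].groupby("trip_id").size()
    # Match against the index (the trip IDs), not the values (the counts).
    return trips[~trips["trip_id"].isin(invalidTrips.index)]

print(len(valid_trips(2000, 100000)))  # 2 (only t2 is too early)
print(len(valid_trips(3000, 86400)))   # 1 (t2 too early, t3 too late)
```

Note that the sketch uses len() rather than .size: on a DataFrame, .size is the number of cells (rows times columns), so len() is the more direct way to count rows.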