Extracting rows from a dataframe
I have this dataframe:
I want to extract the rows whose client appears in both a Block action and an Allow action, so I want rows 0, 2, 4 and 6.
I can't use a solution based on row indices, because I have millions of rows.
If the action column only contains block and allow values, you can group the dataframe by client and then count the number of unique actions.
For example:
df.groupby("client")["action"].nunique()
If the resulting value is greater than 1, that particular client has both block and allow actions.
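For example, one way to turn that per-client count back into a row selection is with transform (a minimal sketch, assuming the columns are named client and action as above):
mask = df.groupby("client")["action"].transform("nunique") > 1
result = df[mask]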
Using groupby, filter and nunique:
indexes = df.groupby('client')['action'].filter(lambda x: x.nunique() >= 2).index
filtered = df.loc[indexes]
Output:
>>> indexes.tolist()
[0, 2, 4, 6]
>>> filtered
action client
0 block client1
2 allow client1
4 block client8
6 allow client8
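For reference, here is a small self-contained example that reproduces this output. The dataframe below is hypothetical, constructed only so that rows 0, 2, 4 and 6 belong to clients with both actions, consistent with the rows shown above:
import pandas as pd

# Hypothetical data: only client1 and client8 appear with both actions
df = pd.DataFrame({
    'action': ['block', 'block', 'allow', 'allow', 'block', 'block', 'allow', 'allow'],
    'client': ['client1', 'client2', 'client1', 'client3', 'client8', 'client4', 'client8', 'client5'],
})

indexes = df.groupby('client')['action'].filter(lambda x: x.nunique() >= 2).index
filtered = df.loc[indexes]
print(indexes.tolist())  # [0, 2, 4, 6]
print(filtered)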
Here is an answer to your question that relies mainly on Python logic rather than Pandas logic.
It also includes a timeit performance comparison against a primarily Pandas-based approach, which seems to show that the Python logic is more than 50 times faster for the chosen example of over 100,000 rows.
import pandas as pd
import timeit

# Sample data
n = 100000
recordData = [['allow' if i < n // 2 else 'block', 'ip="128.03.03.29"', 'source="29E9t 99 94"', 'destination="12300rtgR30"', 'client' + f'{i}'] for i in range(n)]
nDual = 20000
recordData += [['block'] + recordData[i % n][1:] for i in range(1, 7 * nDual + 1, 7)]
df = pd.DataFrame(data=recordData, columns=['action', 'adresse_ip', 'source_ip', 'destin_ip', 'client'])
print(f"Sample dataframe of length {len(df)}:")
print(df)

def foo(df):
    # Selection
    blocks = {*list(df['client'][df['action'] == 'block'])}
    allows = {*list(df['client'][df['action'] == 'allow'])}
    duals = blocks & allows
    rowsWithDuals = df[df['client'].apply(lambda x: x in duals)]
    # Diagnostics
    print(f"Number of rows for clients with dual actions: {len(rowsWithDuals)}")
    return rowsWithDuals

print("\nPrimarily Python approach:")
t = timeit.timeit(lambda: foo(df), number=1)
print(f"timeit: {t}")

def bar(df):
    indexes = df.groupby('client')['action'].filter(lambda x: x.nunique() >= 2).index
    filtered = df.loc[indexes]
    print(f"Number of rows for clients with dual actions: {len(filtered)}")
    return filtered

print("\nPrimarily Pandas approach:")
t = timeit.timeit(lambda: bar(df), number=1)
print(f"timeit: {t}")
The output is:
Sample dataframe of length 120000:
action adresse_ip source_ip destin_ip client
0 allow ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client0
1 allow ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client1
2 allow ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client2
3 allow ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client3
4 allow ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client4
... ... ... ... ... ...
119995 block ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client39966
119996 block ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client39973
119997 block ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client39980
119998 block ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client39987
119999 block ip="128.03.03.29" source="29E9t 99 94" destination="12300rtgR30" client39994
[120000 rows x 5 columns]
Primarily Python approach:
Number of rows for clients with dual actions: 25714
timeit: 0.04522189999988768
Primarily Pandas approach:
Number of rows for clients with dual actions: 25714
timeit: 3.1578059000021312
This seems to indicate that a primarily Python (rather than Pandas) approach is better suited to large datasets.
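As a side note, the apply-based membership test in foo could probably be replaced by the vectorized Series.isin, which is usually faster still. A sketch of the same idea (not benchmarked here):
# Same selection as foo, but using isin instead of apply for the membership test
blocks = set(df.loc[df['action'] == 'block', 'client'])
allows = set(df.loc[df['action'] == 'allow', 'client'])
duals = blocks & allows
rowsWithDuals = df[df['client'].isin(duals)]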