Creating new columns with Pandas df.apply
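I have a huge NetFlow database (it contains the timestamp, source IP, destination IP, protocol, source and destination port numbers, packets exchanged, bytes, and more). I want to create custom attributes based on the current row and the rows before it.
I want to compute new columns from the source IP and timestamp of the current row. Logically, this is what I want to do:
- Get the source IP of the current row.
- Get the timestamp of the current row.
- Based on the source IP and timestamp, get all previous rows of the whole dataframe that have the same source IP and where the communication happened within the last half hour. This is important.
- For the rows (flows, in my example) that meet both criteria (same source IP, within the last half hour), compute the sum and the mean of all packets and all bytes.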
[Image: one row from the dataset]
Relevant code snippet:
import pandas as pd

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])
df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(ts, sa, size):
    global joined
    for (x, y) in zip(df['sa'], df['ts']):
        ...
    return sum

df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'], x['sa'], x['pkt']), axis=1)
I know there are probably better, more efficient ways to do this, but sadly I'm not the best programmer.
Thanks.
For the record:
from datetime import timedelta

def fun(df, i):
    # Current timestamp
    current = df.loc[i, 'ts']
    # Start of the 30-minute window before the current timestamp
    last = current - timedelta(minutes=30)
    # Current source IP
    ip = df.loc[i, 'sa']
    # Rows from the same source IP with a timestamp in [last, current)
    adf = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == ip)]
    # Return the sum and mean of the packet counts
    return adf['pkt'].sum(), adf['pkt'].mean()

# Apply fun to each row
result = [fun(df, i) for i in df.index]
# Create the new columns
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
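The question asks for both packets and bytes, and for a df.apply-based solution. A minimal sketch of that variant follows, assuming the same half-open window (current time minus 30 minutes up to, but not including, the current time). The toy data, the name prev_30_stats, and the prev30_* column names are made up for illustration and are not from the original post.

```python
import pandas as pd

# Toy data for illustration only; the values are invented.
flows = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
        "2020-01-01 00:30", "2020-01-01 00:15",
    ]),
    "sa": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "pkt": [1, 2, 5, 4, 7],
    "byt": [100, 200, 500, 400, 700],
})

WINDOW = pd.Timedelta(minutes=30)

def prev_30_stats(row, frame):
    """Sum and mean of pkt and byt over earlier flows from the same source IP."""
    mask = (
        (frame["sa"] == row["sa"])
        & (frame["ts"] >= row["ts"] - WINDOW)
        & (frame["ts"] < row["ts"])
    )
    prev = frame.loc[mask]
    # Returning a Series makes apply(axis=1) produce one column per key.
    return pd.Series({
        "prev30_pkt_sum": prev["pkt"].sum(),
        "prev30_pkt_mean": prev["pkt"].mean(),
        "prev30_byt_sum": prev["byt"].sum(),
        "prev30_byt_mean": prev["byt"].mean(),
    })

# Extra positional arguments for the function are passed through args.
stats = flows.apply(prev_30_stats, axis=1, args=(flows,))
flows_with_stats = flows.join(stats)
```

This still scans the whole frame once per row, so on a huge NetFlow table it will be as slow as the list comprehension; it mainly shows how one apply call can fill several new columns.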
from datetime import timedelta

import pandas as pd

df = pd.read_csv(path, header=None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])
df['ts'] = pd.to_datetime(df['ts'])

def prev_30_ip_sum(df, i):
    # Timestamp of the current row
    current = df.loc[i, 'ts']
    # Start of the 30-minute window before the current timestamp
    last = current - timedelta(minutes=30)
    # Source address of the current row
    sa = df.loc[i, 'sa']
    # Rows with the same source address and a timestamp in [last, current)
    new_df = df[(last <= df['ts']) & (current > df['ts']) & (df['sa'] == sa)]
    # Return the sum and mean of the packet counts
    return new_df['pkt'].sum(), new_df['pkt'].mean()

# Compute the statistics for each row from its source address and timestamp
result = [prev_30_ip_sum(df, i) for i in df.index]
# Create the new columns in the current dataframe
df['sum'] = [i[0] for i in result]
df['mean'] = [i[1] for i in result]
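Filtering the whole dataframe once per row gets slow on a huge table. As a possible faster alternative (not part of the original answer), pandas time-based rolling windows can compute the same statistics per source IP. The sketch below assumes a unique dataframe index and no missing timestamps; prev_window_stats and the toy data are hypothetical. How rows with identical timestamps for the same IP are treated may depend on the pandas version, so it is worth comparing against the row-wise version on a sample first.

```python
import pandas as pd

def prev_window_stats(frame, value_col, window="30min"):
    """Per-row sum and mean of value_col over the same source IP's earlier
    flows in the half-open window [ts - window, ts)."""
    out = pd.DataFrame(index=frame.index, columns=["sum", "mean"], dtype=float)
    # Time-based rolling needs ascending timestamps within each group.
    for _, grp in frame.sort_values("ts").groupby("sa"):
        series = pd.Series(grp[value_col].to_numpy(),
                           index=pd.DatetimeIndex(grp["ts"]))
        # closed="left" includes the window start and excludes the current row.
        rolled = series.rolling(window, closed="left")
        # An empty window gives NaN; use 0 for the sum to match Series.sum().
        out.loc[grp.index, "sum"] = rolled.sum().fillna(0).to_numpy()
        out.loc[grp.index, "mean"] = rolled.mean().to_numpy()
    return out

# Toy data for illustration only; the values are invented.
toy = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 00:10", "2020-01-01 00:20",
        "2020-01-01 00:30", "2020-01-01 00:15",
    ]),
    "sa": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "pkt": [1, 2, 5, 4, 7],
})
pkt_stats = prev_window_stats(toy, "pkt")
toy["sum"] = pkt_stats["sum"]
toy["mean"] = pkt_stats["mean"]
```

Calling the function a second time with "byt" would give the byte statistics. The Python loop here runs once per distinct source IP, not once per row, which is usually far fewer iterations.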