pandas: find and keep consecutive rows to create panel data
I have a DataFrame that looks like this:
import pandas as pd
df = {'time': [1999,2001,2002,2003,2007,1999,2000,2001,2003,2004],
'id':['A','A','A','A','A','B','B','B','B','B'],
'value':[0.1,0.1,0.1,0.1,0.6,0.2,0.2,0.2,0.2,0.2]}
df = pd.DataFrame(df)
I want to build a panel data set at the id-time level from it, which means I want something like this:
   time id  value
0  2001  A    0.1
1  2002  A    0.1
2  2003  A    0.1
3  1999  B    0.2
4  2000  B    0.2
5  2001  B    0.2
For each id, only the consecutive rows remain. I can do this in a few lines of R:
df<-df %>%
mutate(time = as.integer(time)) %>%
group_by(gvkey, grp = cumsum(c(1, diff(time) != 1))) %>%
filter(n() >= consec_obs)
df<-df[,setdiff(colnames(df),c('grp'))]
where consec_obs is the minimum number of consecutive rows to keep.
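I searched for a while without finding a solution, which surprised me a little, since this is a basic operation in econometric analysis. Does anyone know how to do this in Python?
Mimicking the R solution, I came up with a Python version on Sunday night. Here it is: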
# flag rows whose gap to the previous row within each id is not exactly 1 (i.e. not consecutive)
df['diff'] = df.groupby('id')['time'].diff() != 1
# the cumulative sum of the flags labels each run of consecutive rows within an id
df['cusm'] = df.groupby('id')['diff'].cumsum()
# group by 'id' and 'cusm', then keep the rows whose run meets the minimum length (3 here)
df.loc[df.groupby(['id','cusm']).transform('count')['diff'] >= 3].drop(['diff','cusm'], axis=1)
If this looks hard to follow, try running the code line by line and it will make sense.
Can the first two lines be merged into one?
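On merging the two steps: the following is only a sketch, not the answerer's own code. It computes the run labels in a single chained expression without helper columns, and takes the minimum run length as a parameter in the spirit of consec_obs. The function name `keep_consecutive_runs` and its parameters `id_col`, `time_col` and `min_run` are hypothetical names chosen for illustration.

```python
import pandas as pd

def keep_consecutive_runs(df, id_col="id", time_col="time", min_run=3):
    """Keep rows that belong to a run of at least `min_run` consecutive time values per id."""
    # Sort so that each id's rows are contiguous and ordered in time.
    df = df.sort_values([id_col, time_col])
    # One expression: a row starts a new run when its gap to the previous row of the
    # same id is not exactly 1 (the first row of each id has a NaN gap, so it also
    # starts a run). A global cumulative sum then gives every run its own label.
    run_id = df.groupby(id_col)[time_col].diff().ne(1).cumsum()
    # Length of the run each row belongs to.
    run_len = run_id.groupby(run_id).transform("size")
    return df[(run_len >= min_run).to_numpy()]

df = pd.DataFrame({
    "time": [1999, 2001, 2002, 2003, 2007, 1999, 2000, 2001, 2003, 2004],
    "id": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "value": [0.1, 0.1, 0.1, 0.1, 0.6, 0.2, 0.2, 0.2, 0.2, 0.2],
})
result = keep_consecutive_runs(df, min_run=3)
print(result)
```

Because the first row of every id always opens a new run, the labels from the global cumulative sum never collide across ids, so grouping by the label alone is enough here.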
Hope this helps. I will try to explain each line as I go.
Import these packages.
from itertools import groupby
import numpy as np
import pandas as pd
Your data frame looks like this:
>>>df = {'time': [1999,2001,2002,2003,2007,1999,2000,2001,2003,2004],
'id':['A','A','A','A','A','B','B','B','B','B'],
'value':[0.1,0.1,0.1,0.1,0.6,0.2,0.2,0.2,0.2,0.2]}
>>>df = pd.DataFrame(df)
>>>df
id time value
0 A 1999 0.1
1 A 2001 0.1
2 A 2002 0.1
3 A 2003 0.1
4 A 2007 0.6
5 B 1999 0.2
6 B 2000 0.2
7 B 2001 0.2
8 B 2003 0.2
9 B 2004 0.2
Step 1: Find the unique IDs. This is how you do it:
>>>unique = np.unique(df.id.values).tolist()
>>>unique
['A', 'B']
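Step 2: For each ID, create a list of lists (I named it groups). Each inner list contains consecutive numbers. For clarity, I print the contents of groups so you can see how the consecutive numbers are grouped together.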
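To see the grouping trick on its own, here is a small sketch on plain lists, assuming the values are sorted integers. The helper name `consecutive_runs` is made up for this illustration. Inside a run of consecutive integers, position minus value stays constant, so it can serve as the key for itertools.groupby.

```python
from itertools import groupby

def consecutive_runs(values):
    """Split a sorted sequence of integers into lists of consecutive values."""
    runs = []
    # enumerate yields (position, value); position - value is constant within a run
    # of consecutive integers and changes as soon as there is a gap.
    for _, run in groupby(enumerate(values), key=lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in run])
    return runs

runs_a = consecutive_runs([1999, 2001, 2002, 2003, 2007])
runs_b = consecutive_runs([1999, 2000, 2001, 2003, 2004])
print(runs_a)  # [[1999], [2001, 2002, 2003], [2007]]
print(runs_b)  # [[1999, 2000, 2001], [2003, 2004]]
```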
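Step 3: After grouping, build the data frame only from the groups whose length is greater than 2. (I assume 2 because you did not treat B:2003 and B:2004 as a consecutive run.)
Here is how it works: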
# Create an empty data frame. This is where you will keep appending pieces of data frames.
df2 = pd.DataFrame()
# Iterate over your unique IDs, i.e. 'A', 'B'.
for i in unique:
    # Create an empty list called groups. Here you will append lists that contain consecutive numbers.
    groups = []
    # Create a data frame where the ID equals the ID of the current iteration.
    df1 = df.loc[df['id'] == i]
    # The next 2 (nested) for loops fill groups (a list of lists).
    for key, group in groupby(enumerate(df1.time.values), lambda ix: ix[0] - ix[1]):
        list1 = []
        for j in list(group):
            list1.append(j[1])
        groups.append(list1)
    # See how the groups for the current ID look.
    print(groups)
    # Iterate over the created groups. If a group's length is > 2, append its rows to df2 (the empty data frame created earlier).
    for j in groups:
        if len(j) > 2:
            # Concatenate df2 with the matching rows of df.
            df2 = pd.concat([df2, df.loc[(df['time'].isin(j)) & (df['id'] == i)]])
Voilà:
>>> df2
id time value
1 A 2001 0.1
2 A 2002 0.1
3 A 2003 0.1
5 B 1999 0.2
6 B 2000 0.2
7 B 2001 0.2