Pandas: Split data frame based on empty rows
I have the following data frame.
id A B C
1 34353 917998 x
2 34973 980340 x
3 87365 498097 x
4 98309 486547 x
5 87699 475132
6 52734 4298894
7 8749267 4918066 x
8 89872 18103
9 589892 4818086 y
10 765 4063 y
11 32369 418165 y
12 206 2918137
13 554 3918072
14 1029 1918051 x
15 2349243 4918064
For each group of empty rows (e.g. rows 5 and 6), I want to create a new data frame. This should produce multiple data frames, like so:
id A B
5 87699 475132
6 52734 4298894
id A B
8 89872 18103
id A B
12 206 2918137
13 554 3918072
id A B
15 2349243 4918064
Try this:
import numpy as np
import pandas as pd

x = df[pd.isnull(df.C)]
splitter = x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)].index
dfs = np.split(x, splitter)
for x in dfs:
    print(x, '\n')
Output:
In [264]: for x in dfs:
.....: print(x, '\n')
.....:
id A B C
4 5 87699 475132 NaN
5 6 52734 4298894 NaN
id A B C
7 8 89872 18103 NaN
id A B C
11 12 206 2918137 NaN
12 13 554 3918072 NaN
id A B C
14 15 2349243 4918064 NaN
Explanation:
In [267]: x = df[pd.isnull(df.C)]
In [268]: x
Out[268]:
id A B C
4 5 87699 475132 NaN
5 6 52734 4298894 NaN
7 8 89872 18103 NaN
11 12 206 2918137 NaN
12 13 554 3918072 NaN
14 15 2349243 4918064 NaN
In [269]: x['id']
Out[269]:
4 5
5 6
7 8
11 12
12 13
14 15
Name: id, dtype: int64
In [270]: x['id'].diff().fillna(0)
Out[270]:
4 0.0
5 1.0
7 2.0
11 4.0
12 1.0
14 2.0
Name: id, dtype: float64
In [271]: x['id'].diff().fillna(0) > 1
Out[271]:
4 False
5 False
7 True
11 True
12 False
14 True
Name: id, dtype: bool
In [272]: (x['id'].diff().fillna(0) > 1).reset_index(drop=True)
Out[272]:
0 False
1 False
2 True
3 True
4 False
5 True
Name: id, dtype: bool
In [273]: x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)]
Out[273]:
index id A B C
2 7 8 89872 18103 NaN
3 11 12 206 2918137 NaN
5 14 15 2349243 4918064 NaN
In [274]: x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)].index
Out[274]: Int64Index([2, 3, 5], dtype='int64')
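Pulled together, the whole approach runs end to end like this. This is a minimal sketch assuming the sample frame from the question; it uses `sep=r"\s+"` in place of the older `delim_whitespace=True`:

```python
import io
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question
data = io.StringIO("""\
id A B C
1 34353 917998 x
2 34973 980340 x
3 87365 498097 x
4 98309 486547 x
5 87699 475132
6 52734 4298894
7 8749267 4918066 x
8 89872 18103
9 589892 4818086 y
10 765 4063 y
11 32369 418165 y
12 206 2918137
13 554 3918072
14 1029 1918051 x
15 2349243 4918064
""")
df = pd.read_csv(data, sep=r"\s+")

# Keep only the rows where C is missing
x = df[pd.isnull(df.C)]

# A gap of more than 1 in the id sequence marks the start of a new block;
# its positional index becomes a split point for np.split
splitter = x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)].index
dfs = np.split(x, splitter)
for part in dfs:
    print(part, '\n')
```

This yields the four sub-frames shown above: ids [5, 6], [8], [12, 13], and [15].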
Here's a slightly more convoluted, and probably not very fast, solution using itertools.groupby (famously good at clustering contiguous runs of similar values).
from itertools import groupby
import io
import pandas as pd
# Create your sample dataframe
data = io.StringIO("""\
id A B C
1 34353 917998 x
2 34973 980340 x
3 87365 498097 x
4 98309 486547 x
5 87699 475132
6 52734 4298894
7 8749267 4918066 x
8 89872 18103
9 589892 4818086 y
10 765 4063 y
11 32369 418165 y
12 206 2918137
13 554 3918072
14 1029 1918051 x
15 2349243 4918064
""")
df = pd.read_csv(data, delim_whitespace=True)
# Create a boolean column that encodes which rows you want to keep
df['grouper'] = df['C'].notnull()
# Isolate the indices of the rows you want to keep, grouped by contiguity
groups = [list(map(lambda x: x[1]['id'], list(l)))
          for k, l in groupby(df.iterrows(), key=lambda x: x[1]['grouper'])
          if not k]
print(groups) # => [[5, 6], [8], [12, 13], [15]]
# Gather the sub-dataframes whose indices match `groups`
dfs = []
for g in groups:
    dfs.append(df[['A', 'B']][df['id'].isin(g)])
# Inspect what you got
for df in dfs:
    print(df)
Output:
A B
4 87699 475132
5 52734 4298894
A B
7 89872 18103
A B
11 206 2918137
12 554 3918072
A B
14 2349243 4918064
isnull = df.C.isnull()
partitions = (isnull != isnull.shift()).cumsum()
gb = df[isnull].groupby(partitions)
At this point, we have achieved the goal of creating a separate data frame for each contiguous group of NaN rows in df. Each key in gb.groups can be accessed with the gb.get_group() method.
To verify, we'll concatenate them for display.
keys = gb.groups.keys()
dfs = pd.concat([gb.get_group(g) for g in keys], keys=keys)
dfs
Setup for df
I used @Alberto Garcia-Raboso's reader:
import io
import pandas as pd
# Create your sample dataframe
data = io.StringIO("""\
id A B C
1 34353 917998 x
2 34973 980340 x
3 87365 498097 x
4 98309 486547 x
5 87699 475132
6 52734 4298894
7 8749267 4918066 x
8 89872 18103
9 589892 4818086 y
10 765 4063 y
11 32369 418165 y
12 206 2918137
13 554 3918072
14 1029 1918051 x
15 2349243 4918064
""")
df = pd.read_csv(data, delim_whitespace=True)
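Putting the three lines together with the setup above, a runnable sketch under the same assumptions (`sep=r"\s+"` stands in for the older `delim_whitespace=True`):

```python
import io
import pandas as pd

data = io.StringIO("""\
id A B C
1 34353 917998 x
2 34973 980340 x
3 87365 498097 x
4 98309 486547 x
5 87699 475132
6 52734 4298894
7 8749267 4918066 x
8 89872 18103
9 589892 4818086 y
10 765 4063 y
11 32369 418165 y
12 206 2918137
13 554 3918072
14 1029 1918051 x
15 2349243 4918064
""")
df = pd.read_csv(data, sep=r"\s+")

isnull = df.C.isnull()
# Every switch between null and non-null bumps the counter,
# so each contiguous run of rows gets its own label
partitions = (isnull != isnull.shift()).cumsum()
gb = df[isnull].groupby(partitions)

# One sub-frame per contiguous NaN block
blocks = [gb.get_group(g) for g in gb.groups]
for block in blocks:
    print(block, '\n')
```

The `(isnull != isnull.shift()).cumsum()` idiom is a general trick for labeling contiguous runs in a Series; grouping the filtered frame by those labels recovers each NaN block as its own data frame.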