Expanding hierarchical data, creating new rows based on list items in a column
How can I expand a DataFrame according to the nested-group conditions shown below?
Groups:
Name Job Group
[Matt,Adam,John,James] Peon Workers
[Sam,Andrew,John] Boss Leader
[Leader,Ian] Owner Owner
How can I get the expected output shown below?
Expected output:
Name Job Group
Matt Peon Workers
Adam Peon Workers
John Peon Workers
James Peon Workers
Sam Boss Leader
Andrew Boss Leader
John Boss Leader
Sam Owner Owner
Andrew Owner Owner
John Owner Owner
Ian Owner Owner
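My current approach (which does not fully work) extracts all users, but it does not recognize a member whose name is also a group name, so it does not create a new entry for each member of that group.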
groups.members.apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True).to_frame('members').join(groups[['Job', 'Group']], how='left')
I'm not sure this can be done entirely in pandas. I processed the relevant data externally and joined it back afterwards.
import pandas as pd
groups = pd.DataFrame({'Name': [['Matt','Adam','John','James'], ['Sam','Andrew','John', 'Boss']], 'Job': ['Peon', 'Owner'], 'Group': ['Boss', 'Leader']})
# Build a list of tuples with row to draw group and job from and name
x = [(idx, i) for idx, j in enumerate(groups['Name']) for i in j]
# Search the list for group names, if found resolve group
# names to additional members of row where group was found
for i, j in x:
    if j in set(groups.Group):
        x.remove((i, j))
        for n in list(*list(groups['Name'][groups.Group == j])):
            x.append((i, n))
# Create new DataFrame
idx, names = zip(*x)
z = pd.DataFrame(list(names), index=list(idx))
# Join on the old one
groups = groups.drop('Name', axis=1).join(z)
Try this (with your DataFrame named df):
a = pd.DataFrame.from_records(df.Name.tolist()).stack().reset_index(level=1, drop=True).rename('Name')
df.drop('Name', axis=1).join(a).reset_index(drop=True)[['Name', 'Job', 'Group']]
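For readers on pandas 0.25 or later, the same flattening can be sketched with the built-in `DataFrame.explode`. The frame below is rebuilt from the question's table, so its exact construction is an assumption. Like the answer above, this only splits the lists and does not resolve nested group names.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (construction is assumed).
df = pd.DataFrame({
    'Name': [['Matt', 'Adam', 'John', 'James'],
             ['Sam', 'Andrew', 'John'],
             ['Leader', 'Ian']],
    'Job': ['Peon', 'Boss', 'Owner'],
    'Group': ['Workers', 'Leader', 'Owner'],
})

# explode() emits one row per list element and repeats the other columns.
flat = df.explode('Name').reset_index(drop=True)
print(flat)
```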
pandas
df.set_index(
    ['Group', 'Job']
).Name.apply(pd.Series).stack().reset_index([0, 1], name='Name')
Group Job Name
0 Workers Peon Matt
1 Workers Peon Adam
2 Workers Peon John
3 Workers Peon James
0 Leader Boss Sam
1 Leader Boss Andrew
2 Leader Boss John
0 Owner Owner Leader
1 Owner Owner Ian
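The output above still lists `Leader` as a name in the `Owner` row, whereas the expected output in the question replaces it with the members of the `Leader` group. One possible way to handle that, as a rough sketch only: resolve group names inside each list first, then flatten. The helper `resolve` and the `members` lookup are hypothetical names introduced here, the sketch assumes group names are unique, and it relies on `DataFrame.explode` (pandas 0.25+).

```python
import pandas as pd

# Sample frame rebuilt from the question's table (construction is assumed).
df = pd.DataFrame({
    'Name': [['Matt', 'Adam', 'John', 'James'],
             ['Sam', 'Andrew', 'John'],
             ['Leader', 'Ian']],
    'Job': ['Peon', 'Boss', 'Owner'],
    'Group': ['Workers', 'Leader', 'Owner'],
})

# Hypothetical lookup: group name -> raw member list (assumes unique group names).
members = dict(zip(df['Group'], df['Name']))

def resolve(names, seen):
    """Replace every entry that is itself a group name with that group's members."""
    out = []
    for name in names:
        if name in members and name not in seen:
            # Recurse so that groups nested several levels deep are expanded too.
            out.extend(resolve(members[name], seen | {name}))
        else:
            # Plain member, or a group already being expanded (kept as a
            # literal name to avoid infinite recursion on cycles).
            out.append(name)
    return out

resolved = df.copy()
resolved['Name'] = pd.Series(
    [resolve(names, frozenset({group}))
     for names, group in zip(df['Name'], df['Group'])],
    index=df.index,
)

flat = resolved.explode('Name').reset_index(drop=True)
print(flat)
```

With the sample data, the `Owner` row expands to Sam, Andrew, John and Ian, which matches the expected output in the question.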
numpy
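import numpy as np
name = df.Name.values.tolist()
# Row position of the source row for every flattened name
i = np.arange(len(df)).repeat([len(l) for l in name])
# Note: the column labels assume 'Name' is the first column of df
pd.DataFrame(
    np.hstack([np.concatenate(name)[:, None], df.drop('Name', axis=1).values[i]]),
    df.index[i], df.columns)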
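The index array `i` is what drives this approach: it holds, for each flattened name, the position of the row that name came from, so indexing the remaining columns with it repeats each row the right number of times. A tiny illustration with shortened lists (the lists are made up for brevity):

```python
import numpy as np

# Shortened, made-up lists standing in for the 'Name' column.
name = [['Matt', 'Adam'], ['Sam'], ['Leader', 'Ian', 'John']]

# Repeat each row position once per element of that row's list.
i = np.arange(len(name)).repeat([len(l) for l in name])
print(i)  # [0 0 1 2 2 2]

# Flatten the lists in the same order, so names and positions line up.
flat_names = np.concatenate(name)
print(list(zip(i.tolist(), flat_names.tolist())))
```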
Naive timing
Another numpy solution:
from itertools import chain
lens = df.Name.str.len()
df1 = pd.DataFrame({
    "Job": np.repeat(df.Job.values, lens),
    "Group": np.repeat(df.Group.values, lens),
    "Name": list(chain.from_iterable(df.Name))})
print(df1)
Group Job Name
0 Workers Peon Matt
1 Workers Peon Adam
2 Workers Peon John
3 Workers Peon James
4 Leader Boss Sam
5 Leader Boss Andrew
6 Leader Boss John
7 Owner Owner Leader
8 Owner Owner Ian
Timings, comparing only the fastest numpy solutions:
import random
import string
from itertools import chain
import numpy as np
import pandas as pd
np.random.seed(123)
N = 100000
L1 = ['Peon','Boss','Owner']
L2 = ['Workers','Leader','Owner']
Jobs = np.random.choice(L1, N)
Groups = np.random.choice(L2, N)
Name = [list(tuple(string.ascii_letters[:random.randint(3, 10)])) for _ in range(N)]
df = pd.DataFrame({"Job":Jobs,"Group":Groups, "Name":Name})
#[100000 rows x 3 columns]
#print (df)
def jez(df):
    lens = df.Name.str.len()
    return pd.DataFrame({
        "Job": np.repeat(df.Job.values, lens),
        "Group": np.repeat(df.Group.values, lens),
        "Name": list(chain.from_iterable(df.Name))})
def pir(df):
    name = df.Name.values.tolist()
    i = np.arange(len(df)).repeat([len(l) for l in name])
    return pd.DataFrame(
        np.hstack([np.concatenate(name)[:, None], df.drop('Name', axis=1).values[i]]),
        df.index[i], df.columns)
print(pir(df))
print(jez(df))
%timeit (pir(df))
1 loop, best of 3: 267 ms per loop
%timeit (jez(df))
10 loops, best of 3: 94 ms per loop
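`%timeit` is an IPython magic, so the lines above will not run in a plain Python script. A rough sketch of a comparable measurement with the standard `timeit` module follows. The small frame is rebuilt from the question's table and repeated 1000 times (both are assumptions, not the benchmark data used above), and `explode_builtin` is a hypothetical helper wrapping `DataFrame.explode` (pandas 0.25+) added only for comparison. Absolute numbers will differ by machine and pandas version.

```python
import timeit
from itertools import chain

import numpy as np
import pandas as pd

# Small frame rebuilt from the question's table (construction is assumed).
df = pd.DataFrame({
    'Name': [['Matt', 'Adam', 'John', 'James'],
             ['Sam', 'Andrew', 'John'],
             ['Leader', 'Ian']],
    'Job': ['Peon', 'Boss', 'Owner'],
    'Group': ['Workers', 'Leader', 'Owner'],
})

def jez(df):
    # Repeat the scalar columns by list length and chain the lists together.
    lens = df['Name'].str.len()
    return pd.DataFrame({
        'Job': np.repeat(df['Job'].values, lens),
        'Group': np.repeat(df['Group'].values, lens),
        'Name': list(chain.from_iterable(df['Name'])),
    })

def explode_builtin(df):
    # Hypothetical comparison helper built on DataFrame.explode.
    return df.explode('Name').reset_index(drop=True)

# Repeat the tiny frame so there is something measurable to time.
big = pd.concat([df] * 1000, ignore_index=True)

for fn in (jez, explode_builtin):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: fn(big), number=5, repeat=3)) / 5
    print(f'{fn.__name__}: {seconds * 1e3:.2f} ms per call')
```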