Expanding hierarchical data, creating new rows based on list items in a column

How can I expand a DataFrame based on the criteria below, where groups can be nested?

Name                    Job        Group
[Matt,Adam,John,James]  Peon       Workers
[Sam,Andrew,John]       Boss       Leader
[Leader,Ian]            Owner      Owner

How can I get the expected output shown below?

Expected output:

Name    Job       Group
Matt    Peon      Workers
Adam    Peon      Workers
John    Peon      Workers
James   Peon      Workers
Sam     Boss      Leader
Andrew  Boss      Leader
John    Boss      Leader
Sam     Owner     Owner
Andrew  Owner     Owner
John    Owner     Owner
Ian     Owner     Owner

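My current approach (which does not fully work) extracts all users, but it does not recognize members that are themselves group names, so it does not create a new entry for each member of the referenced group.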

groups.members.apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True).to_frame('members').join(groups[['Job', 'Group']], how='left')

I'm not sure this can be done entirely in pandas. I processed the relevant data outside pandas and joined it back afterwards.

import pandas as pd

groups = pd.DataFrame({'Name': [['Matt','Adam','John','James'], ['Sam','Andrew','John', 'Boss']], 'Job': ['Peon', 'Owner'], 'Group': ['Boss', 'Leader']})

# Build a list of tuples with row to draw group and job from and name
x = [(idx, i) for idx, j in enumerate(groups['Name']) for i in j]

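# Search the list for group names; when one is found, replace it with the
# members of the row that defines that group. A worklist avoids mutating
# the list while iterating over it and also resolves groups nested more than
# one level deep (a cyclic group definition would loop forever).
group_names = set(groups.Group)
pending, x = x, []
while pending:
    i, j = pending.pop(0)
    if j in group_names:
        for members in groups['Name'][groups.Group == j]:
            pending.extend((i, n) for n in members)
    else:
        x.append((i, j))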

# Create new DataFrame
idx, names = zip(*x)
z = pd.DataFrame(list(names), index=list(idx))

# Join on the old one
groups = groups.drop('Name', axis=1).join(z)
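
The nested-group resolution can also be sketched with a plain dictionary lookup before the frame is rebuilt. This is only a minimal sketch using the three-row frame from the question. It assumes each group name appears once in the Group column and that no person shares a name with a group; the helper name `expand_members` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': [['Matt', 'Adam', 'John', 'James'],
             ['Sam', 'Andrew', 'John'],
             ['Leader', 'Ian']],
    'Job': ['Peon', 'Boss', 'Owner'],
    'Group': ['Workers', 'Leader', 'Owner'],
})

# Map each group name to its raw member list (assumes unique group names).
members_of = dict(zip(df['Group'], df['Name']))


def expand_members(names, own_group, seen=None):
    """Hypothetical helper: replace group names by their members, recursively."""
    # `seen` tracks groups already being expanded, so cycles cannot recurse forever.
    seen = (seen or set()) | {own_group}
    out = []
    for name in names:
        if name in members_of and name not in seen:
            out.extend(expand_members(members_of[name], name, seen))
        else:
            out.append(name)
    return out


# One output row per resolved member, keeping the Job and Group of the source row.
rows = [
    (name, job, group)
    for names, job, group in zip(df['Name'], df['Job'], df['Group'])
    for name in expand_members(names, group)
]
result = pd.DataFrame(rows, columns=['Name', 'Job', 'Group'])
print(result)
```

With the question's data, the `Leader` entry in the Owner row is replaced by Sam, Andrew and John, which is the behaviour the expected output asks for.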

Try this (with your DataFrame named df):

a = pd.DataFrame.from_records(df.Name.tolist()).stack().reset_index(level=1, drop=True).rename('Name')
df.drop('Name', axis=1).join(a).reset_index(drop=True)[['Name', 'Job', 'Group']]

pandas

df.set_index(
    ['Group', 'Job']
).Name.apply(pd.Series).stack().reset_index([0, 1], name='Name')

     Group    Job    Name
0  Workers   Peon    Matt
1  Workers   Peon    Adam
2  Workers   Peon    John
3  Workers   Peon   James
0   Leader   Boss     Sam
1   Leader   Boss  Andrew
2   Leader   Boss    John
0    Owner  Owner  Leader
1    Owner  Owner     Ian
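
On pandas 0.25 or newer, `DataFrame.explode` should give the same flattening in a single call. A minimal sketch, assuming the three-row frame from the question; like the output above, it leaves `Leader` unresolved.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': [['Matt', 'Adam', 'John', 'James'],
             ['Sam', 'Andrew', 'John'],
             ['Leader', 'Ian']],
    'Job': ['Peon', 'Boss', 'Owner'],
    'Group': ['Workers', 'Leader', 'Owner'],
})

# explode() emits one row per list element and repeats the original index label.
flat = df.explode('Name')

# Renumber the rows 0..n-1 once the original labels are no longer needed.
flat = flat.reset_index(drop=True)
print(flat)
```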

numpy

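import numpy as np
import pandas as pd

name = df.Name.values.tolist()
i = np.arange(len(df)).repeat([len(l) for l in name])
# Pass axis by keyword; the positional form was removed in pandas 2.0.
rest = df.drop('Name', axis=1)

# Label the columns explicitly so they match the stacked data in any column order.
pd.DataFrame(
    np.hstack([np.concatenate(name)[:, None], rest.values[i]]),
    df.index[i], ['Name'] + list(rest.columns))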

Naive timing

Another numpy solution:

from itertools import chain

import numpy as np
import pandas as pd

lens = df.Name.str.len()
df1 = pd.DataFrame({
    "Group": np.repeat(df.Group.values, lens),
    "Job": np.repeat(df.Job.values, lens),
    "Name": list(chain.from_iterable(df.Name))})
print(df1)
     Group    Job    Name
0  Workers   Peon    Matt
1  Workers   Peon    Adam
2  Workers   Peon    John
3  Workers   Peon   James
4   Leader   Boss     Sam
5   Leader   Boss  Andrew
6   Leader   Boss    John
7    Owner  Owner  Leader
8    Owner  Owner     Ian     
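
The key step in this approach is `np.repeat` with a per-row count: each scalar value is repeated as many times as its row's list is long, so it lines up with the flattened names. A small standalone illustration, with the lengths written out by hand rather than taken from `df.Name.str.len()`:

```python
from itertools import chain

import numpy as np

jobs = np.array(['Peon', 'Boss', 'Owner'])
names = [['Matt', 'Adam', 'John', 'James'],
         ['Sam', 'Andrew', 'John'],
         ['Leader', 'Ian']]

# Number of members per row, i.e. how often each job must be repeated.
lens = [len(members) for members in names]

repeated_jobs = np.repeat(jobs, lens)
flat_names = list(chain.from_iterable(names))

# Both sequences now have one entry per member and can be zipped row by row.
for name, job in zip(flat_names, repeated_jobs):
    print(name, job)
```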

Timings, comparing only the fastest numpy solutions:

import random
import string
from itertools import chain

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L1 = ['Peon','Boss','Owner']
L2 = ['Workers','Leader','Owner']
Jobs = np.random.choice(L1, N)
Groups = np.random.choice(L2, N)
Name = [list(string.ascii_letters[:random.randint(3, 10)]) for _ in range(N)]
df = pd.DataFrame({"Job":Jobs,"Group":Groups, "Name":Name})
#[100000 rows x 3 columns]
#print (df)
def jez(df):
    lens = df.Name.str.len()
    return pd.DataFrame({
            "Job": np.repeat(df.Job.values, lens),
            "Group": np.repeat(df.Group.values, lens),
            "Name": list(chain.from_iterable(df.Name))})

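def pir(df):
    name = df.Name.values.tolist()
    i = np.arange(len(df)).repeat([len(l) for l in name])
    # Pass axis by keyword; the positional form was removed in pandas 2.0.
    rest = df.drop('Name', axis=1)

    # Label the columns explicitly so they match the stacked data.
    return pd.DataFrame(
        np.hstack([np.concatenate(name)[:, None], rest.values[i]]),
        df.index[i], ['Name'] + list(rest.columns))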

print (pir(df))
print (jez(df))

%timeit (pir(df))
1 loop, best of 3: 267 ms per loop

%timeit (jez(df))
10 loops, best of 3: 94 ms per loop
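
`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. Outside of those, the standard-library `timeit` module can be used for a rough comparison. The sketch below is an assumption-laden stand-in, not a reproduction of the numbers above: it times only `jez` on a smaller random frame (N = 1000), and the absolute figures will depend on the machine and the pandas version.

```python
import random
import string
import timeit
from itertools import chain

import numpy as np
import pandas as pd


def jez(df):
    lens = df.Name.str.len()
    return pd.DataFrame({
        "Job": np.repeat(df.Job.values, lens),
        "Group": np.repeat(df.Group.values, lens),
        "Name": list(chain.from_iterable(df.Name))})


# Smaller test frame than in the answer, so the script finishes quickly.
random.seed(0)
rng = np.random.default_rng(123)
N = 1000
df = pd.DataFrame({
    "Job": rng.choice(['Peon', 'Boss', 'Owner'], N),
    "Group": rng.choice(['Workers', 'Leader', 'Owner'], N),
    "Name": [list(string.ascii_letters[:random.randint(3, 10)]) for _ in range(N)],
})

# Best of 3 runs, 10 calls per run, reported per call.
calls = 10
best = min(timeit.repeat(lambda: jez(df), number=calls, repeat=3)) / calls
print(f"jez: {best * 1e3:.2f} ms per call")
```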