Interaction between rows and columns in dataframe

I have a dataframe:

import pandas as pd

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],
    'student': [
        'john', 'ted', 'jason', 'marc', 'peter',
        'bob', 'robert', 'david', 'nik', 'kevin'
    ]
})

print(df)

              exam   student   
    0       French    john     
    1       English   ted        
    2       German    jason         
    3       Russian   marc         
    4       Russian   peter         
    5       German    bob         
    6       German    robert         
    7       French    david         
    8       English   nik          
    9       French    kevin         

Does anyone know how to create a new dataframe with two columns, "student" and "student shared exam with"?

I should get something like this:

                student   shared_exam_with      
        0       john       david                   
        1       john       kevin            
        2       ted        nik                    
        3       jason      bob                 
        4       jason      robert                   
        5       marc       peter              
        6       peter      marc             
        7       bob        jason                    
        8       bob        robert                    
        9       robert     jason                      
       10       robert     bob                   
       11       david      john             
       12       david      kevin                      
       13       nik        ted                     
       14       kevin      john                     
       15       kevin      david                   

For example: John took French, and so did David and Kevin!

One way would be:

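import numpy as np

cross = pd.crosstab(df['student'], df['exam'])  # student x exam indicator table
res = cross.dot(cross.T)  # student x student co-occurrence counts
res.where(np.triu(res, k=1).astype('bool')).stack()  # keep each pair only once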
Out: 
student  student
bob      jason      1.0
         robert     1.0
david    john       1.0
         kevin      1.0
jason    robert     1.0
john     kevin      1.0
marc     peter      1.0
nik      ted        1.0
dtype: float64

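The dot product produces a binary co-occurrence matrix: an entry is 1 when the two students took the same exam. To avoid listing the same pair twice, I keep only the upper triangle with where and then stack the result. The index of the resulting Series holds the pairs of students who took the same exam.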
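If you want the two-column layout from the question (each pair in both directions) rather than the de-duplicated Series, a minimal sketch along the same lines could read the non-zero off-diagonal entries of the co-occurrence matrix directly. The names co, values and pairs are only illustrative, and the column name shared_exam_with is taken from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter',
                'bob', 'robert', 'david', 'nik', 'kevin'],
})

# Indicator table: one row per student, one column per exam.
cross = pd.crosstab(df['student'], df['exam'])

# co.loc[a, b] counts the exams that students a and b have in common.
co = cross.dot(cross.T)

# Zero the diagonal so that nobody is paired with themselves.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)

# Every remaining non-zero cell is an ordered pair of students.
rows, cols = np.nonzero(values)
names = co.index.to_numpy()
pairs = pd.DataFrame({
    'student': names[rows],
    'shared_exam_with': names[cols],
})
print(pairs)
```

The rows come out sorted alphabetically by student, because crosstab sorts its index, so the order differs from the one shown in the question.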

Self merge

df.merge(
    df, on='exam',
    suffixes=['', '_shared_with']
).query('student != student_shared_with')

       exam student student_shared_with
1    French    john               david
2    French    john               kevin
3    French   david                john
5    French   david               kevin
6    French   kevin                john
7    French   kevin               david
10  English     ted                 nik
11  English     nik                 ted
14   German   jason                 bob
15   German   jason              robert
16   German     bob               jason
18   German     bob              robert
19   German  robert               jason
20   German  robert                 bob
23  Russian    marc               peter
24  Russian   peter                marc
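The result above still carries the exam column, and its rows follow the merge rather than the original student order. As a rough sketch, assuming you want exactly the layout from the question, you could carry each row's original position through the merge and sort on it. The helper columns pos and pos_other are made up for this example; comparing positions instead of names also keeps two different students who happen to share a name:

```python
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter',
                'bob', 'robert', 'david', 'nik', 'kevin'],
})

# Keep each row's original position so the order can be restored later.
left = df.reset_index().rename(columns={'index': 'pos'})
right = left.rename(columns={'student': 'shared_exam_with',
                             'pos': 'pos_other'})

out = (
    left.merge(right, on='exam')            # all pairs within each exam
        .query('pos != pos_other')          # drop self-pairs
        .sort_values(['pos', 'pos_other'])  # original student order
        [['student', 'shared_exam_with']]
        .reset_index(drop=True)
)
print(out)
```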

Self join

d1 = df.set_index('exam')
d1.join(
    d1, rsuffix='_shared_with'
).query('student != student_shared_with')

        student student_shared_with
exam                               
English     ted                 nik
English     nik                 ted
French     john               david
French     john               kevin
French    david                john
French    david               kevin
French    kevin                john
French    kevin               david
German    jason                 bob
German    jason              robert
German      bob               jason
German      bob              robert
German   robert               jason
German   robert                 bob
Russian    marc               peter
Russian   peter                marc

itertools.permutations + groupby

from itertools import permutations as perm

cols = ['student', 'student_shared_with']
df.groupby('exam').student.apply(
    lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)

   student student_shared_with
0      ted                 nik
1      nik                 ted
2     john               david
3     john               kevin
4    david                john
5    david               kevin
6    kevin                john
7    kevin               david
8    jason                 bob
9    jason              robert
10     bob               jason
11     bob              robert
12  robert               jason
13  robert                 bob
14    marc               peter
15   peter                marc
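A variation on the same idea, sketched here as an alternative rather than a correction, is to collect the permutations in a plain list and build a single DataFrame at the end, which avoids creating one small DataFrame per exam inside apply. The names pairs and result are only illustrative:

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter',
                'bob', 'robert', 'david', 'nik', 'kevin'],
})

# For each exam, emit every ordered pair of distinct students.
pairs = [
    pair
    for _, students in df.groupby('exam')['student']
    for pair in permutations(students, 2)
]

result = pd.DataFrame(pairs, columns=['student', 'student_shared_with'])
print(result)
```

Because permutations works on positions rather than values, a student is never paired with their own row, so no extra filter is needed.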

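In SQL this would be a one-step process, but here it takes two steps: (1) merge the DataFrame with itself (on exam), and (2) drop the rows where student == student_shared_with (because a student does not share an exam with themselves).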

df2 = pd.merge(
    df, df, how='outer', on='exam', suffixes=['', '_shared_with']).drop('exam', axis=1)
df2 = df2.loc[df2.student != df2.student_shared_with]

   student student_shared_with
1     john               david
2     john               kevin
3    david                john
5    david               kevin
6    kevin                john
7    kevin               david
10     ted                 nik
11     nik                 ted
14   jason                 bob
15   jason              robert
16     bob               jason
18     bob              robert
19  robert               jason
20  robert                 bob
23    marc               peter
24   peter                marc
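For comparison, the single SQL step mentioned above might look roughly like the following, run here through Python's built-in sqlite3 module. The table name exams and the alias shared_exam_with are assumptions made for this sketch, not something from the original data:

```python
import sqlite3

rows = [
    ('French', 'john'), ('English', 'ted'), ('German', 'jason'),
    ('Russian', 'marc'), ('Russian', 'peter'), ('German', 'bob'),
    ('German', 'robert'), ('French', 'david'), ('English', 'nik'),
    ('French', 'kevin'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE exams (exam TEXT, student TEXT)')
conn.executemany('INSERT INTO exams VALUES (?, ?)', rows)

# Self join on exam; the inequality drops the self-pairs in the same step.
query = """
SELECT a.student, b.student AS shared_exam_with
FROM exams AS a
JOIN exams AS b
  ON a.exam = b.exam AND a.student <> b.student
ORDER BY a.rowid, b.rowid
"""
pairs = conn.execute(query).fetchall()
conn.close()

for student, other in pairs:
    print(student, other)
```

The join condition does the work of both pandas steps at once, which is why the SQL version needs no separate filtering pass.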