Interaction between rows and columns in a DataFrame
I have a DataFrame:
import pandas as pd

df = pd.DataFrame({
    'exam': [
        'French', 'English', 'German', 'Russian', 'Russian',
        'German', 'German', 'French', 'English', 'French'
    ],
    'student': [
        'john', 'ted', 'jason', 'marc', 'peter', 'bob',
        'robert', 'david', 'nik', 'kevin'
    ]
})
print(df)
exam student
0 French john
1 English ted
2 German jason
3 Russian marc
4 Russian peter
5 German bob
6 German robert
7 French david
8 English nik
9 French kevin
Does anyone know how to create a new DataFrame with two columns, "student" and "student shared exam with"?
I should get something like this:
student shared_exam_with
0 john david
1 john kevin
2 ted nik
3 jason bob
4 jason robert
5 marc peter
6 peter marc
7 bob jason
8 bob robert
9 robert jason
10 robert bob
11 david john
12 david kevin
13 nik ted
14 kevin john
15 kevin david
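For example: John took French, and so did David and Kevin!
One way is:
import numpy as np

# Student-by-exam indicator table, then student-by-student co-occurrence.
cross = pd.crosstab(df['student'], df['exam'])
res = cross.dot(cross.T)
# Keep only the strict upper triangle so each pair is listed once.
res.where(np.triu(res, k=1).astype(bool)).stack()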
Out:
student student
bob jason 1.0
robert 1.0
david john 1.0
kevin 1.0
jason robert 1.0
john kevin 1.0
marc peter 1.0
nik ted 1.0
dtype: float64
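The dot product produces a co-occurrence matrix (binary here, because each student appears under a single exam). To avoid repeating the same pair twice, I filter with where and stack. The index of the resulting Series holds the pairs of students who took the same exam.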
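The Series above lists each pair once, while the question asks for both directions in two plain columns. Below is a minimal sketch of one way to get there from the same co-occurrence matrix; it is not part of the answer itself, and the names `co`, `values` and `pairs` are only illustrative. It assumes that any non-zero off-diagonal entry means "shared at least one exam".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Student-by-exam indicator table: 1 where the student took the exam.
cross = pd.crosstab(df['student'], df['exam'])

# Entry (a, b) counts the exams that students a and b have in common.
co = cross.dot(cross.T)

# Work on a copy and zero the diagonal so nobody is paired with themselves.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)

# Positions of every ordered pair that shares at least one exam.
rows, cols = np.nonzero(values)
names = co.index.to_numpy()
pairs = pd.DataFrame({
    'student': names[rows],
    'shared_exam_with': names[cols],
})
print(pairs)
```

Because the matrix is symmetric, both (john, david) and (david, john) appear, sorted alphabetically by the first student rather than in the original row order.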
Self merge
df.merge(
    df, on='exam',
    suffixes=['', '_shared_with']
).query('student != student_shared_with')
exam student student_shared_with
1 French john david
2 French john kevin
3 French david john
5 French david kevin
6 French kevin john
7 French kevin david
10 English ted nik
11 English nik ted
14 German jason bob
15 German jason robert
16 German bob jason
18 German bob robert
19 German robert jason
20 German robert bob
23 Russian marc peter
24 Russian peter marc
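The rows above come out grouped by exam, whereas the layout requested in the question follows the original row order of `df`. A possible sketch for reproducing that order is to carry each row's position through the merge and sort on it afterwards. The helper column name `pos` is my own assumption, not something from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Keep each row's original position in a hypothetical helper column 'pos'.
ordered = df.reset_index().rename(columns={'index': 'pos'})

pairs = (
    ordered.merge(ordered, on='exam', suffixes=['', '_shared_with'])
    # Drop the rows that pair a student with themselves.
    .query('student != student_shared_with')
    # Order by the student's position, then by the partner's position.
    .sort_values(['pos', 'pos_shared_with'])
    .loc[:, ['student', 'student_shared_with']]
    .reset_index(drop=True)
)
print(pairs)
```

Sorting explicitly avoids relying on whatever row order the merge happens to return.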
Self join
d1 = df.set_index('exam')
d1.join(
    d1, rsuffix='_shared_with'
).query('student != student_shared_with')
student student_shared_with
exam
English ted nik
English nik ted
French john david
French john kevin
French david john
French david kevin
French kevin john
French kevin david
German jason bob
German jason robert
German bob jason
German bob robert
German robert jason
German robert bob
Russian marc peter
Russian peter marc
itertools.permutations + groupby
from itertools import permutations as perm

cols = ['student', 'student_shared_with']
# Build every ordered pair of students within each exam group.
df.groupby('exam').student.apply(
    lambda x: pd.DataFrame(list(perm(x, 2)), columns=cols)
).reset_index(drop=True)
student student_shared_with
0 ted nik
1 nik ted
2 john david
3 john kevin
4 david john
5 david kevin
6 kevin john
7 kevin david
8 jason bob
9 jason robert
10 bob jason
11 bob robert
12 robert jason
13 robert bob
14 marc peter
15 peter marc
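A variant of the same idea, sketched here as an alternative rather than as the answer's own code: collect the permutations in a plain list and build a single DataFrame at the end, instead of creating one small DataFrame per group inside `apply`. Passing `sort=False` to `groupby` is my own choice, made so that groups appear in order of first occurrence.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Iterating a grouped Series yields (exam, Series of students) tuples;
# permutations(..., 2) gives every ordered pair within one exam.
records = [
    pair
    for _, students in df.groupby('exam', sort=False)['student']
    for pair in permutations(students, 2)
]

pairs = pd.DataFrame(records, columns=['student', 'student_shared_with'])
print(pairs)
```

An exam taken by a single student simply contributes no pairs, since `permutations` of one item with length 2 is empty.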
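In SQL this would be a one-step process, but here it takes two: (1) merge the DataFrame with itself on exam, and (2) drop the rows where student == student_shared_with (since a student does not share an exam with themselves).
df2 = pd.merge(
    df, df, how='outer', on='exam', suffixes=['', '_shared_with']
).drop('exam', axis=1)
df2 = df2.loc[df2.student != df2.student_shared_with]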
student student_shared_with
1 john david
2 john kevin
3 david john
5 david kevin
6 kevin john
7 kevin david
10 ted nik
11 nik ted
14 jason bob
15 jason robert
16 bob jason
18 bob robert
19 robert jason
20 robert bob
23 marc peter
24 peter marc
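For comparison, the one-step SQL formulation mentioned above could look roughly like the following, run against an in-memory SQLite database. This is only a sketch: the table name `exams` and the aliases `a` and `b` are assumptions made for the example, and the row order is unspecified because the query has no ORDER BY.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    'exam': ['French', 'English', 'German', 'Russian', 'Russian',
             'German', 'German', 'French', 'English', 'French'],
    'student': ['john', 'ted', 'jason', 'marc', 'peter', 'bob',
                'robert', 'david', 'nik', 'kevin'],
})

# Load the frame into a throwaway in-memory database.
conn = sqlite3.connect(':memory:')
df.to_sql('exams', conn, index=False)

# Self join on exam; the inequality removes self-pairs in the same step.
query = """
SELECT a.student AS student,
       b.student AS student_shared_with
FROM exams AS a
JOIN exams AS b
  ON a.exam = b.exam
 AND a.student <> b.student
"""
pairs = pd.read_sql_query(query, conn)
conn.close()
print(pairs)
```

The join condition does in one statement what the merge plus the boolean filter do in two pandas steps.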