How to find unique combinations in a one-hot encoded dataframe?
I have a dataframe called test that looks like this:
+-------+---------+---------+---------+------------+
| | Term 1 | Term 2 | Term 3 | Final Exam |
+-------+---------+---------+---------+------------+
| 1288 | 0 | 0 | 1 | 1 |
| 1290 | 1 | 1 | 1 | 1 |
| 1294 | 0 | 0 | 1 | 1 |
| 1296 | 1 | 1 | 1 | 1 |
| 1297 | 1 | 1 | 1 | 1 |
| 1304 | 0 | 1 | 1 | 1 |
| 1308 | 0 | 0 | 1 | 1 |
| 1324 | 1 | 1 | 1 | 1 |
| 1325 | 1 | 1 | 1 | 1 |
| 1332 | 1 | 1 | 1 | 1 |
+-------+---------+---------+---------+------------+
I want a summary table of all unique combinations of columns equal to 1, along with how many times each combination occurs:
+-----------------------------------+-----------+
| Combination | Frequency |
+-----------------------------------+-----------+
| Term 3, Final Exam | 3 |
| Term 2, Term 3, Final Exam | 1 |
| Term 1, Term 2, Term 3, Final Exam | 6 |
+-----------------------------------+-----------+
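I tried using mlxtend's apriori, but that gives me the occurrences of every subset of columns rather than only the exact per-row combinations: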
from mlxtend.frequent_patterns import apriori

# Mine every itemset that appears in at least one row
results = apriori(test, min_support=0.00001, use_colnames=True)
results['length'] = results['itemsets'].apply(len)
numberofcases = test.shape[0]
# support is a fraction of rows, so scale it back to a count
results['Frequency'] = results['support'] * numberofcases
# Join each frozenset of column names into a readable string
results['Terms'] = results['itemsets'].apply(', '.join)
results[results['length'] > 1][['Terms', 'Frequency']]
Result set:
+-----+-------------------------------------+-----------+
| | Terms | Frequency |
+-----+-------------------------------------+-----------+
| 4 | Term 2, Term 1 | 6.0 |
| 5 | Term 3, Term 1 | 6.0 |
| 6 | Final Exam, Term 1 | 6.0 |
| 7 | Term 2, Term 3 | 7.0 |
| 8 | Term 2, Final Exam | 7.0 |
| 9 | Term 3, Final Exam | 10.0 |
| 10 | Term 2, Term 3, Term 1 | 6.0 |
| 11 | Term 2, Final Exam, Term 1 | 6.0 |
| 12 | Term 3, Final Exam, Term 1 | 6.0 |
| 13 | Term 2, Term 3, Final Exam | 7.0 |
| 14 | Term 2, Term 3, Final Exam, Term 1 | 6.0 |
+-----+-------------------------------------+-----------+
Is there a parameter in apriori that would produce the desired result, or some other method?
You can use dot together with value_counts (df here is your dataframe, test in the question):
df.dot(df.columns+',').str[:-1].value_counts()
Out[419]:
Term1,Term2,Term3,FinalExam 6
Term3,FinalExam 3
Term2,Term3,FinalExam 1
dtype: int64
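To see why this works, here is a minimal, self-contained sketch that rebuilds the frame from the question and applies the same idea. In Python, multiplying a string by 0 gives an empty string and multiplying it by 1 gives the string itself, so the dot product of each 0/1 row with the column labels concatenates exactly the labels whose value is 1. The `", "` separator, the `summary` name and the `Combination`/`Frequency` column names are my own choices to match the desired output, not anything required by pandas.

```python
import pandas as pd

# Rebuild the example frame from the question (index values are the row ids shown there).
test = pd.DataFrame(
    {
        "Term 1":     [0, 1, 0, 1, 1, 0, 0, 1, 1, 1],
        "Term 2":     [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
        "Term 3":     [1] * 10,
        "Final Exam": [1] * 10,
    },
    index=[1288, 1290, 1294, 1296, 1297, 1304, 1308, 1324, 1325, 1332],
)

# Row-wise dot product with the labels: 0 * "Term 1, " -> "", 1 * "Term 1, " -> "Term 1, ".
# The products are concatenated, leaving a trailing ", " that .str[:-2] strips.
combos = test.dot(test.columns + ", ").str[:-2]

# Count identical label strings and shape the result like the desired summary table.
summary = (
    combos.value_counts()
    .rename_axis("Combination")
    .reset_index(name="Frequency")
)
print(summary)
```

Note that this relies on the frame containing only 0/1 (or boolean) values; any other number would repeat a label that many times.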