Splitting up a Python dataframe based on column value and then using them in algorithm
I am currently doing a simple frequent-pattern analysis with the Apriori algorithm from mlxtend. At the moment I look at all transactions together, but I would like to split the analysis by country. My current script looks like this:
import pandas as pd
import numpy as np
import pyodbc
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
dataset = pd.read_sql_query("""some query""", cnxn)
# Transform/prep dataset into list data
dataset_tx = dataset.groupby(['ReceiptCode'])['ItemCategoryName'].apply(list).values.tolist()
# Define classifier
te = TransactionEncoder()
# Binary-transform dataset
te_ary = te.fit(dataset_tx).transform(dataset_tx)
# Fit to new dataframe (sparse dataframe)
# Note: pd.SparseDataFrame was removed in pandas 1.0; on newer versions use a regular pd.DataFrame
df = pd.SparseDataFrame(te_ary, columns=te.columns_)
# Run algorithm
frequent_itemsets = apriori(df, min_support=0.10, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
Below is an example of dataset:
+---------------------+------------------+------------------+
| ReceiptCode         | ItemCategoryName | StoreCountryName |
+---------------------+------------------+------------------+
| 0000P70322000031467 | Food             | Denmark          |
| 0000P70322000031867 | Food             | Denmark          |
| 0000P70322000051467 | Interior         | Germany          |
| 0000P70322000087468 | Kitchen          | Switzerland      |
| 0000P70322000031469 | Leisure          | Germany          |
| 0000P70322000031439 | Food             | Switzerland      |
+---------------------+------------------+------------------+
Is it possible to "automatically" create multiple dataframes based on the column StoreCountryName and then use them in the algorithm, i.e. run the analysis on a country-specific dataframe and loop over all countries? I know I could create the dataframes manually and then apply the transformation and analysis to each one.
You can groupby and use a list comprehension to store the dataframes in a list, then loop over them:
g = df.groupby('StoreCountryName')
dfs = [group for _, group in g]

for i in range(len(dfs)):
    dfs[i]['iteration'] = i  # do stuff to each frame
    print(f"{dfs[i]} \n")
           ReceiptCode ItemCategoryName StoreCountryName  iteration
0  0000P70322000031467             Food          Denmark          0
1  0000P70322000031867             Food          Denmark          0

           ReceiptCode ItemCategoryName StoreCountryName  iteration
2  0000P70322000051467         Interior          Germany          1
4  0000P70322000031469          Leisure          Germany          1

           ReceiptCode ItemCategoryName StoreCountryName  iteration
3  0000P70322000087468          Kitchen      Switzerland          2
5  0000P70322000031439             Food      Switzerland          2
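Tying this back to the question, here is a minimal sketch of how each country's sub-frame could be fed through the original Apriori pipeline inside such a loop. The helper name run_apriori is hypothetical, and it uses a plain pd.DataFrame instead of pd.SparseDataFrame (removed in pandas 1.0); adjust to your own query and pandas version.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

def run_apriori(frame):
    # One transaction (list of item categories) per receipt
    transactions = frame.groupby('ReceiptCode')['ItemCategoryName'].apply(list).tolist()
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    basket = pd.DataFrame(te_ary, columns=te.columns_)
    itemsets = apriori(basket, min_support=0.10, use_colnames=True)
    return association_rules(itemsets, metric="confidence", min_threshold=0.3)

# Run the analysis once per country and keep the results keyed by country
rules_per_country = {}
for country, group in dataset.groupby('StoreCountryName'):
    rules_per_country[country] = run_apriori(group)

Each entry of rules_per_country then holds the association rules for one country.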
Or you can create a function and use groupby together with apply:
def myFunc(country):
    # 'country' receives the sub-DataFrame for one StoreCountryName group
    # do stuff
    return country

df.groupby('StoreCountryName').apply(myFunc)
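Along the same lines, a short sketch of what myFunc could do if you want the same per-country pipeline via apply, reusing the hypothetical run_apriori helper sketched above:

def myFunc(group):
    # 'group' is the sub-DataFrame for one StoreCountryName value
    return run_apriori(group)

rules_per_country = dataset.groupby('StoreCountryName').apply(myFunc)

Because myFunc returns a DataFrame of rules, the result of apply is one concatenated frame of rules indexed by country.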