Is there a simpler way to get objects from a groupby and put them in a dictionary?
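I am trying to find a simpler way to get objects out of a groupby and put them in a dictionary.
At the moment I have to get the index and then run a for loop to read the exact string of each row in Product.
My dataframe has Order ID, Order Date and Product columns; a sample is included further down.
In case more detail is needed:
My goal is to find duplicated Order IDs, then take the Product values from those rows and add them to a dictionary:
key = the product
value = the number of times the product was found ordered together with another product
(I am not looking for ways to optimize finding the duplicates; I know I can use df.duplicated.)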
Code:
product_dict = {}
for date, df in df1.groupby('Order Date'):
    if df.Product.count() > 1:
        indice = df.Product.index
        for data in indice:
            product = df.loc[data].at['Product']
            # update dictionary counter (default to 0 for a product not seen yet)
            product_dict[product] = product_dict.get(product, 0) + 1
    else:
        continue
For convenience, you can use this df instead. I have listed it as a dictionary:
{'Order ID': ['147268', '148041', '149343', '149964', '149350', '141732', '149620', '142451', '146039', '143498', '141316', '144804', '144804', '145270', '142789'],
'Product': ['Wired Headphones', 'USB-C Charging Cable', 'Apple Airpods Headphones', 'AAA Batteries (4-pack)', 'USB-C Charging Cable', 'iPhone', 'Lightning Charging Cable', 'AAA Batteries (4-pack)', '34in Ultrawide Monitor', 'AA Batteries (4-pack)', 'AAA Batteries (4-pack)', 'Wired Headphones', 'iPhone', 'Google Phone', 'AAA Batteries (4-pack)']}
Expected output:
{'Wired Headphones': 8090, 'USB-C Charging Cable': 9425, 'Apple Airpods Headphones': 6374, 'AAA Batteries (4-pack)': 8266, 'iPhone': 3663, 'Lightning Charging Cable': 9074, '34in Ultrawide Monitor': 2500, 'AA Batteries (4-pack)': 8167, 'Google Phone': 3091, 'Macbook Pro Laptop': 1878, 'ThinkPad Laptop': 1605, '27in FHD Monitor': 3010, 'Bose SoundSport Headphones': 5459, 'Flatscreen TV': 1827, '27in 4K Gaming Monitor': 2457, 'LG Dryer': 257, '20in Monitor': 1635, 'LG Washing Machine': 268, 'Vareebadd Phone': 1120}
# number of products per order
prods_per_order = df.groupby(['Order ID'])["Product"].transform("count")
res = (
    df.loc[prods_per_order > 1, "Product"]  # keep only products ordered together with at least one other product
    .value_counts()  # count how many times each product occurs
    .to_dict()  # convert the result to a dict
)
Input
import pandas as pd

df = pd.DataFrame({
    'Order ID': ['147268', '148041', '149343', '149964', '149350',
                 '141732', '149620', '142451', '146039', '143498',
                 '141316', '144804', '144804', '145270', '142789'],
    'Product': ['Wired Headphones', 'USB-C Charging Cable', 'Apple Airpods Headphones',
                'AAA Batteries (4-pack)', 'USB-C Charging Cable', 'iPhone',
                'Lightning Charging Cable', 'AAA Batteries (4-pack)', '34in Ultrawide Monitor',
                'AA Batteries (4-pack)', 'AAA Batteries (4-pack)', 'Wired Headphones',
                'iPhone', 'Google Phone', 'AAA Batteries (4-pack)']
})
# sorting is only for display; it does not affect the result
df = df.sort_values(['Order ID', 'Product'])
>>> df
    Order ID                   Product
10    141316    AAA Batteries (4-pack)
5     141732                    iPhone
7     142451    AAA Batteries (4-pack)
14    142789    AAA Batteries (4-pack)
9     143498     AA Batteries (4-pack)
11    144804          Wired Headphones  # <-- Note that only these two products
12    144804                    iPhone  # <-- were ordered together
13    145270              Google Phone
8     146039    34in Ultrawide Monitor
0     147268          Wired Headphones
1     148041      USB-C Charging Cable
2     149343  Apple Airpods Headphones
4     149350      USB-C Charging Cable
6     149620  Lightning Charging Cable
3     149964    AAA Batteries (4-pack)
Output
>>> res
{'iPhone': 1, 'Wired Headphones': 1}
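For intuition about the mask used above, here is a minimal sketch on a small hypothetical frame (the order IDs "A", "B" and "C" are made up for illustration). It shows that transform("count") returns one value per row, and compares the result with a df.duplicated(..., keep=False) mask, which I would expect to be equivalent as long as no Product value is missing.

```python
import pandas as pd

# Hypothetical data: order "A" holds two products, "B" and "C" one each.
df = pd.DataFrame({
    "Order ID": ["A", "A", "B", "C"],
    "Product": ["iPhone", "Wired Headphones", "iPhone", "Google Phone"],
})

# transform("count") broadcasts each group's size back onto its rows,
# so the result has the same index as df.
prods_per_order = df.groupby("Order ID")["Product"].transform("count")
print(prods_per_order.tolist())  # [2, 2, 1, 1]

# Boolean mask: True for rows that belong to an order with more than one product.
mask = prods_per_order > 1
res = df.loc[mask, "Product"].value_counts().to_dict()

# Alternative mask: rows whose Order ID occurs more than once.
# Assumption: no missing Product values ("count" skips NaN, duplicated does not).
alt_mask = df.duplicated("Order ID", keep=False)
alt = df.loc[alt_mask, "Product"].value_counts().to_dict()

print(res == alt)  # True
```

Because the mask shares df's index, it can be passed straight to df.loc without any explicit loop over index labels.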
Maybe I am misunderstanding, but it seems you could achieve what you want by using Counter:
from collections import Counter

mask = (
    df.groupby(["Order Date", "Order ID"], sort=False)["Product"]
    .transform("count")
    .gt(1)
)
product_dict = Counter(df.loc[mask, "Product"])
The result for a slightly modified sample dataframe (with an Order Date column added)
    Order Date  Order ID                   Product
0   2021-11-11    147268          Wired Headphones
1   2021-11-11    148041      USB-C Charging Cable
2   2021-11-11    149343  Apple Airpods Headphones
3   2021-11-11    149964    AAA Batteries (4-pack)
4   2021-11-11    149350      USB-C Charging Cable
5   2021-11-12    141732                    iPhone
6   2021-11-12    149620  Lightning Charging Cable
7   2021-11-12    142451    AAA Batteries (4-pack)
8   2021-11-12    146039    34in Ultrawide Monitor
9   2021-11-12    143498     AA Batteries (4-pack)
10  2021-11-12    141316    AAA Batteries (4-pack)
11  2021-11-12    144804          Wired Headphones
12  2021-11-12    144804                    iPhone
13  2021-11-12    145270              Google Phone
14  2021-11-12    142789    AAA Batteries (4-pack)
is
Counter({'Wired Headphones': 1, 'iPhone': 1})
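As a side note on why Counter fits here, a small sketch with made-up product names: Counter is a dict subclass that treats missing keys as 0, so the manual get-and-add pattern from the question is not needed.

```python
from collections import Counter

# Hypothetical list of products, standing in for df.loc[mask, "Product"].
products = ["Wired Headphones", "iPhone", "iPhone"]

product_dict = Counter(products)

# Missing keys read as 0 instead of raising KeyError.
missing = product_dict["Google Phone"]

# Incrementing works even for a product that has not been seen yet.
product_dict["Google Phone"] += 1

# A plain dict can be recovered if a Counter is not wanted downstream.
as_dict = dict(product_dict)
print(product_dict.most_common(1))  # [('iPhone', 2)]
```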
Maybe grouping by Order ID alone would be enough, but since you are grouping on Order Date, I suspect it is not.
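To illustrate when the choice of grouping keys matters, here is a sketch under the assumption that an Order ID can be reused on different dates. The data and the helper together_counts are hypothetical and not part of the question.

```python
import pandas as pd

# Hypothetical data: Order ID "1" appears on two different dates.
df = pd.DataFrame({
    "Order Date": ["2021-11-11", "2021-11-12", "2021-11-12", "2021-11-12"],
    "Order ID": ["1", "1", "2", "2"],
    "Product": ["iPhone", "Google Phone", "iPhone", "Wired Headphones"],
})


def together_counts(frame, keys):
    """Count products that sit in a group (defined by keys) with more than one row."""
    mask = frame.groupby(keys, sort=False)["Product"].transform("count").gt(1)
    return frame.loc[mask, "Product"].value_counts().to_dict()


# Grouping on Order ID alone treats both rows with ID "1" as one order.
by_id = together_counts(df, ["Order ID"])

# Grouping on date and ID keeps them apart, so only order "2" qualifies.
by_date_and_id = together_counts(df, ["Order Date", "Order ID"])

print(by_id == by_date_and_id)  # False
```

If Order IDs are unique across dates in the real data, both groupings give the same dictionary and the shorter key list is enough.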