找到设施数量最多的城市
Find the city with highest number of amenities
我目前正在尝试破解一个编程难题,该难题具有非常简单的数据框 host
,其中 2 列名为 city
和 amenities
(均为 object
数据类型) .现在,两列中的条目可以重复多次。以下是 host
的前几条是 beLOW
City Amenities Price($)
NYC {TV,"Wireless Internet", "Air conditioning","Smoke 8
detector",Essentials,"Lock on bedroom door"}
LA {"Wireless Internet",Kitchen,Washer,Dryer,"First aid
kit",Essentials,"Hair dryer","translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"}
10
SF {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free
parking on premises","Pets live on this
property",Dog(s),"Indoor fireplace","Buzzer/wireless
intercom",Heating,Washer,Dryer,"Smoke detector","Carbon
monoxide detector","First aid kit","Safety card","Fire e
extinguisher",Essentials,Shampoo,"24-hour check-
in",Hangers,"Hair dryer",Iron,"Laptop friendly
workspace","translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50","Self Check-In",Lockbox} 15
NYC {"Wireless Internet","Air
conditioning",Kitchen,Heating,"Suitable for events","Smoke
detector","Carbon monoxide detector","First aid kit","Fire
extinguisher",Essentials,Shampoo,"Lock on bedroom
door",Hangers,"translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"} 20
LA {TV,Internet,"Wireless Internet","Air
conditioning",Kitchen,"Free parking on
premises",Essentials,Shampoo,"translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"}
LA {TV,"Cable TV",Internet,"Wireless Internet",Pool,Kitchen,"Free
parking on premises",Gym,Breakfast,"Hot tub","Indoor
fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke
detector","Carbon monoxide detector",Essentials,Shampoo,"Lock
on bedroom door",Hangers,"Private entrance"} 28
.....
问题。输出设施数量最多的城市。
我的尝试。 我尝试使用 groupby()
函数根据 city
列使用 host.groupby('city').
对其进行分组现在,我需要 count 成功计算每组Amenities中的元素数量。由于数据类型不同,len()
函数没有起作用,因为集合中每个元素之间有\
(例如,如果我使用host['amenities'][0],
,输出是"{TV,\"Wireless Internet\",\"Air conditioning\",\"Smoke detector\",\"Carbon monoxide detector\",Essentials,\"Lock on bedroom door\",Hangers,Iron}"
。将 len()
应用于此输出将导致 134,这显然是不正确的)。我尝试使用 host['amenities'][0].strip('\n')
删除 \,
但 len()
函数仍然给出 134.
谁能帮我解决这个问题?
我的解决方案,灵感来自 ddejohn 的解决方案:
### Transform each "string-type" entry in column "amenities" to "list" type
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",")
## Create a new column that count all the amenities for each row
entry host["am_count"] = [len(data) for data in host["amenities"]]
## Output the index in the new column resulting from aggregation over the column `am_count` grouped by `city`
host.groupby("city")["am_count"].agg("sum").argmax()
解决方案
import functools
# Process the Amenities strings into sets of strings
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",").apply(set)
# Groupby city, perform the set union to remove duplicates, and get count of unique amenities
amenities_by_city = host.groupby("city")["amenities"].apply(lambda x: len(functools.reduce(set.union, x))).reset_index()
输出:
city amenities
0 LA 27
1 NYC 17
2 SF 29
通过
获得设施最多的城市
city_with_most_amenities = amenities_by_city.query("amenities == amenities.max()")
输出:
city amenities
2 SF 29
我目前正在尝试破解一个编程难题,该难题具有非常简单的数据框 host
,其中 2 列名为 city
和 amenities
(均为 object
数据类型) .现在,两列中的条目可以重复多次。以下是 host
的前几条是 beLOW
City Amenities Price($)
NYC {TV,"Wireless Internet", "Air conditioning","Smoke 8
detector",Essentials,"Lock on bedroom door"}
LA {"Wireless Internet",Kitchen,Washer,Dryer,"First aid
kit",Essentials,"Hair dryer","translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"}
10
SF {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,"Free
parking on premises","Pets live on this
property",Dog(s),"Indoor fireplace","Buzzer/wireless
intercom",Heating,Washer,Dryer,"Smoke detector","Carbon
monoxide detector","First aid kit","Safety card","Fire e
extinguisher",Essentials,Shampoo,"24-hour check-
in",Hangers,"Hair dryer",Iron,"Laptop friendly
workspace","translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50","Self Check-In",Lockbox} 15
NYC {"Wireless Internet","Air
conditioning",Kitchen,Heating,"Suitable for events","Smoke
detector","Carbon monoxide detector","First aid kit","Fire
extinguisher",Essentials,Shampoo,"Lock on bedroom
door",Hangers,"translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"} 20
LA {TV,Internet,"Wireless Internet","Air
conditioning",Kitchen,"Free parking on
premises",Essentials,Shampoo,"translation missing:
en.hosting_amenity_49","translation missing:
en.hosting_amenity_50"}
LA {TV,"Cable TV",Internet,"Wireless Internet",Pool,Kitchen,"Free
parking on premises",Gym,Breakfast,"Hot tub","Indoor
fireplace",Heating,"Family/kid friendly",Washer,Dryer,"Smoke
detector","Carbon monoxide detector",Essentials,Shampoo,"Lock
on bedroom door",Hangers,"Private entrance"} 28
.....
问题。输出设施数量最多的城市。
我的尝试。 我尝试使用 groupby()
函数根据 city
列使用 host.groupby('city').
对其进行分组现在,我需要 count 成功计算每组Amenities中的元素数量。由于数据类型不同,len()
函数没有起作用,因为集合中每个元素之间有\
(例如,如果我使用host['amenities'][0],
,输出是"{TV,\"Wireless Internet\",\"Air conditioning\",\"Smoke detector\",\"Carbon monoxide detector\",Essentials,\"Lock on bedroom door\",Hangers,Iron}"
。将 len()
应用于此输出将导致 134,这显然是不正确的)。我尝试使用 host['amenities'][0].strip('\n')
删除 \,
但 len()
函数仍然给出 134.
谁能帮我解决这个问题?
我的解决方案,灵感来自 ddejohn 的解决方案:
### Transform each "string-type" entry in column "amenities" to "list" type
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",")
## Create a new column that count all the amenities for each row
entry host["am_count"] = [len(data) for data in host["amenities"]]
## Output the index in the new column resulting from aggregation over the column `am_count` grouped by `city`
host.groupby("city")["am_count"].agg("sum").argmax()
解决方案
import functools
# Process the Amenities strings into sets of strings
host["amenities"] = host["amenities"].str.replace('["{}]', "", regex=True).str.split(",").apply(set)
# Groupby city, perform the set union to remove duplicates, and get count of unique amenities
amenities_by_city = host.groupby("city")["amenities"].apply(lambda x: len(functools.reduce(set.union, x))).reset_index()
输出:
city amenities
0 LA 27
1 NYC 17
2 SF 29
通过
获得设施最多的城市city_with_most_amenities = amenities_by_city.query("amenities == amenities.max()")
输出:
city amenities
2 SF 29