Using pandas to find repeated string elements in a DataFrame column of lists
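In a pandas DataFrame, one of the columns, food_column, holds a list of strings in each row (a Series of lists). I need to derive an output column from this column.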
Input : food_column
['bread', 'bread', 'bread']
['meat', 'butter', 'butter']
['meat', 'butter', 'bread', 'meat']
['butter']
['bread', 'meat', 'bread', 'meat']
Output : main_column
['bread']
['butter']
['meat']
['butter']
['bread']
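Conditions:
- If a string element is repeated more often than the others, it should be selected as the output element.
- If two or three elements have the same count, pick one of those tied elements with np.random.choice.
- If a row contains only one element, assign/map that element to the output column.
- Otherwise, mark the output column as 'unknown'.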
import pandas as pd
import random
from collections import Counter
import numpy as np
food_list = [
    ['bread', 'bread', 'bread'],
    ['meat', 'butter', 'butter'],
    ['meat', 'butter', 'bread', 'meat'],
    ['butter'],
    ['bread', 'meat', 'bread', 'meat'],
    [''],
]
food_series = pd.Series(food_list)
df = pd.DataFrame({'food_column': food_series})
# Shuffle each list so that ties are broken at random: Counter keeps insertion
# order (dicts are ordered in Python 3.7+, and in CPython 3.6), and max()
# returns the first key with the highest count.
df['random_food_list'] = [random.sample(z, len(z)) for z in df['food_column'].to_list()]
# Count the occurrences of each element per row.
df['food_counts'] = df['random_food_list'].apply(Counter)
# Take the key with the highest count.
df['main_column'] = df['food_counts'].apply(lambda x: max(x, key=x.get))
# Replace empty strings with 'unknown'.
df['main_column'] = np.where(df['main_column'] == '', 'unknown', df['main_column'])
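
The conditions in the question can also be written out more literally, with an explicit random choice among tied elements instead of a shuffle. The following is only a sketch under a few assumptions: the helper name pick_main is hypothetical, the seed passed to np.random.default_rng is arbitrary, and treating both an empty list and a list of empty strings as 'unknown' is one interpretation of the last condition.

```python
from collections import Counter

import numpy as np
import pandas as pd


def pick_main(items, rng):
    """Return the most frequent non-empty string, breaking ties at random.

    pick_main is a hypothetical helper name, not part of pandas or NumPy.
    """
    # Drop empty strings so rows like [''] or [] fall through to 'unknown'
    # (an assumption about what "otherwise" means in the conditions).
    counts = Counter(x for x in items if x)
    if not counts:
        return 'unknown'
    top = max(counts.values())
    # Collect every element that shares the highest count.
    candidates = [k for k, v in counts.items() if v == top]
    # Pick uniformly among the tied candidates; with a single candidate
    # this simply returns it.
    return str(rng.choice(candidates))


rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame({'food_column': [
    ['bread', 'bread', 'bread'],
    ['meat', 'butter', 'butter'],
    ['meat', 'butter', 'bread', 'meat'],
    ['butter'],
    ['bread', 'meat', 'bread', 'meat'],
    [''],
]})
df['main_column'] = df['food_column'].apply(lambda items: pick_main(items, rng))
```

A single-element row needs no special case here, because that element is trivially the only candidate. The fifth row is a genuine tie, so its result is either 'bread' or 'meat' depending on the random draw.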