How to encode a pandas.DataFrame column containing lists using Sklearn.preprocessing
I have a pandas df in which some columns hold lists of data, and I would like to encode the labels inside those lists.
However, I get this error:
ValueError: Expected 2D array, got 1D array instead:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
mins = pd.read_csv('recipes.csv')
enc = OneHotEncoder(handle_unknown='ignore')
X = mins['Ingredients']
'''
[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]]
'''
enc.fit(X)
I would like to get a column of lists with the correctly encoded information:
[[lettuce, tomatoes, ginger, vodka, tomatoes]
[lettuce, tomatoes, flour, vodka, tomatoes]
...
[flour, tomatoes, vodka, vodka, mustard]]
[[0, 1, 2, 3, 1]
[0, 1, 4, 3, 1]
...
[4, 1, 3, 3, 9]]
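To label-encode a list of lists stored in a DataFrame series, first fit the encoder on the unique text labels, then use apply to transform each list of text labels into the corresponding list of fitted integer labels. Here is an example: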
In [2]: import pandas as pd
In [3]: from sklearn import preprocessing
In [4]: df = pd.DataFrame({
   ...:     "Day": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
   ...:     "Veggies&Drinks": [["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:                        ["flour", "vodka", "mustard", "lettuce", "ginger"],
   ...:                        ["mustard", "tomatoes", "ginger", "vodka", "tomatoes"],
   ...:                        ["ginger", "vodka", "lettuce", "tomatoes", "flour"],
   ...:                        ["mustard", "lettuce", "ginger", "flour", "tomatoes"]]})
In [5]: df
Out[5]:
         Day                                Veggies&Drinks
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]
In [9]: label_encoder = preprocessing.LabelEncoder()
In [19]: list_of_veggies_drinks = ["lettuce","tomatoes","ginger","vodka","flour","mustard"]
In [20]: label_encoder.fit(list_of_veggies_drinks)
Out[20]: LabelEncoder()
In [21]: integer_encoded = df["Veggies&Drinks"].apply(lambda x:label_encoder.transform(x))
In [22]: integer_encoded
Out[22]:
0 [2, 4, 1, 5, 4]
1 [0, 5, 3, 2, 1]
2 [3, 4, 1, 5, 4]
3 [1, 5, 2, 4, 0]
4 [3, 2, 1, 0, 4]
Name: Veggies&Drinks, dtype: object
In [23]: df["Encoded"] = integer_encoded
In [24]: df
Out[24]:
         Day                                Veggies&Drinks          Encoded
0     Monday  [lettuce, tomatoes, ginger, vodka, tomatoes]  [2, 4, 1, 5, 4]
1    Tuesday      [flour, vodka, mustard, lettuce, ginger]  [0, 5, 3, 2, 1]
2  Wednesday  [mustard, tomatoes, ginger, vodka, tomatoes]  [3, 4, 1, 5, 4]
3   Thursday     [ginger, vodka, lettuce, tomatoes, flour]  [1, 5, 2, 4, 0]
4     Friday   [mustard, lettuce, ginger, flour, tomatoes]  [3, 2, 1, 0, 4]
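For anyone who would rather run this as a script than paste it into IPython, here is a minimal sketch of the same steps. It makes two assumptions that are not part of the session above: the unique labels are collected from the column itself instead of being typed by hand, and a "Decoded" column (a name made up for this illustration) is added to check the round trip with inverse_transform. Only the first two rows of the example frame are used to keep it short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Day": ["Monday", "Tuesday"],
    "Veggies&Drinks": [
        ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
        ["flour", "vodka", "mustard", "lettuce", "ginger"],
    ],
})

# Assumption: derive the unique labels from the data rather than hard-coding them.
unique_labels = sorted({label for row in df["Veggies&Drinks"] for label in row})

label_encoder = LabelEncoder()
label_encoder.fit(unique_labels)

# transform() returns a NumPy array; tolist() stores plain Python lists in the column.
df["Encoded"] = df["Veggies&Drinks"].apply(
    lambda labels: label_encoder.transform(labels).tolist()
)

# inverse_transform() maps the integers back to the original text labels.
df["Decoded"] = df["Encoded"].apply(
    lambda codes: label_encoder.inverse_transform(codes).tolist()
)

print(df[["Encoded", "Decoded"]])
```

LabelEncoder assigns integers in sorted label order, so the codes depend on which labels were seen during fit. That is why these numbers differ from the ones sketched in the question.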
Since you want to apply it directly to a pandas.DataFrame:
from sklearn.preprocessing import LabelEncoder
# Get a flat list with all the ingredients
all_ingr = mins.Ingredients.apply(pd.Series).stack().values
enc = LabelEncoder()
enc.fit(all_ingr)
mins['Ingredients_enc'] = mins.Ingredients.apply(enc.transform)
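A self-contained variant of the same idea, in case it helps. This is only a sketch under a few assumptions: the small `mins` frame below is a hypothetical stand-in for recipes.csv (its rows are the ones shown in the question), the cells are assumed to already be Python lists rather than strings, and `Series.explode()` (available in pandas 0.25 and later) is swapped in for `apply(pd.Series).stack()` to flatten the column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the data loaded from recipes.csv.
mins = pd.DataFrame({
    "Ingredients": [
        ["lettuce", "tomatoes", "ginger", "vodka", "tomatoes"],
        ["lettuce", "tomatoes", "flour", "vodka", "tomatoes"],
        ["flour", "tomatoes", "vodka", "vodka", "mustard"],
    ]
})

# explode() yields one row per list element, i.e. a flat Series of all ingredients.
all_ingr = mins["Ingredients"].explode()

enc = LabelEncoder()
enc.fit(all_ingr)

# Encode each list; tolist() keeps plain Python lists in the new column.
mins["Ingredients_enc"] = mins["Ingredients"].apply(
    lambda labels: enc.transform(labels).tolist()
)

print(mins)
```

If the column comes straight out of read_csv, the cells may be strings that merely look like lists, in which case they would need to be parsed into real lists before either version works.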