Why embarkation_point_2 field gets added when one_hot_encoder is applied to training data
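I am working through an example that uses the Titanic data from Kaggle.
The ONE_HOT_ENCODER_FIT function transforms categorical data and creates a model that holds the new representation of that data: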
SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point' USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');
==================
varchar_categories
==================
category_name |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point| C | 0
embarkation_point| Q | 1
embarkation_point| S | 2 <- note S is 2
embarkation_point| | 3
sex | female | 0
sex | male | 1 <-- note male is 1
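So when I apply the model titanic_encoder to the titanic_training data, why is embarkation_point_2 added? Shouldn't the output contain only the categorical value (for example S) and its encoded value? Why do I see the values 0 and 1 instead of 2, which is the encoded value of S? I expected something similar to sex = male and sex_1 = 1.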
dbadmin@2e4e746b3e6c(*)=> select * from titanic_training limit 1;
passenger_id | survived | pclass | name | sex | age | sibling_and_spouse_count | parent_and_child_count | ticket | fare | cabin | embarkation_point
--------------+----------+--------+-------------------------+------+-----+--------------------------+------------------------+-----------+------+-------+-------------------
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S <-- note S
(1 row)
dbadmin@2e4e746b3e6c(*)=> SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') from titanic_training limit 1;
passenger_id | survived | pclass | name | sex | sex_1 | age | sibling_and_spouse_count | parent_and_child_count | ticket | fare | cabin | embarkation_point | embarkation_point_1 | embarkation_point_2 (<-- why this is here)?
--------------+----------+--------+-------------------------+------+-------+-----+--------------------------+------------------------+-----------+------+-------+-------------------+---------------------+---------------------
1 | 0 | 3 | Braund, Mr. Owen Harris | male <- note male| 1 <- note encoded value of male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S <- note S | 0 <- why this is here | 1 <-- why this is here. Where is 2?
(1 row)
Why is there no embarkation_point_3?
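There are several reasons for your output.
First, read the documentation for APPLY_ONE_HOT_ENCODER:
https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/APPLY_ONE_HOT_ENCODER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMachine%20Learning%20Functions%7CTransformation%20Functions%7C_____5
Two parameters let you get what you want:
- drop_first: set it to false to get all columns. By default one column is dropped because it is redundant (correlated) with the others. This article discusses the pros and cons: https://inmachineswetrust.com/posts/drop-first-columns/
- column_naming: set it to values, but be careful. If your categories contain special characters, you may run into some difficulties.
Badr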
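As a rough illustration of the two parameters mentioned above, the query below re-runs the encoder from the question with drop_first turned off and column_naming set to values. It is a sketch, not tested against the asker's database: it assumes the titanic_encoder model and titanic_training table exist exactly as shown in the question, and the parameter values are quoted the way the Vertica documentation examples quote them.

```sql
-- Sketch only: model and table names are taken from the question.
-- drop_first='false' keeps a column for the first level (C, index 0) as well,
-- instead of dropping it as the default does.
-- column_naming='values' names each new column after the category level
-- rather than after its numeric index.
SELECT APPLY_ONE_HOT_ENCODER(
           * USING PARAMETERS
               model_name    = 'titanic_encoder',
               drop_first    = 'false',
               column_naming = 'values'
       )
FROM titanic_training
LIMIT 1;
```

This also explains the original output. Each new column is a 0/1 flag for one level, not the level's index. With the default drop_first, the column for index 0 (C) is dropped, so embarkation_point_1 is the flag for Q and embarkation_point_2 is the flag for S, and a passenger who embarked at S gets 0 and 1. As far as I can tell from the documentation, the default ignore_null setting means the empty level (index 3) is not given a column of its own, which would be why embarkation_point_3 does not appear.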