Why is an embarkation_point_2 field added when one_hot_encoder is applied to training data

Question

I am following Vertica's example at https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/AnalyzingData/MachineLearning/DataPreparation/EncodingCategoricalColumns.htm?tocpath=Analyzing%20Data%7CMachine%20Learning%20for%20Predictive%20Analytics%7CData%20Preparation%7C_____3, which uses the Titanic data from Kaggle.

The ONE_HOT_ENCODER_FIT function takes the categorical columns and creates a model describing their encoded representation:

SELECT one_hot_encoder_fit('public.titanic_encoder','titanic_training','sex, embarkation_point'  USING PARAMETERS exclude_columns='', output_view='', extra_levels='{}');

==================
varchar_categories
==================
  category_name  |category_level|category_level_index
-----------------+--------------+--------------------
embarkation_point|      C       |         0
embarkation_point|      Q       |         1
embarkation_point|      S       |         2 <- note S is 2
embarkation_point|              |         3
       sex       |    female    |         0
       sex       |     male     |         1 <-- note male is 1

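So when I apply this model, titanic_encoder, to the titanic_training data, why is embarkation_point_2 added? Shouldn't the output contain just the categorical value (say S) and its encoded value? Why do I see the values 0 and 1 rather than 2, which is the encoded value of S, in the same way that sex male gives sex_1 = 1?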

dbadmin@2e4e746b3e6c(*)=> select * from titanic_training limit 1;
 passenger_id | survived | pclass |          name           | sex  | age | sibling_and_spouse_count | parent_and_child_count |  ticket   | fare | cabin | embarkation_point
--------------+----------+--------+-------------------------+------+-----+--------------------------+------------------------+-----------+------+-------+-------------------
            1 |        0 |      3 | Braund, Mr. Owen Harris | male |  22 |                        1 |                      0 | A/5 21171 | 7.25 |       | S <-- note S
(1 row)



dbadmin@2e4e746b3e6c(*)=> SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder') from titanic_training limit 1;
 passenger_id | survived | pclass |          name           | sex  | sex_1 | age | sibling_and_spouse_count | parent_and_child_count |  ticket   | fare | cabin | embarkation_point | embarkation_point_1 | embarkation_point_2 (<-- why this is here)?
--------------+----------+--------+-------------------------+------+-------+-----+--------------------------+------------------------+-----------+------+-------+-------------------+---------------------+---------------------
            1 |        0 |      3 | Braund, Mr. Owen Harris | male <- note male|     1 <- note  encoded value of male |  22 |                        1 |                      0 | A/5 21171 | 7.25 |       | S <- note S                 |                   0 <- why this is here |                   1 <-- why this is here. Where is 2?
(1 row)

And why is there no embarkation_point_3?

Answer 1

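There are several reasons for your output. First, read the documentation for APPLY_ONE_HOT_ENCODER: https://www.vertica.com/docs/11.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/MachineLearning/APPLY_ONE_HOT_ENCODER.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CMachine%20Learning%20Functions%7CTransformation%20Functions%7C_____5

Two parameters let you get what you want:

drop_first: set it to false to get all the columns. By default one column is dropped for correlation purposes, since it can be inferred from the others. This has pros and cons; see this article: https://inmachineswetrust.com/posts/drop-first-columns/
column_naming: set it to values, but be careful. If your categories contain special characters, you may run into some difficulties.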
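
To connect this to the output in the question: each embarkation_point_N column is a 0/1 indicator for the level whose category_level_index is N, not the index itself. S has index 2, so for an S row embarkation_point_1 is 0 and embarkation_point_2 is 1, while the column for index 0 (C) is the one dropped by default. The queries below are a minimal sketch of the two parameters, reusing the model and table names from the question. The resulting column names (something like sex_0, sex_1, embarkation_point_0 ... with index naming, or names built from the category values with column_naming='values') are my reading of the documentation rather than verified output. As far as I can tell from the same documentation, the empty level with index 3 gets no column because ignore_null defaults to true.

```sql
-- Sketch only: assumes the titanic_encoder model and titanic_training table
-- from the question already exist.

-- Keep one indicator column per level instead of dropping the first
-- (reference) level of each categorical column.
SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder',
                                               drop_first='false')
FROM titanic_training
LIMIT 1;

-- Same, but name the new columns after the category values rather than
-- their level indexes. Watch out for special characters in the values.
SELECT APPLY_ONE_HOT_ENCODER(* USING PARAMETERS model_name='titanic_encoder',
                                               drop_first='false',
                                               column_naming='values')
FROM titanic_training
LIMIT 1;
```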

Badr


machine-learning

vertica

one-hot-encoding