Convert categorical variables to numeric in Power Query
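My table has many columns that contain text values belonging to categories. For example, the column "ABC" has 9000 rows, but every row must hold a value from the set {"A", "B", "C"}. Other columns, such as Gender, hold "M"/"F"/null.
For each column, I want to convert it in place to a list of integers, so A:1, B:2, C:3, etc.
I have been trying to use List.Distinct to extract the values into a temporary table, add an index column to it, and then use a join to convert the original column according to the mapping in the temp table. However, this seems slow, and I'm not sure how to run it across all the columns in my table (or at least use Table.ColumnsOfType(Source, {type nullable text}) to select the categorical columns...).
Any suggestions?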
Before
Gender | Fruit | [...] |
---|---|---|
F | Cat | |
F | Dog | |
M | Lemon | |
M | Dog | |
M | Lemon | |
null | Cat | |
M | Dog | |
After
Gender | Fruit | [...] |
---|---|---|
1 | 1 | |
1 | 2 | |
2 | 3 | |
2 | 2 | |
2 | 3 | |
null | 1 | |
2 | 2 | |
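For reference, the List.Distinct idea described above can also be sketched without a temporary table or a join. This is only a sketch under assumptions: the inline `#table` is hypothetical sample data matching the example, the step names (`TextColumns`, `MakeTransform`, `Categories`) are made up for illustration, and it assumes the categorical columns already carry a text type, since `Table.ColumnsOfType` selects by declared column type and untyped columns would not be picked up.

```powerquery
let
    // Hypothetical sample data standing in for the real table
    Source = #table(
        type table [Gender = nullable text, Fruit = nullable text],
        {{"F", "Cat"}, {"F", "Dog"}, {"M", "Lemon"}, {"M", "Dog"},
         {"M", "Lemon"}, {null, "Cat"}, {"M", "Dog"}}
    ),
    // Names of the columns whose declared type is text
    TextColumns = Table.ColumnsOfType(Source, {type nullable text}),
    // For one column name, build a {name, function, type} entry for Table.TransformColumns
    MakeTransform = (col as text) =>
        let
            // Distinct non-null values of the column, buffered so the list is computed once
            Categories = List.Buffer(List.RemoveNulls(List.Distinct(Table.Column(Source, col))))
        in
            // List.PositionOf is zero-based, so add 1; nulls are passed through
            {col, each if _ = null then null else List.PositionOf(Categories, _) + 1, type nullable number},
    // Apply every per-column transform in a single pass
    Result = Table.TransformColumns(Source, List.Transform(TextColumns, MakeTransform))
in
    Result
```

Codes follow the order in which List.Distinct returns the values of each column, and nulls are left as null.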
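In Power Query, this seems to work for any number of columns:
Replace all nulls with something else, here +=+
Add an index
Unpivot
Remove duplicates
Group, and add an index to each group
Merge back into the original and expand
Re-pivot
Remove the extra columns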
Full code:
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // unpivoting drops nulls, so swap them for a placeholder first
    // (note: the placeholder is then numbered like any other value)
    #"Replaced Value" = Table.ReplaceValue(Source,null,"+=+",Replacer.ReplaceValue,Table.ColumnNames(Source)),
    #"Added Index" = Table.AddIndexColumn(#"Replaced Value", "Index", 0, 1),
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Added Index", {"Index"}, "Attribute", "Value"),
    // derive a table of replacements: one numbered row per distinct (column, value) pair
    #"Removed Duplicates" = Table.Distinct(#"Unpivoted Other Columns", {"Attribute", "Value"}),
    #"Grouped Rows" = Table.Group(#"Removed Duplicates", {"Attribute"}, {{"GRP", each Table.AddIndexColumn(_, "Index2", 1, 1), type table}}),
    #"Expanded GRP" = Table.ExpandTableColumn(#"Grouped Rows", "GRP", {"Value", "Index2"}, {"Value", "Index2"}),
    // replace originals: look up each value's number, then drop the text value
    #"Merged Queries" = Table.NestedJoin(#"Unpivoted Other Columns",{"Attribute", "Value"},#"Expanded GRP",{"Attribute", "Value"},"EG",JoinKind.LeftOuter),
    #"Expanded Table1" = Table.ExpandTableColumn(#"Merged Queries", "EG", {"Index2"}, {"Index2"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded Table1",{"Value"}),
    // pivot back to the original layout and drop the helper index
    #"Pivoted Column" = Table.Pivot(#"Removed Columns", List.Distinct(#"Removed Columns"[Attribute]), "Attribute", "Index2", List.Sum),
    #"Removed Columns1" = Table.RemoveColumns(#"Pivoted Column",{"Index"})
in
    #"Removed Columns1"
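One thing to be aware of: because the placeholder is an ordinary value by the time the replacement table is built, it receives a code of its own, so nulls come out as a number rather than null. If nulls should stay null, as in the example in the question, one possible variation is to filter the placeholder out before numbering. The sketch below is not part of the query above; it assumes hypothetical inline sample data and uses shortened, made-up step names (`Placeholder`, `Categories`, `Mapping`, and so on), but otherwise follows the same steps.

```powerquery
let
    // Hypothetical sample data; swap in the real Source
    Source = #table({"Gender", "Fruit"},
        {{"F", "Cat"}, {"F", "Dog"}, {"M", "Lemon"}, {"M", "Dog"},
         {"M", "Lemon"}, {null, "Cat"}, {"M", "Dog"}}),
    // Assumed not to occur as a real value in the data
    Placeholder = "+=+",
    Replaced = Table.ReplaceValue(Source, null, Placeholder, Replacer.ReplaceValue, Table.ColumnNames(Source)),
    Indexed = Table.AddIndexColumn(Replaced, "Index", 0, 1),
    Unpivoted = Table.UnpivotOtherColumns(Indexed, {"Index"}, "Attribute", "Value"),
    // Distinct (column, value) pairs, with the placeholder dropped before numbering
    Categories = Table.SelectRows(
        Table.Distinct(Unpivoted, {"Attribute", "Value"}),
        each [Value] <> Placeholder),
    Grouped = Table.Group(Categories, {"Attribute"},
        {{"GRP", each Table.AddIndexColumn(_, "Index2", 1, 1), type table}}),
    Mapping = Table.ExpandTableColumn(Grouped, "GRP", {"Value", "Index2"}),
    // Placeholder rows find no match in Mapping, so their Index2 is null
    Merged = Table.NestedJoin(Unpivoted, {"Attribute", "Value"}, Mapping, {"Attribute", "Value"}, "EG", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Merged, "EG", {"Index2"}),
    Trimmed = Table.RemoveColumns(Expanded, {"Value"}),
    Pivoted = Table.Pivot(Trimmed, List.Distinct(Trimmed[Attribute]), "Attribute", "Index2", List.Sum),
    Result = Table.RemoveColumns(Pivoted, {"Index"})
in
    Result
```

The placeholder is still needed so that unpivoting keeps a row for every original cell; filtering it out only affects the numbering, and the pivot should then carry the unmatched null through to the result.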