Julia: What is the perfect way to convert a categorical array to a numeric array?

Question

What is the perfect way to convert a categorical array to a plain numeric array? For example:

using CategoricalArrays
a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
b = recode(a, "X"=>1, "Y"=>2, "Z"=>3)

The result of the conversion is still a categorical array, even if we explicitly annotate the types of the assigned values:

b = recode(a, "X"=>1::Int64, "Y"=>2::Int64, "Z"=>3::Int64)

It looks like a different approach is needed here, but I can't figure out which direction to take.

Answer 1

You have two natural options:

julia> recode(unwrap.(a), "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

or

julia> mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
Dict{String, Int64} with 3 entries:
  "Y" => 2
  "Z" => 3
  "X" => 1

julia> [mapping[v] for v in a]
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

The Dict approach is slower, but it is more flexible when you have many levels to map.

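The key function here is unwrap, which drops the "categorical" notion of a CategoricalValue and returns the plain value it wraps (in the Dict approach unwrap gets called automatically).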
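To make the role of unwrap concrete, here is a minimal sketch on the array from the question. The variable names v, s and plain are only illustrative.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

v = a[3]            # indexing yields a CategoricalValue wrapping "Y"
s = unwrap(v)       # the plain String stored behind it

plain = unwrap.(a)  # broadcast over the whole array gives an ordinary Vector{String}
```

Once the values are plain strings, recode has no reason to build a categorical result, which is why the first option returns a Vector{Int64}.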

Also note that if you just want the level codes of the values stored in a CategoricalArray (which is what R does by default), you can do:

julia> levelcode.(a)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

Also note that levelcode maps missing to missing:

julia> x = CategoricalArray(["Y", "X", missing, "Z"])
4-element CategoricalArray{Union{Missing, String},1,UInt32}:
 "Y"
 "X"
 missing
 "Z"

julia> levelcode.(x)
4-element Vector{Union{Missing, Int64}}:
 2
 1
  missing
 3
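If a purely numeric vector is needed downstream, the missing codes have to be handled somehow. A small sketch of two possibilities follows; using 0 as the sentinel is an assumption made for illustration, not something the package prescribes.

```julia
using CategoricalArrays

x = CategoricalArray(["Y", "X", missing, "Z"])

codes = levelcode.(x)                   # missing stays missing

filled = coalesce.(codes, 0)            # replace missing with a sentinel (0 is an arbitrary choice)
present = collect(skipmissing(codes))   # or drop the missing entries entirely
```

Level codes start at 1, so 0 cannot collide with a real code, but any consumer of filled has to know about that convention.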

Answer 2

In addition to Bogumił's answer, one possible approach that should be quite fast is:

julia> b = recode!(similar(a, Int), a, "X"=>1, "Y"=>2, "Z"=>3)
7-element Vector{Int64}:
 1
 1
 2
 3
 2
 2
 3

Answer 3

The other answers cover most of the question, but I think one more solution may be useful:

unwrap.(recode(a, "X"=>1, "Y"=>2, "Z"=>3))

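As the length of the CategoricalArray grows relative to the number of categories, this solution becomes faster than any of the other solutions (as of this writing), and it seems like a very natural one to me (it is almost identical to the OP's attempt). More importantly, the fact that it is faster in these cases says something about CategoricalArrays and about what actually happens when these functions are called.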

By calling dump on a, you can see the structure of this categorical array. Here is a simplified version:

CategoricalVector{String, UInt32, String, CategoricalValue{String, UInt32}, Union{}}
  refs: UInt32[0x00000001, 0x00000001, 0x00000002, 0x00000003, 0x00000002, 0x00000002, 0x00000003]
  pool: CategoricalPool{String, UInt32, CategoricalValue{String, UInt32}}
    levels: String["X","Y","Z"]
    invindex: Dict{String, UInt32}("Y" => 0x00000002, "Z" => 0x00000003, "X" => 0x00000001)

Each category is encoded as a UInt32. The encoded values are stored in the Vector refs. The CategoricalPool pool contains:

levels: the mapping from level code to category (a Vector{String}; the "key" is the index)
invindex: the mapping from category to level code (a Dict{String, UInt32})
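The relationship between the codes and the levels can be checked through the exported functions, without reaching into the refs and pool fields. This is a small sketch; the unwrap. call on the levels is only a defensive step and does nothing if levels already returns plain strings.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])

lv = unwrap.(levels(a))   # the level labels, in level-code order
codes = levelcode.(a)     # one integer code per element

# Indexing the labels by the codes reconstructs the original values.
roundtrip = lv[codes]
```

In other words, the array is fully described by one small table of labels plus one integer per element.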

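This structure can be recoded very efficiently. In many cases we can create a categorical array with new categories without touching refs at all, by swapping out only the part of the pool that describes the codes: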

mapping = Dict("X"=>1, "Y"=>2, "Z"=>3)
b = CategoricalArray{Int64,1,UInt32}(undef, 0)
b.refs = a.refs
levels!(b.pool, [mapping[l] for l in levels(a.pool)])

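The actual recode function creates a new empty categorical array of the same length as a and handles many more edge cases (perhaps most importantly, the case where several categories are collapsed into one in the new encoding).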

The broadcast unwrap then consists of simple lookups in the pool, which Julia is able to optimize well.
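The same idea can be written by hand with exported functions only. This is an illustration of why the per-level approach scales well, not a description of what recode does internally; newvalues is a name introduced here.

```julia
using CategoricalArrays

a = CategoricalArray(["X", "X", "Y", "Z", "Y", "Y", "Z"])
mapping = Dict("X" => 1, "Y" => 2, "Z" => 3)

# One Dict lookup per level (three here), independent of the array length.
newvalues = [mapping[l] for l in unwrap.(levels(a))]

# One cheap integer index per element.
b = newvalues[levelcode.(a)]
```

The expensive part (hashing strings) runs once per category, while the per-element work is plain array indexing.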

Benchmarks

recode on the unwrapped array (Answer 1):

@btime recode(unwrap.($a), "X"=>1, "Y"=>2, "Z"=>3)

length	btime result
100	1.268 μs (5 allocations: 1.84 KiB)
1000	14.872 μs (5 allocations: 15.97 KiB)
10000	151.881 μs (7 allocations: 156.50 KiB)

Dict lookup in a comprehension (Answer 1):

@btime [$mapping[v] for v in $a]

length	btime result
100	2.439 μs (101 allocations: 4.00 KiB)
1000	23.715 μs (1001 allocations: 39.19 KiB)
10000	240.292 μs (10002 allocations: 390.70 KiB)

recode! into a preallocated vector (Answer 2):

@btime recode!(similar($a, Int), $a, "X"=>1, "Y"=>2, "Z"=>3)

length	btime result
100	2.158 μs (104 allocations: 4.09 KiB)
1000	21.347 μs (1004 allocations: 39.28 KiB)
10000	208.035 μs (10005 allocations: 390.80 KiB)

This solution:

@btime unwrap.(recode($a, "X"=>1, "Y"=>2, "Z"=>3))

length	btime result
100	2.360 μs (45 allocations: 4.56 KiB)
1000	4.420 μs (45 allocations: 15.20 KiB)
10000	20.212 μs (47 allocations: 120.55 KiB)
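For anyone who wants to rerun the comparison, here is a sketch for building an input of a given length and confirming that the four approaches agree before timing them. How the original test arrays were generated is not stated, so the random construction below is an assumption.

```julia
using CategoricalArrays

n = 10_000
# Assumed input: n values drawn at random from the three categories.
a = CategoricalArray(rand(["X", "Y", "Z"], n))
mapping = Dict("X" => 1, "Y" => 2, "Z" => 3)

r1 = recode(unwrap.(a), "X" => 1, "Y" => 2, "Z" => 3)
r2 = [mapping[v] for v in a]
r3 = recode!(similar(a, Int), a, "X" => 1, "Y" => 2, "Z" => 3)
r4 = unwrap.(recode(a, "X" => 1, "Y" => 2, "Z" => 3))

# All four should produce the same integers.
all_equal = r1 == r2 == r3 == r4
```

Each expression can then be wrapped in @btime from BenchmarkTools exactly as in the listings above, interpolating a and mapping with $.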
