Splitting datasets into train and test in Julia

Question

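I am trying to split a dataset into train and test subsets in Julia. So far I have tried the MLDataUtils.jl package for this, but the results are not what I expected. Below are my findings and issues: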

Code

using DataFrames

# The inputs are:

a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
             )
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)

# Output of this operation (which is not what I expected):
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A     │ B     │ C     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │
│ 5   │ 5     │ 5     │ 5     │
│ 6   │ 6     │ 6     │ 6     │
│ 7   │ 7     │ 7     │ 7     │
│ 8   │ 8     │ 8     │ 8     │
│ 9   │ 9     │ 9     │ 9     │
│ 10  │ 10    │ 10    │ 10    │

println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

# but x2 is printed as 
(0×3 SubDataFrame, Float64[]) 

# while y2 as 
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)

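However, I would like to split this dataset into two parts, with 70% of the data in train and 30% in test. Please suggest a better way to perform this operation in Julia. Thanks in advance.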

Answer 1

The MLJ.jl developers can probably show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:

julia> using DataFrames, Random

julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
                    )
10×3 DataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     3      3      3
   4 │     4      4      4
   5 │     5      5      5
   6 │     6      6      6
   7 │     7      7      7
   8 │     8      8      8
   9 │     9      9      9
  10 │    10     10     10

julia> function splitdf(df, pct)
           @assert 0 <= pct <= 1
           ids = collect(axes(df, 1))
           shuffle!(ids)
           sel = ids .<= nrow(df) .* pct
           return view(df, sel, :), view(df, .!sel, :)
       end
splitdf (generic function with 1 method)

julia> splitdf(a, 0.7)
(7×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     3      3      3
   2 │     4      4      4
   3 │     6      6      6
   4 │     7      7      7
   5 │     8      8      8
   6 │     9      9      9
   7 │    10     10     10, 3×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     5      5      5)

I am using views to save memory, but if you prefer you can also just materialize the train and test data frames.
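The question also carries a separate label vector `b`, which has to be split with the same rows as the data frame. A minimal sketch of one way to do that, and to materialize the results instead of returning views, is shown here. The helper name `split_indices` and the `rng` keyword are assumptions introduced for illustration; they are not part of DataFrames.jl.

```julia
using DataFrames, Random

# Hypothetical helper (not a DataFrames.jl function): returns shuffled
# row indices for the train and test parts.
function split_indices(n::Integer, pct::Real; rng = Random.default_rng())
    @assert 0 <= pct <= 1
    ids = shuffle(rng, 1:n)          # random permutation of the row numbers
    ntrain = round(Int, n * pct)     # number of rows that go to train
    return ids[1:ntrain], ids[ntrain+1:end]
end

a = DataFrame(A = 1:10, B = 1:10, C = 1:10)
b = collect(1:10)

# A seeded generator makes the split reproducible.
train_idx, test_idx = split_indices(nrow(a), 0.7; rng = MersenneTwister(42))

# Indexing with a vector of rows copies the data, so these are
# standalone DataFrames and Vectors rather than views.
x_train, x_test = a[train_idx, :], a[test_idx, :]
y_train, y_test = b[train_idx], b[test_idx]
```

Because the same index vectors are applied to `a` and `b`, each row of `x_train` stays paired with its label in `y_train`. If you already hold a `SubDataFrame` from `splitdf`, calling `copy` on it should likewise give a standalone `DataFrame`.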

Answer 2

This is how I implemented it for generic arrays in the Beta Machine Learning Toolkit:

using Random

"""
    partition(data, parts; shuffle=true)

Partition (by rows) one or more matrices according to the shares in `parts`.

# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shuffle`: Whether to randomly shuffle the matrices (preserving the relative order between matrices)
"""
function partition(data::AbstractArray{T,1}, parts::AbstractArray{Float64,1}; shuffle=true) where T <: AbstractArray
    n = size(data[1], 1)
    if !all(size.(data, 1) .== n)
        @error "All matrices passed to `partition` must have the same number of rows"
    end
    # Draw the row order once so that every array is permuted identically
    ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    return partition.(data, Ref(parts); shuffle=shuffle, fixedRIdx=ridx)
end

function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1}; shuffle=true, fixedRIdx=Int64[]) where T
    n = size(data, 1)
    nParts = length(parts)  # `size(parts)` is a tuple and would never equal `i`
    toReturn = []
    if !(sum(parts) ≈ 1)
        @error "The sum of `parts` in `partition` should total to 1."
    end
    ridx = fixedRIdx
    if isempty(ridx)
        ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    end
    current = 1
    cumPart = 0.0
    for (i, p) in enumerate(parts)
        cumPart += p
        # The last part always runs to the final row, so rounding cannot drop rows
        final = i == nParts ? n : Int64(round(cumPart * n))
        push!(toReturn, data[ridx[current:final], :])
        current = final + 1
    end
    return toReturn
end

Use it with:

julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])

You can also partition into three or more parts, and the number of arrays to partition is variable as well.

By default they are also shuffled, but you can avoid that with the parameter shuffle...
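To make the multi-part case concrete, here is a small self-contained sketch of a three-way 60/20/20 split that uses only the `Random` standard library. The helper `share_ranges` is a hypothetical name introduced for this illustration and is not part of the Beta Machine Learning Toolkit; it only mirrors the cumulative-share idea used in `partition` above.

```julia
using Random

# Hypothetical helper (not the BetaML API): turn shares that sum to 1
# into consecutive index ranges covering 1:n.
function share_ranges(n::Integer, parts::AbstractVector{<:Real})
    @assert sum(parts) ≈ 1
    bounds = round.(Int, cumsum(parts) .* n)   # last index of each part
    bounds[end] = n                            # the last part always ends at row n
    starts = [1; bounds[1:end-1] .+ 1]         # first index of each part
    return [s:e for (s, e) in zip(starts, bounds)]
end

x = [1:10 11:20]
y = collect(31:40)

# One shared permutation keeps the rows of x and y aligned.
ridx = shuffle(MersenneTwister(1), 1:size(x, 1))
ranges = share_ranges(size(x, 1), [0.6, 0.2, 0.2])

xparts = [x[ridx[r], :] for r in ranges]   # train, validation, test matrices
yparts = [y[ridx[r]] for r in ranges]      # matching label vectors
```

Replacing `ridx` with `collect(1:size(x, 1))` gives the unshuffled variant, in which the parts are simply consecutive blocks of rows.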

Answer 3

using Pkg
Pkg.add("Lathe")
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df)

There is also a positional argument, in the second position, that takes the percentage to split at.

