Checking whether a value of a variable belongs to a set (bootstrap)

Question

I have an array of integers

theIndex = [ 1 2 6 7 17 2]

and a data frame with a column dataset[:id] that contains integers, say

dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])

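I want to select all observations in the dataset whose id belongs to the index. If an id appears twice (or more) in the index, I want to select its observations twice (or more).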

At the moment, I am doing it the dumb way:

theIndex = [ 1 2 6 7 17 2]
dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
dataset2 = DataFrame(id=Int64[])
for ii1=1:size(theIndex,2)
    for ii2=1:size(dataset[:id],1)
        any(i->i.==dataset[ii2,:id],theIndex[ii1]) ? 
        push!(dataset2,dataset[ii2,:id]) : nothing
    end
end

Is there a more elegant solution?

Answer 1

As per my earlier comment, you are looking for the findin function.

julia> Ind = findin( dataset[:id], theIndex); # return indices of elements in
                                              # dataset[:id] that occur in
                                              # theIndex

julia> dataset[:id][Ind]
4-element DataArrays.DataArray{Int64,1}:
 1
 1
 2
 2

(Or, if you would rather get the result back as a SubDataFrame, i.e. a view into your dataset, you can do SubDataFrame(dataset, Ind), etc.)

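Edit: As per the comments, to make sure that duplicates in theIndex are taken into account, the matches for each element have to be appended separately: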

Ind = []; for i in theIndex; append!(Ind, findin(dataset[:id], i)); end

Ind can then be used to create an array or a SubDataFrame, as shown above.
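For readers on Julia 1.x, where findin is no longer available, the same two lookups can be sketched with findall. This is a minimal sketch, not the answer's original code: a plain vector ids stands in for dataset[:id], theIndex is written as a vector rather than a 1×6 matrix, and the name IndRep is made up for illustration.

```julia
# Assumption: plain vectors stand in for dataset[:id] and theIndex.
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
theIndex = [1, 2, 6, 7, 17, 2]

# Rows of ids whose value occurs anywhere in theIndex
# (the Julia 1.x counterpart of findin(ids, theIndex)); duplicates are ignored.
Ind = findall(in(theIndex), ids)

# One lookup per element of theIndex, so a value listed twice contributes its rows twice.
IndRep = Int[]
for i in theIndex
    append!(IndRep, findall(==(i), ids))
end

println(ids[Ind])     # rows selected once
println(ids[IndRep])  # rows selected once per occurrence in theIndex
```

With the data above, Ind holds rows 1 to 4, while IndRep holds rows 1 to 4 followed by rows 3 and 4 again, because 2 is listed twice in theIndex.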

Edit 2:

julia> @time dataset2 = DataFrame(id=Int64[])
       for ii1=1:size(theIndex,2)
           for ii2=1:size(dataset[:id],1)
               any(i->i.==dataset[ii2,:id],theIndex[ii1]) && 
               push!(dataset2,dataset[ii2,:id])
           end
       end
  0.000016 seconds (24 allocations: 1.594 KiB)

julia> @time Ind = []; for i in theIndex; append!(Ind, findin(dataset[:id], i)); end
  0.000002 seconds (5 allocations: 240 bytes)

(The usual cautionary rant about benchmarking in global scope, bla bla, applies.)
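Note also that, in the transcript above, @time appears to cover only the first statement of each input (the DataFrame construction and Ind = []), not the loops that follow, so the two numbers are not a comparison of the approaches. A rough sketch of a fairer measurement wraps each approach in a function and times it after a warm-up call. The function names select_loop and select_find are hypothetical, plain vectors replace the data frame, and findall(==(v), ids) is used as the Julia 1.x stand-in for findin.

```julia
# Assumption: plain vectors stand in for dataset[:id] and theIndex.
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
theIndex = [1, 2, 6, 7, 17, 2]

# Nested-loop approach from the question, collecting the matching values.
function select_loop(ids, index)
    out = Int[]
    for v in index, x in ids      # outer loop over index, inner loop over ids
        x == v && push!(out, x)
    end
    return out
end

# Per-element lookup approach, collecting row indices first.
function select_find(ids, index)
    rows = Int[]
    for v in index
        append!(rows, findall(==(v), ids))
    end
    return ids[rows]
end

# Warm-up calls, so that compilation time is not part of the measurement.
select_loop(ids, theIndex)
select_find(ids, theIndex)

t_loop = @elapsed select_loop(ids, theIndex)
t_find = @elapsed select_find(ids, theIndex)
println("loop: ", t_loop, " s   find: ", t_find, " s")
```

On inputs this small a single timing is dominated by noise, so the measurement only becomes meaningful on larger data or when repeated many times.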

Answer 2

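In essence, the question asks for an SQL JOIN between theIndex and dataset. Unfortunately, this functionality is not fully implemented in DataFrames. So here is a quick (and efficient) emulation of a JOIN for this purpose: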

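using DataFrames      # DataFrame, nrow, sort!
using StatsBase       # countmap
using DataStructures  # SortedDict

sort!(dataset, cols=:id)   # order the rows by id so one forward scan suffices
j = 1                      # current row in the sorted dataset
newvec = Vector{Int}()     # row indices to select, repeated as needed
# Visit the distinct values of theIndex in ascending order, with their counts.
for (val,cnt) in SortedDict(countmap(theIndex))
    while j<=nrow(dataset)
        dataset[j,:id] > val && break                          # no more rows for val
        dataset[j,:id] == val && append!(newvec,fill(j,cnt))   # take row j cnt times
        j += 1
    end
end
dataset2 = dataset[newvec,:]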

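The DataStructures package is used for SortedDict. This implementation should be more efficient than the other multi-loop approaches because, once the dataset is sorted, it makes a single forward pass over the rows. Note that the selected rows come out ordered by id rather than in the order of theIndex.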
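The same JOIN idea can also be sketched without sorting or extra packages by building a lookup table first. This is a different technique from the sorted scan above (a hash join, using only Base Julia), shown under the assumption that plain vectors stand in for dataset[:id] and theIndex; the name rows_by_id is made up for illustration.

```julia
# Assumption: plain vectors stand in for dataset[:id] and theIndex.
ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
theIndex = [1, 2, 6, 7, 17, 2]

# Build side: map each id to the rows in which it occurs.
rows_by_id = Dict{Int,Vector{Int}}()
for (row, id) in enumerate(ids)
    push!(get!(rows_by_id, id, Int[]), row)
end

# Probe side: one dictionary lookup per element of theIndex.
# Values absent from ids (6, 7, 17) contribute nothing.
newvec = Int[]
for v in theIndex
    append!(newvec, get(rows_by_id, v, Int[]))
end

println(newvec)       # row indices, in the order of theIndex
println(ids[newvec])  # the selected ids
```

Unlike the sorted scan, this keeps the selected rows in the order of theIndex, and newvec could be used in the same way, e.g. dataset[newvec,:], to build the resulting data frame.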
