I have a DataFrame and need to get it into a required format. More details are below.

I have a DataFrame in the following format:

julia> using DataFrames, CSV

julia> str = """Roll,id,name,tag,date1,date2,date3,date4,date5
       0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331
       0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331
       0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331
       0001683168-20-001021,1622879,Mohit,Teacher,20191231
       0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231
       """
"Roll,id,name,tag,date1,date2,date3,date4,date5\n0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331\n0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331\n0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331\n0001683168-20-001021,1622879,Mohit,Teacher,20191231\n0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231\n"

julia> df = CSV.read(IOBuffer(str), DataFrame)
5×9 DataFrame
 Row │ Roll                  id       name    tag      date1     date2     date3     date4    date5
     │ String                Int64    String  String   Int64     Int64?    Int64?    Missing  Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0001443646-19-000093  1443646  Rohit   Student  20170331  20180331  20190331  missing  missing
   2 │ 0001443646-20-000086  1443646  Rohit   Student  20190331  20200331  20180331  missing  missing
   3 │ 0001443646-21-000079  1443646  Rohit   Student  20190331  20210331  20200331  missing  missing
   4 │ 0001683168-20-001021  1622879  Mohit   Teacher  20191231   missing   missing  missing  missing
   5 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231   missing  missing  missing

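First, consider df and the first name in the :name column, i.e. Rohit. The id, tag and name are the same for every Rohit row, but the strings in Roll differ, and some dates are duplicated across the rows for that name. Below is the simplified DataFrame df1, followed by a description of how I built it by hand.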

The target DataFrame I want to obtain is:

julia> df1 = CSV.read(IOBuffer(str), DataFrame)
2×9 DataFrame
 Row │ Roll                  id       name    tag      date1     date2     date3     date4     date5
     │ String                Int64    String  String   Int64     Int64     Int64?    Int64?    Int64?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0001443646-21-000079  1443646  Rohit   Student  20210331  20200331  20190331  20180331  20170331
   2 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231   missing   missing   missing

Let's focus on a few points:

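  1. First, look at the :Roll column. It contains three different Roll strings for Rohit. In 0001443646-21-000079 the middle part is 21, so this string gets higher priority than the other two, 0001443646-20-000086 and 0001443646-19-000093, whose middle parts are 20 and 19 respectively. The dates taken from the 0001443646-21-000079 row are therefore: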
20190331,20210331,20200331,missing,missing

Then I sort them in descending order (the most recent years first, followed by earlier years), which gives:

20210331,20200331,20190331,missing,missing
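
To make the two rules so far concrete, here is a minimal sketch in plain Julia, assuming the dates are stored as integers with `missing` for empty slots. `roll_priority` is a name made up for this illustration.

```julia
# Hypothetical helper: the middle field of a Roll string acts as its priority.
roll_priority(roll) = parse(Int, split(roll, "-")[2])

rolls = ["0001443646-19-000093", "0001443646-20-000086", "0001443646-21-000079"]
best = rolls[argmax(roll_priority.(rolls))]   # the Roll with the highest priority

# Dates of that Roll, in the order they appear in the data.
dates = [20190331, 20210331, 20200331, missing, missing]

# Sort the present values in descending order and keep the missing slots at the end.
present = sort(collect(skipmissing(dates)), rev=true)
sorted = vcat(present, fill(missing, length(dates) - length(present)))
```

A plain `sort(dates, rev=true)` would put the `missing` values first, because `missing` sorts as the largest value, which is why they are set aside here.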

The same is done for the other two strings:

0001443646-19-000093,1443646,Rohit,Student,20190331,20180331,20170331,missing,missing
0001443646-20-000086,1443646,Rohit,Student,20200331,20190331,20180331,missing,missing

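Since the higher-priority Roll is considered first, any date from the other two Roll strings that duplicates one already taken is skipped. The final row for the name Rohit looks like this: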

0001443646-21-000079,1443646,Rohit,Student,20210331,20200331,20190331,20180331,20170331

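As you can see, the other two Roll strings contain duplicates. I would like to turn this whole idea into a script, but I am running into difficulties. Can someone help me?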

I assume you are using DataFrames.jl 1.2.2 and CSV.jl 0.9.2 (the current versions).

Then do:

julia> df = CSV.read(IOBuffer(str), DataFrame, stringtype=String)
5×9 DataFrame
 Row │ Roll                  id       name    tag      date1     date2     date3     date4    date5
     │ String                Int64    String  String   Int64     String    String    String   String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0001443646-19-000093  1443646  Rohit   Student  20170331  20180331  20190331  missing  missing
   2 │ 0001443646-20-000086  1443646  Rohit   Student  20190331  20200331  20180331  missing  missing
   3 │ 0001443646-21-000079  1443646  Rohit   Student  20190331  20210331  20200331  missing  missing
   4 │ 0001683168-20-001021  1622879  Mohit   Teacher  20191231  missing   missing   missing  missing
   5 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231  missing   missing  missing

julia> df.date1 = string.(df.date1) # need to do it as in your example you are mixing string and integer columns later
5-element Vector{String}:
 "20170331"
 "20190331"
 "20190331"
 "20191231"
 "20201231"

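julia> combine(groupby(df, :name)) do sdf
       # highest Roll priority (the middle field) first
       df2 = sort(sdf, :Roll, by=x->parse(Int, split(x,"-")[2]), rev=true)
       dfr = first(df2) # the row to keep; it is a view, so writes go into df2
       pos=findfirst(==("missing"), Vector(dfr[r"date"])) # first free date slot
       for i in 2:nrow(df2), n in names(df2, r"date")
           # copy dates from lower-priority rows that are not present yet
           if !(df2[i, n] in Vector(dfr[r"date"])) && df2[i, n] != "missing"
               pos > 5 && error("unsupported data")
               dfr[r"date"][pos] = df2[i, n]
               pos += 1
           end
       end
       # order the filled slots from the most recent date to the oldest
       dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
       return dfr
       end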
2×9 DataFrame
 Row │ name    Roll                  id       tag      date1     date2     date3     date4     date5
     │ String  String                Int64    String   String    String    String    String    String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Rohit   0001443646-21-000079  1443646  Student  20210331  20200331  20190331  20180331  20170331
   2 │ Mohit   0001683168-21-001161  1622879  Teacher  20201231  20191231  missing   missing   missing

This is not optimized code. I just wrote it while reading your description, and I hope it shows you how the task can be done.
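
For reference, the same rule can be sketched without DataFrames.jl, on plain vectors for a single group. This is only an illustration under some assumptions: `merge_dates` and its `width` keyword are names invented here, the dates are integers or `missing`, and unlike the code above it also removes duplicates inside the highest-priority row.

```julia
# Hypothetical helper, not part of DataFrames.jl: merge the date rows of one group.
# `rolls` holds the Roll strings, `rows` the matching vectors of dates (Int or missing).
function merge_dates(rolls, rows; width=5)
    # Visit the rows from the highest Roll priority (middle field) to the lowest.
    order = sortperm(rolls, by=r -> parse(Int, split(r, "-")[2]), rev=true)
    seen = Int[]
    for i in order, d in rows[i]
        # Keep every non-missing date once.
        if !ismissing(d) && !(d in seen)
            push!(seen, d)
        end
    end
    length(seen) > width && error("unsupported data: more than $width distinct dates")
    sort!(seen, rev=true)
    # Pad with missing up to the fixed number of date columns.
    return rolls[first(order)], vcat(seen, fill(missing, width - length(seen)))
end

rolls = ["0001443646-19-000093", "0001443646-20-000086", "0001443646-21-000079"]
rows = [[20170331, 20180331, 20190331, missing, missing],
        [20190331, 20200331, 20180331, missing, missing],
        [20190331, 20210331, 20200331, missing, missing]]
roll, dates = merge_dates(rolls, rows)
```

Here `roll` is the Roll string of the highest-priority row and `dates` is the merged, descending list of dates for the group.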


If you have missing values rather than the string "missing", do e.g.:

julia> df.date4 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing
 missing
 missing

julia> df.date5 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
 missing
 missing
 missing
 missing
 missing

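julia> combine(groupby(df, :name)) do sdf
           # highest Roll priority (the middle field) first
           df2 = sort(sdf, :Roll, by=x->parse(Int, split(x,"-")[2]), rev=true)
           dfr = first(df2) # the row to keep; it is a view, so writes go into df2
           pos=findfirst(ismissing, Vector(dfr[r"date"])) # first free date slot
           for i in 2:nrow(df2), n in names(df2, r"date")
               # copy dates from lower-priority rows that are not present yet
               if !ismissing(df2[i, n]) && !(any(x -> isequal(df2[i, n], x), dfr[r"date"]))
                   pos > 5 && error("unsupported data")
                   dfr[r"date"][pos] = df2[i, n]
                   pos += 1
               end
           end
           # order the filled slots from the most recent date to the oldest
           dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
           return dfr
       end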
2×9 DataFrame
 Row │ name     Roll                  id       tag      date1     date2     date3     date4     date5
     │ String7  String31              Int64    String7  Int64     Int64?    Int64?    Int64?    Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Rohit    0001443646-21-000079  1443646  Student  20210331  20200331  20190331  20180331  20170331
   2 │ Mohit    0001683168-21-001161  1622879  Teacher  20201231  20191231   missing   missing   missing
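
If a date column holds the literal string "missing", as in the first variant, one possible way to get to real `missing` values is sketched below in plain Julia. The vector `raw` is made-up sample data, not taken from your DataFrame.

```julia
# Made-up sample column: date strings mixed with the literal text "missing".
raw = ["20180331", "missing", "20191231"]

# Parse real dates to Int and turn the placeholder text into `missing`.
dates = [s == "missing" ? missing : parse(Int, s) for s in raw]
```

The result is a vector of integers and `missing`, which is the representation the second variant works with.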