I have a DataFrame and need to convert it into a required format. Details are below.
My DataFrame has the following format:
julia> using DataFrames, CSV
julia> str = """Roll,id,name,tag,date1,date2,date3,date4,date5
0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331
0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331
0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331
0001683168-20-001021,1622879,Mohit,Teacher,20191231
0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231
"""
"Roll,id,name,tag,date1,date2,date3,date4,date5\n0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331\n0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331\n0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331\n0001683168-20-001021,1622879,Mohit,Teacher,20191231\n0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231\n"
julia> df = CSV.read(IOBuffer(str), DataFrame)
5×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 Int64? Int64? Missing Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-19-000093 1443646 Rohit Student 20170331 20180331 20190331 missing missing
2 │ 0001443646-20-000086 1443646 Rohit Student 20190331 20200331 20180331 missing missing
3 │ 0001443646-21-000079 1443646 Rohit Student 20190331 20210331 20200331 missing missing
4 │ 0001683168-20-001021 1622879 Mohit Teacher 20191231 missing missing missing missing
5 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
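First, consider df and the first name in the :name column, Rohit. The id and tag values are the same for every Rohit row, but the Roll strings differ, and some dates are repeated across those rows.
The target DataFrame df1 that I want to obtain, and which I built by hand, is: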
julia> df1 = CSV.read(IOBuffer(str), DataFrame)
2×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 Int64 Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-21-000079 1443646 Rohit Student 20210331 20200331 20190331 20180331 20170331
2 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
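Some points to note:
- First, look at the :Roll column. For Rohit there are three different Roll strings. The middle field gives the priority: 0001443646-21-000079 has 21, so it takes priority over 0001443646-20-000086 and 0001443646-19-000093, which have 20 and 19 respectively. The dates taken from 0001443646-21-000079 are:
20190331,20210331,20200331,missing,missing
I then sort them in descending order (most recent years first, then earlier years):
20210331,20200331,20190331,missing,missing
The same is done for the other two strings:
0001443646-19-000093,1443646,Rohit,Student,20190331,20180331,20170331,missing,missing
0001443646-20-000086,1443646,Rohit,Student,20200331,20190331,20180331,missing,missing
Because the higher-priority Roll is considered first, any date from the other two Roll strings that is already present counts as a duplicate and is skipped. The final row for the name Rohit is:
0001443646-21-000079,1443646,Rohit,Student,20210331,20200331,20190331,20180331,20170331
As you can see, the other two Roll strings contain duplicates. I want to turn this whole idea into a script, but I am running into difficulties. Can someone help me?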
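The priority rule described above (the middle field of the Roll string decides which row wins) can be sketched in plain Julia. This is only an illustration under that reading of the rule; roll_priority is a hypothetical helper name and is not part of DataFrames.jl or CSV.jl.

```julia
# Hypothetical helper: take the middle field of a Roll string such as
# "0001443646-21-000079" and parse it as an integer priority.
roll_priority(roll::AbstractString) = parse(Int, split(roll, "-")[2])

rolls = ["0001443646-19-000093", "0001443646-20-000086", "0001443646-21-000079"]

# Sort so that the highest-priority Roll comes first.
ordered = sort(rolls, by=roll_priority, rev=true)
top = first(ordered)
```

Here top is the Roll string that should be kept in the final row for the group.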
I assume you are using DataFrames.jl 1.2.2 and CSV.jl 0.9.2 (the current versions).
Then do:
julia> df = CSV.read(IOBuffer(str), DataFrame, stringtype=String)
5×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 String String String String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-19-000093 1443646 Rohit Student 20170331 20180331 20190331 missing missing
2 │ 0001443646-20-000086 1443646 Rohit Student 20190331 20200331 20180331 missing missing
3 │ 0001443646-21-000079 1443646 Rohit Student 20190331 20210331 20200331 missing missing
4 │ 0001683168-20-001021 1622879 Mohit Teacher 20191231 missing missing missing missing
5 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
julia> df.date1 = string.(df.date1) # need to do it as in your example you are mixing string and integer columns later
5-element Vector{String}:
"20170331"
"20190331"
"20190331"
"20191231"
"20201231"
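julia> combine(groupby(df, :name)) do sdf
           # order rows by the middle field of Roll, highest priority first
           df2 = sort(sdf, :Roll, by=x->parse(Int, split(x,"-")[2]), rev=true)
           dfr = first(df2) # top-priority row; a view into df2
           # first free slot among the date columns
           pos = findfirst(==("missing"), Vector(dfr[r"date"]))
           for i in 2:nrow(df2), n in names(df2, r"date")
               # append dates from lower-priority rows that are not yet present
               if !(df2[i, n] in Vector(dfr[r"date"])) && df2[i, n] != "missing"
                   pos > 5 && error("unsupported data")
                   dfr[r"date"][pos] = df2[i, n]
                   pos += 1
               end
           end
           # newest date first
           dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
           return dfr
       end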
2×9 DataFrame
Row │ name Roll id tag date1 date2 date3 date4 date5
│ String String Int64 String String String String String String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Rohit 0001443646-21-000079 1443646 Student 20210331 20200331 20190331 20180331 20170331
2 │ Mohit 0001683168-21-001161 1622879 Teacher 20201231 20191231 missing missing missing
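This is not optimized code; I wrote it while reading your description. I hope it shows you how to accomplish the task.
If you have missing rather than "missing", do, for example: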
julia> df.date4 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
missing
missing
missing
missing
missing
julia> df.date5 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
missing
missing
missing
missing
missing
julia> combine(groupby(df, :name)) do sdf
           # order rows by the middle field of Roll, highest priority first
           df2 = sort(sdf, :Roll, by=x->parse(Int, split(x,"-")[2]), rev=true)
           dfr = first(df2) # top-priority row; a view into df2
           # first free slot among the date columns
           pos = findfirst(ismissing, Vector(dfr[r"date"]))
           for i in 2:nrow(df2), n in names(df2, r"date")
               # append non-missing dates that are not yet present
               if !ismissing(df2[i, n]) && !(any(x -> isequal(df2[i, n], x), dfr[r"date"]))
                   pos > 5 && error("unsupported data")
                   dfr[r"date"][pos] = df2[i, n]
                   pos += 1
               end
           end
           # newest date first
           dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
           return dfr
       end
2×9 DataFrame
Row │ name Roll id tag date1 date2 date3 date4 date5
│ String7 String31 Int64 String7 Int64 Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Rohit 0001443646-21-000079 1443646 Student 20210331 20200331 20190331 20180331 20170331
2 │ Mohit 0001683168-21-001161 1622879 Teacher 20201231 20191231 missing missing missing
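The date handling inside the do-block above can also be read as a small standalone step: collect the non-missing dates of a group, drop duplicates, sort newest first, and pad to five slots. The following is a minimal sketch of that reading in plain Julia, with no packages; merge_dates and its width keyword are hypothetical names introduced here for illustration only.

```julia
# Hypothetical helper: merge the date rows of one group into a single row of
# `width` slots, newest date first, padded with `missing`.
function merge_dates(rows; width=5)
    seen = Int[]
    for row in rows, d in row
        # skip missing entries and dates that were already collected
        (ismissing(d) || d in seen) && continue
        push!(seen, d)
    end
    length(seen) > width && error("more than $width distinct dates")
    sort!(seen, rev=true)
    return vcat(seen, fill(missing, width - length(seen)))
end

# The three Rohit rows from the question, highest-priority Roll first.
rohit = [
    [20190331, 20210331, 20200331, missing, missing],  # Roll ...-21-...
    [20190331, 20200331, 20180331, missing, missing],  # Roll ...-20-...
    [20170331, 20180331, 20190331, missing, missing],  # Roll ...-19-...
]
merged = merge_dates(rohit)

# The two Mohit rows.
mohit = merge_dates([[20201231, 20191231, missing], [20191231, missing, missing]])
```

Because the result is de-duplicated and then sorted, the order in which the rows are visited does not change the merged dates in this sketch; the Roll priority only decides which Roll, id, name and tag values are kept alongside them.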