Julia DataFrame group by and pivot table functions
How do you do group by and pivot tables with Julia DataFrames?
Let's say I have this DataFrame:
using DataFrames
df =DataFrame(Location = [ "NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
Class = ["H","L","H","L","L","H", "H","L","L","M"],
Address = ["12 Silver","10 Fak","12 Silver","1 North","10 Fak","2 Fake", "1 Red","1 Dog","2 Fake","1 White"],
Score = ["4","5","3","2","1","5","4","3","2","1"])
and I want to do the following:
1) A pivot table with Location and Class, which should output
Class     H  L  M
Location
DC        0  0  1
NY        2  1  0
SF        1  2  0
TX        1  2  0
2) Group by "Location" and count the number of records in each group, which should output
    Pop
DC    1
NY    3
SF    3
TX    3
You can use unstack to get most of the way there. It seems to be DataFrames.jl's answer to pivot_table. DataFrames have no index, so Class has to remain a column, whereas in pandas it would be an Index:
julia> unstack(df, :Location, :Class, :Score)
WARNING: Duplicate entries in unstack.
4x4 DataFrames.DataFrame
| Row | Class | H | L | M |
|-----|-------|-----|-----|-----|
| 1 | "DC" | NA | NA | "1" |
| 2 | "NY" | "3" | "2" | NA |
| 3 | "SF" | "5" | "1" | NA |
| 4 | "TX" | "4" | "2" | NA |
I'm not sure how you would do the equivalent of fillna here (unstack doesn't have that option)...
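For readers on a current DataFrames.jl release, here is a minimal sketch of one way to get the zero-filled count table. It assumes DataFrames.jl 1.x, where unstack accepts a `fill` keyword; the names `counts`, `wide`, and the helper column `:n` are illustrative and not from the answers in this thread.

```julia
using DataFrames

df = DataFrame(
    Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
    Class    = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
)

# Count the rows for every (Location, Class) pair, still in long form.
# `:n` is an illustrative name for the count column.
counts = combine(groupby(df, [:Location, :Class]), nrow => :n)

# Spread the Class values into columns. `fill = 0` (assumed available,
# DataFrames.jl 1.x) puts 0 in cells with no matching rows instead of missing.
wide = unstack(counts, :Location, :Class, :n; fill = 0)

# Fix the column order and sort the rows to match the layout asked for.
wide = sort(select(wide, :Location, :H, :L, :M), :Location)
```

Counting first and unstacking second avoids the duplicate-entries warning shown above, because each (Location, Class) pair appears only once in `counts`.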
You can do the group by using by with the nrow (number of rows) method:
julia> by(df, :Location, nrow)
4x2 DataFrames.DataFrame
| Row | Location | x1 |
|-----|----------|----|
| 1 | "DC" | 1 |
| 2 | "NY" | 3 |
| 3 | "SF" | 3 |
| 4 | "TX" | 3 |
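The by function was removed in later DataFrames.jl releases. As a rough sketch of the same count, assuming DataFrames.jl 1.x, groupby plus combine can be used, with the result column named directly. The variable name `pop` is illustrative.

```julia
using DataFrames

df = DataFrame(
    Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
)

# Group by Location (sorted by key) and count rows per group.
# `nrow => :Pop` names the count column Pop instead of a default name.
pop = combine(groupby(df, :Location; sort = true), nrow => :Pop)
```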
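(1) Here is my attempt at creating a pivot table. I use by() to group by one column, then count the frequency of each level of the second column inside the function.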
# Create pivot table from DataFrame.
# - df : DataFrame object
# - column1 : Column symbol used for row labels.
# - column2 : Column symbol used for column labels.
function pivot_table(df, column1, column2)
    # For given DataArray and factor list, create single row DataFrame:
    # ----------------------------------------
    # |    factor1    |    factor2    | ...
    # ----------------------------------------
    # |freq of factor1|freq of factor2| ...
    # ----------------------------------------
    function frequency(data, factors)
        # Convert factors to symbols.
        factor_symbols::Vector{Symbol} = map(factor -> symbol(factor), factors)
        # Convert frequency to fit the DataFrame constructor parameter type.
        frequencies::Vector{Any} = map(frequency->[frequency], map(factor -> sum(data .== factor), factors))
        DataFrame(frequencies, factor_symbols)
    end
    # Sorted unique values of column2 become the output columns.
    factors = sort(unique(df[column2]))
    by(df, column1, x -> frequency(x[column2], factors))
end
Example:
julia> pivot_table(df, :Location, :Class)
4x4 DataFrames.DataFrame
| Row | Location | H | L | M |
|-----|----------|---|---|---|
| 1 | "DC" | 0 | 0 | 1 |
| 2 | "NY" | 2 | 1 | 0 |
| 3 | "SF" | 1 | 2 | 0 |
| 4 | "TX" | 1 | 2 | 0 |
(2) You can use by and nrow.
julia> by(df, :Location, nrow)
4x2 DataFrames.DataFrame
| Row | Location | x1 |
|-----|----------|----|
| 1 | "DC" | 1 |
| 2 | "NY" | 3 |
| 3 | "SF" | 3 |
| 4 | "TX" | 3 |
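The group-then-count idea from part (1) of this answer can also be sketched against a current DataFrames.jl release. This assumes DataFrames.jl 1.x; `pivot_counts` and `levels` are hypothetical names chosen for illustration, not part of any library.

```julia
using DataFrames

# Hypothetical helper: one row per value of `rowcol`, one count column
# per distinct value of `colcol`.
function pivot_counts(df, rowcol, colcol)
    # Sorted distinct values of the second column become the output columns.
    levels = sort(unique(df[!, colcol]))
    combine(groupby(df, rowcol; sort = true)) do sub
        # Build a NamedTuple such as (H = 2, L = 1, M = 0) for this group;
        # combine turns its fields into columns.
        (; (Symbol(l) => count(==(l), sub[!, colcol]) for l in levels)...)
    end
end

df = DataFrame(
    Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
    Class    = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
)

table = pivot_counts(df, :Location, :Class)
```

Because every group reports a count for every level, empty cells come out as 0 directly and no fill step is needed.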
For part 2 of the question, you can use an anonymous function that returns a DataFrame in order to name the new column, e.g. count:
julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1 | "DC" | 1 |
| 2 | "NY" | 3 |
| 3 | "SF" | 3 |
| 4 | "TX" | 3 |
The package FreqTables.jl solves this problem:
>using FreqTables
>show(freqtable(df,:Location,:Class))
4×3 Named Array{Int64,2}
Location ╲ Class │ H L M
─────────────────┼────────
DC │ 0 0 1
NY │ 2 1 0
SF │ 1 2 0
TX │ 1 2 0
Using the pivot(df, rowFields, colField, valuesField; <keyword arguments>) function developed for this SO question, you can do:
julia> df =DataFrame(Location = [ "NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
Class = ["H","L","H","L","L","H", "H","L","L","M"],
Address = ["12 Silver","10 Fak","12 Silver","1 North","10 Fak","2 Fake", "1 Red","1 Dog","2 Fake","1 White"],
Score = ["4","5","3","2","1","5","4","3","2","1"])
First question:
julia> df_piv = pivot(df,[:Location],:Class,:Score,ops=length)
julia> [df_piv[isna(df_piv[i]), i] = 0 for i in names(df_piv)] # remove NA values across whole df
julia> df_piv
4×4 DataFrames.DataFrame
│ Row │ Location │ H │ L │ M │
├─────┼──────────┼───┼───┼───┤
│ 1 │ "DC" │ 0 │ 0 │ 1 │
│ 2 │ "NY" │ 2 │ 1 │ 0 │
│ 3 │ "SF" │ 1 │ 2 │ 0 │
│ 4 │ "TX" │ 1 │ 2 │ 0 │
Second question:
julia> df[:pop]="Pop" # add a dummy column with constant values
julia> pivot(df,[:Location],:pop,:Score,ops=length)
4×2 DataFrames.DataFrame
│ Row │ Location │ Pop │
├─────┼──────────┼─────┤
│ 1 │ "DC" │ 1 │
│ 2 │ "NY" │ 3 │
│ 3 │ "SF" │ 3 │
│ 4 │ "TX" │ 3 │