Idiomatic R for splitting a column into a list/vector of irregular length, in a data frame or equivalent?
I am struggling to implement the following requirement in R.
For example, given the following data frame:
Multiple_rows <- data.frame(rbind(c("FLASH, SWAP.", "Memory: FLASH"), c("FLASH, , ,, SWAP.", "Memory: FLASH")))
colnames(Multiple_rows)<- c("VARIANTS", "STANDARD")
Multiple_rows
# VARIANTS STANDARD
#1 FLASH, SWAP. Memory: FLASH
#2 FLASH, , ,, SWAP. Memory: FLASH
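- For each row, split the value of the VARIANTS column, using "," as the separator.
- For the resulting list of strings, trim the leading and trailing whitespace of each element and filter out the elements that are blank.
- Using the cleaned list, create one row per element with two columns: column STANDARD holds the original STANDARD value of the row being processed, and column VARIANT holds the element in question.
- Collect all of these newly created rows into a new table/data frame.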
So for the example above, I expect the following result:
# VARIANT STANDARD
#1 "FLASH" "Memory: FLASH"
#2 "SWAP." "Memory: FLASH"
#3 "FLASH" "Memory: FLASH"
#4 "SWAP." "Memory: FLASH"
The order of the rows does not matter.
Here is my implementation in Clojure (to illustrate the requirement):
(require 'clojure.string)

;; This is my input. It is equivalent to a data frame with the two columns "STANDARD" and "VARIANTS".
(def Multiple-rows
  [{:VARIANTS "FLASH, SWAP.", :STANDARD "Memory: FLASH"}
   {:VARIANTS "FLASH, , ,, SWAP.", :STANDARD "Memory: FLASH"}])

(defn variants-decomposed [a_map_raw] ;; process one row of the input data
  (if-let [variants (:VARIANTS a_map_raw)]
    (if (clojure.string/blank? variants)
      [{:STANDARD (:STANDARD a_map_raw), :VARIANT nil}]
      (let [standard (:STANDARD a_map_raw)
            splitted (-> (clojure.string/split variants #"[,]")
                         ((fn [list-variant] (map #(clojure.string/trim %) list-variant)))
                         ((fn [list-variant] (filter #(not (clojure.string/blank? %)) list-variant))))]
        (if (seq splitted) ;; not empty
          (for [v splitted] {:STANDARD standard, :VARIANT v})
          [{:STANDARD standard, :VARIANT nil}])))
    [{:VARIANT nil, :STANDARD (:STANDARD a_map_raw)}]))

(defn multiple-variant-maps [map_of_variants] ;; process every row and collect the results
  (-> (map variants-decomposed map_of_variants)
      ((fn [list-of-vectors] (apply concat list-of-vectors)))))

;; This is my required result, equivalent to a data frame with the two columns "STANDARD" and "VARIANT".
(multiple-variant-maps Multiple-rows)
The above evaluates to the following:
({:STANDARD "Memory: FLASH", :VARIANT "FLASH"}
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."}
{:STANDARD "Memory: FLASH", :VARIANT "FLASH"}
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."})
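I would like to do the same thing idiomatically in R. So far I have managed to come up with the following, but it still does not handle irregularities such as blank variants: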
library(stringr)   # str_trim
library(reshape2)  # melt (assumed; the post does not say which package provides it)
dictionary.cleaned <- function(t) {
  variants.splitted <- sapply(data.frame(do.call('rbind', strsplit(t[, "VARIANTS"], "[,]"))), str_trim)
  melted <- melt(data.frame(dplyr::select(t, -VARIANTS), variants.splitted), id.vars = "STANDARD")
  colnames(melted)[colnames(melted) == "value"] <- "VARIANT"
  melted
}
The result of the above R code is as follows:
> dictionary.cleaned(Multiple_rows)
STANDARD variable VARIANT
1 Memory: FLASH X1 FLASH
2 Memory: FLASH X1 FLASH
3 Memory: FLASH X2 SWAP.
4 Memory: FLASH X2
5 Memory: FLASH X3 FLASH
6 Memory: FLASH X3
7 Memory: FLASH X4 SWAP.
8 Memory: FLASH X4
9 Memory: FLASH X5 FLASH
10 Memory: FLASH X5 SWAP.
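I would like to learn to program with lists/vectors more fluently in R: the R equivalents of list comprehension and list concatenation, and how to convert a list expression into a data frame properly.
Or perhaps I need to learn R's paradigm for handling complex or irregular data like this. (R is very elegant at processing neatly structured vectors.)
Or should I simply use the right tool for the job, because lower-level data wrangling like this may not be a good fit for R?
Thanks for any help or pointers!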
Try this:
# remove all spaces from the VARIANTS column
Multiple_rows$VARIANTS <- gsub(" ", "", as.character(Multiple_rows$VARIANTS))
# replace repeated commas with a single comma
Multiple_rows$VARIANTS <- gsub(",+", ",", as.character(Multiple_rows$VARIANTS))
VARIANTS <- unlist(strsplit(Multiple_rows$VARIANTS, ","))
STANDARD <- rep(Multiple_rows$STANDARD,
sapply(strsplit(Multiple_rows$VARIANTS, ","), length))
Multiple_rows <- data.frame(VARIANTS, STANDARD)
# VARIANTS STANDARD
#1 FLASH Memory: FLASH
#2 SWAP. Memory: FLASH
#3 FLASH Memory: FLASH
#4 SWAP. Memory: FLASH
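Note that removing every space also strips spaces inside a variant name. A minimal base R sketch that trims only leading and trailing whitespace, as the question asks, might look like the following. It rebuilds the question's input under the hypothetical name `input` and relies only on base functions (`strsplit`, `trimws`, `nzchar`, `lengths`), so it assumes R 3.2.0 or later.

```r
# Rebuild the question's input (the name `input` is only for this sketch).
input <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP."),
  STANDARD = c("Memory: FLASH", "Memory: FLASH"),
  stringsAsFactors = FALSE
)

# Split on commas, then trim leading/trailing whitespace only.
pieces <- lapply(strsplit(input$VARIANTS, ",", fixed = TRUE), trimws)
# Drop the empty strings left behind by blank fields.
pieces <- lapply(pieces, function(p) p[nzchar(p)])

# Repeat each STANDARD value once per surviving variant.
result <- data.frame(
  VARIANT  = unlist(pieces),
  STANDARD = rep(input$STANDARD, lengths(pieces)),
  stringsAsFactors = FALSE
)
result
```

Because the filtering happens per row before `rep()`, the repeat counts always match the number of variants that survive cleaning.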
Here are two alternatives to consider.
The first uses cSplit from my "splitstackshape" package. It returns a data.table:
library(splitstackshape)
cSplit(Multiple_rows, "VARIANTS", ",", "long")[VARIANTS != ""]
# VARIANTS STANDARD
# 1: FLASH Memory: FLASH
# 2: SWAP. Memory: FLASH
# 3: FLASH Memory: FLASH
# 4: SWAP. Memory: FLASH
The second uses "dplyr" and "tidyr", with "stringi" loaded to trim the strings:
library(dplyr)
library(tidyr)
library(stringi)
Multiple_rows %>%
mutate(VARIANTS = lapply(strsplit(as.character(VARIANTS), ","), stri_trim)) %>%
unnest(VARIANTS) %>%
filter(VARIANTS != "")
# VARIANTS STANDARD
# 1 FLASH Memory: FLASH
# 2 SWAP. Memory: FLASH
# 3 FLASH Memory: FLASH
# 4 SWAP. Memory: FLASH
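On the question's broader point about list comprehension and list concatenation, one possible base R sketch is a per-row function combined with `Map()` and `do.call(rbind, ...)`. The names `decompose_row` and `input` are made up for illustration, and the third input row is an invented example added to show a row with no usable variants, which is kept with `NA` in the same way the Clojure version keeps `:VARIANT nil`.

```r
# One row in, a small data frame out; mirrors the Clojure per-row function.
# `decompose_row` and `input` are hypothetical names for this sketch.
decompose_row <- function(variants, standard) {
  parts <- trimws(strsplit(variants, ",", fixed = TRUE)[[1]])
  parts <- parts[!is.na(parts) & nzchar(parts)]
  # Keep the row even when nothing survives, like :VARIANT nil in Clojure.
  if (length(parts) == 0) parts <- NA_character_
  data.frame(VARIANT = parts, STANDARD = standard, stringsAsFactors = FALSE)
}

input <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP.", " , "),  # third row is invented
  STANDARD = c("Memory: FLASH", "Memory: FLASH", "Disk: NONE"),
  stringsAsFactors = FALSE
)

# Map() plays the role of the list comprehension;
# do.call(rbind, ...) concatenates the per-row results.
rows <- Map(decompose_row, input$VARIANTS, input$STANDARD)
result <- do.call(rbind, unname(rows))
rownames(result) <- NULL
result
```

Growing a data frame with `rbind` is convenient for small inputs; for large tables, the vectorised `rep()` approach or the package-based answers above are likely to be faster.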