Idiomatic R for splitting a column that may be split into a list/vector of irregular length, in a data frame or equivalent?

I am struggling to implement the following requirement in R.

For example, given this data frame:

Multiple_rows <- data.frame(rbind(c("FLASH, SWAP.", "Memory: FLASH"), c("FLASH, , ,, SWAP.", "Memory: FLASH")))
colnames(Multiple_rows)<- c("VARIANTS", "STANDARD")
Multiple_rows
#           VARIANTS      STANDARD
#1      FLASH, SWAP. Memory: FLASH
#2 FLASH, , ,, SWAP. Memory: FLASH
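
The steps I need are:

  1. For each row, split the value of the VARIANTS column, using "," as the separator.
  2. For each element of the resulting list of strings, trim the leading and trailing whitespace, then filter out the blank elements.
  3. For each element of the cleaned list, create a row with two columns: STANDARD holds the original value from the row being processed, and VARIANT holds the element in question.
  4. Collect all of these newly created rows into a new table/data frame.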

So for the example above, I expect the following result:

#   VARIANT STANDARD
#1 "FLASH" "Memory: FLASH"
#2 "SWAP." "Memory: FLASH"
#3 "FLASH" "Memory: FLASH"
#4 "SWAP." "Memory: FLASH"

The order of the rows does not matter.

Here is my implementation in Clojure (to illustrate my requirement):

(def Multiple-rows
  [{:VARIANTS "FLASH, SWAP.", :STANDARD "Memory: FLASH"}
   {:VARIANTS "FLASH, , ,, SWAP.", :STANDARD "Memory: FLASH"}]) ;; This is my input, equivalent to a data frame with the two columns "STANDARD" and "VARIANTS".

(defn variants-decomposed [a_map_raw] ;; process one row of the input data
  (if-let [variants (:VARIANTS a_map_raw)]
    (if (clojure.string/blank? variants)
      [{:STANDARD (:STANDARD a_map_raw), :VARIANT nil}]
      (let [standard (:STANDARD a_map_raw)
            splitted (-> (clojure.string/split variants #"[,]")
                         ((fn [list-variant] (map #(clojure.string/trim %) list-variant)))
                         ((fn [list-variant] (filter #(not (clojure.string/blank? %)) list-variant))))]
        (if (seq splitted) ;; not empty
          (for [v splitted] {:STANDARD standard, :VARIANT v})
          [{:STANDARD standard, :VARIANT nil}])))
    [{:VARIANT nil, :STANDARD (:STANDARD a_map_raw)}]))

(defn multiple-variant-maps [map_of_variants] ;; process every row and collect the results
  (-> (map variants-decomposed map_of_variants)
      ((fn [list-of-vectors] (apply concat list-of-vectors)))))

(multiple-variant-maps Multiple-rows) ;; This is my required result, equivalent to a data frame with the two columns "STANDARD" and "VARIANT".

The above evaluates to:

({:STANDARD "Memory: FLASH", :VARIANT "FLASH"} 
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."} 
{:STANDARD "Memory: FLASH", :VARIANT "FLASH"} 
{:STANDARD "Memory: FLASH", :VARIANT "SWAP."})

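I would like to do the same thing in R in an idiomatic way. So far I have managed the following, but it still does not handle irregularities such as blank variants: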

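library(stringr)   # str_trim
library(reshape2)  # melt

dictionary.cleaned <- function(t) {
    # Split VARIANTS on commas and bind the pieces into columns;
    # rbind recycles the shorter rows, which is where the spurious values come from
    variants.splitted <- sapply(data.frame(do.call('rbind', strsplit(t[, "VARIANTS"], "[,]"))), str_trim)
    melted <- melt(data.frame(dplyr::select(t, -VARIANTS), variants.splitted), id.vars = "STANDARD")

    colnames(melted)[colnames(melted) == "value"] <- "VARIANT"
    melted
}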

The R code above produces the following result:

> dictionary.cleaned(Multiple_rows)
        STANDARD variable VARIANT
1  Memory: FLASH       X1   FLASH
2  Memory: FLASH       X1   FLASH
3  Memory: FLASH       X2   SWAP.
4  Memory: FLASH       X2        
5  Memory: FLASH       X3   FLASH
6  Memory: FLASH       X3        
7  Memory: FLASH       X4   SWAP.
8  Memory: FLASH       X4        
9  Memory: FLASH       X5   FLASH
10 Memory: FLASH       X5   SWAP.

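I would like to learn to program more fluently with lists/vectors in R: the R equivalents of list comprehension and list concatenation, and how to convert a list expression into a data frame properly.

Or perhaps I need to learn R's paradigm for handling such complex or irregular data. (R is very elegant when dealing with neatly structured vectors.)

Or should I use the right tool for the right job, meaning that lower-level data wrangling like this may not be a good fit for R?

Thanks for any help or pointers!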

Try this:

# remove all spaces from the VARIANTS column (including any inside a variant)
Multiple_rows$VARIANTS <- gsub(" ", "", as.character(Multiple_rows$VARIANTS))
# replace repeated commas with a single comma
Multiple_rows$VARIANTS <- gsub(",+", ",", as.character(Multiple_rows$VARIANTS))

VARIANTS <- unlist(strsplit(Multiple_rows$VARIANTS, ","))
STANDARD <- rep(Multiple_rows$STANDARD, 
                sapply(strsplit(Multiple_rows$VARIANTS, ","), length))

Multiple_rows <- data.frame(VARIANTS, STANDARD)
#  VARIANTS      STANDARD
#1    FLASH Memory: FLASH
#2    SWAP. Memory: FLASH
#3    FLASH Memory: FLASH
#4    SWAP. Memory: FLASH
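
A variation on the same idea that stays in base R and trims each piece instead of deleting every space, so a variant with an inner space would survive. This is only a sketch: it assumes R 3.2.0 or later (for `trimws()` and `lengths()`), and the names `input`, `pieces`, and `result` are made up for illustration.

```r
# Rebuild the example input under a new (hypothetical) name.
input <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP."),
  STANDARD = c("Memory: FLASH", "Memory: FLASH"),
  stringsAsFactors = FALSE
)

# Split each VARIANTS value on commas, then trim and drop blanks per row.
pieces <- lapply(strsplit(input$VARIANTS, ",", fixed = TRUE), function(x) {
  x <- trimws(x)
  x[nzchar(x)]
})

# Repeat each STANDARD value once per surviving piece.
result <- data.frame(
  VARIANT  = unlist(pieces),
  STANDARD = rep(input$STANDARD, lengths(pieces)),
  stringsAsFactors = FALSE
)
result
```

Note that a row whose VARIANTS value yields no pieces contributes no rows here, whereas the Clojure version in the question keeps such a row with a nil variant.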

Here are two alternatives to consider.

The first uses cSplit from my "splitstackshape" package. It returns a data.table:

library(splitstackshape)
cSplit(Multiple_rows, "VARIANTS", ",", "long")[VARIANTS != ""]
#    VARIANTS      STANDARD
# 1:    FLASH Memory: FLASH
# 2:    SWAP. Memory: FLASH
# 3:    FLASH Memory: FLASH
# 4:    SWAP. Memory: FLASH

The second uses "dplyr" and "tidyr", with "stringi" loaded to trim the strings:

library(dplyr)
library(tidyr)
library(stringi)

Multiple_rows %>%
  mutate(VARIANTS = lapply(strsplit(as.character(VARIANTS), ","), stri_trim)) %>%
  unnest(VARIANTS) %>%
  filter(VARIANTS != "")
#   VARIANTS      STANDARD
# 1    FLASH Memory: FLASH
# 2    SWAP. Memory: FLASH
# 3    FLASH Memory: FLASH
# 4    SWAP. Memory: FLASH
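
As for the list-comprehension and list-concatenation style the question asks about, the closest base R analogue is probably `lapply()` followed by `do.call(rbind, ...)`. The sketch below follows the four steps from the question row by row. The helper `decompose_row` and the names `input` and `result` are hypothetical, and keeping an `NA` row when nothing survives the cleaning is an assumption borrowed from the Clojure version.

```r
# Rebuild the example input under a new (hypothetical) name.
input <- data.frame(
  VARIANTS = c("FLASH, SWAP.", "FLASH, , ,, SWAP."),
  STANDARD = c("Memory: FLASH", "Memory: FLASH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn row i into a small data frame, one row per variant.
decompose_row <- function(i) {
  parts <- trimws(strsplit(input$VARIANTS[i], ",", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  # Keep the row with a missing variant, as the Clojure version does.
  if (length(parts) == 0) parts <- NA_character_
  data.frame(VARIANT = parts, STANDARD = input$STANDARD[i],
             stringsAsFactors = FALSE)
}

# lapply() plays the role of the comprehension; do.call(rbind, ...) concatenates.
result <- do.call(rbind, lapply(seq_len(nrow(input)), decompose_row))
result
```

This is slower than the vectorized approaches on large inputs because it builds one small data frame per row, but it maps most directly onto the map-then-concat shape of the Clojure code.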