R: split a column in a Big Cartel CSV file into long format in a data frame or data.table

Big Cartel has an option to export orders to a CSV file. However, the structure isn't well suited to the analysis I need to do.

Here is a subset of the columns and rows from a Big Cartel CSV order download (the other columns aren't important for the problem at hand).

Number,Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99

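The Items column can hold multiple "line items", separated by semicolons (;). Each line item has five attributes separated by pipes (|): product_name, product_option_name, quantity, price and total (i.e. the total for that line). There is an "Item count" column giving the number of line items, plus columns for the order-level total price, shipping, tax and discount. For the analysis I want the data in the following long format, where shipping, tax and discount are also treated as 'product items'.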

Number Buyer name line-item    product_option_name quantity price total
1      jim        Plate        Red                 1        9.99  9.99
1      jim        shipping                         1        1.99  1.99
1      jim        tax                              0        0     0
1      jim        discount                         0        0     0
2      bill       Plate        Green               1        9.99  9.99
2      bill       Plate        Blue                1        9.99  9.99
2      bill       shipping                         1        3     3
2      bill       tax                              0        0     0
2      bill       discount                         0        0     0
3      jane       Plate        Red                 1        6.99  6.99
3      jane       Thingy                           1        9.99  9.99
3      jane       Mug          Grey                1        10.99 10.99
3      jane       Cup          Grey                1        9.99  9.99
3      jane       Saucer       Grey                1        9.99  9.99
3      jane       Stopper                          1        9.99  9.99
3      jane       shipping                         1        7     7
3      jane       tax                              0        0     0
3      jane       discount                         0        0     0
4      dale       Plate        Green               1        10.99 10.99
4      dale       shipping                         1        4.99  4.99
4      dale       tax                              0        0     0
4      dale       discount                         0        -1.99 -1.99
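
To make the nesting of the Items column concrete, here is a minimal base R sketch that unpacks a single Items cell (copied from order 2 above) into one row per line item. The helper name parse_items is hypothetical and not part of any package; it assumes every field has the form key:value and that neither ";" nor "|" occurs inside a value.

```r
# One "Items" cell, copied from order 2 in the sample above
items <- paste0(
  "product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;",
  "product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99"
)

# Hypothetical helper: turn one Items string into a data frame,
# one row per line item and one column per attribute
parse_items <- function(x) {
  # ";" separates line items
  line_items <- strsplit(x, ";", fixed = TRUE)[[1]]
  rows <- lapply(line_items, function(li) {
    # "|" separates the key:value fields of one line item
    fields <- strsplit(li, "|", fixed = TRUE)[[1]]
    keys   <- sub(":.*$", "", fields)     # text before the first colon
    values <- sub("^[^:]*:", "", fields)  # text after the first colon
    as.data.frame(as.list(setNames(values, keys)), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

parsed <- parse_items(items)
print(parsed)
```

All columns come back as character, so quantity, price and total would still need as.numeric() before any arithmetic.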

Using tstrsplit() from data.table and cSplit() from splitstackshape looks like the way to go, but I can't get the syntax right. I also tried the tidyverse/dplyr functions separate/spread etc., but I can't get the output I need.

I've been googling and searching through SO questions. A few solutions come close, but none quite fits, as most of them target a 'wide' format rather than 'long'.

Something like this might get you what you're after.

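library(dplyr)
library(tidyr)
library(stringr)

# Placeholder: replace with the path to your Big Cartel export
filepath <- "path/to/orders.csv"

# read.csv() turns spaces in headers into dots (e.g. "Buyer name" -> Buyer.name)
df <- read.csv(filepath, stringsAsFactors = FALSE)

# One temporary column per line item, sized to the order with the most items
cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))

df <- df %>%
      # Split Items on ";" and stack the pieces into one row per line item
      separate(col = Items, into = cols, sep = ";", fill = "right") %>%
      pivot_longer(all_of(cols), names_to = "column", values_to = "details", values_drop_na = TRUE) %>%
      select(-column) %>%
      # Split each line item on "|" into its five attributes
      separate(col = details, into = c("product_name", "product_option_name", "quantity", "price", "total"), sep = "\\|", fill = "right") %>%
      # Drop the "key:" prefix from every attribute
      mutate(product_name = sub("^[^:]*:", "", product_name),
             product_option_name = sub("^[^:]*:", "", product_option_name),
             quantity = sub("^[^:]*:", "", quantity),
             price = sub("^[^:]*:", "", price),
             total = sub("^[^:]*:", "", total)) %>%
      # Stack the product name and the three order-level totals into line/item pairs;
      # gather() coerces the mixed numeric/character values to character
      gather("line", "item", c(product_name, Total.shipping, Total.tax, Total.discount)) %>%
      mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
             line_item = ifelse(line == "product_name", item, sub("^.*\\.", "", line)),
             price = as.numeric(ifelse(line == "product_name", price, item)),
             # Discounts are stored as positive amounts, so flip the sign
             price = ifelse(line == "Total.discount", -price, price),
             # Order-level rows count once when non-zero, regardless of product quantity
             quantity = ifelse(line == "product_name", as.numeric(quantity), as.numeric(price != 0)),
             total = price * quantity) %>%
      # Each order-level row was repeated once per line item; keep one copy
      distinct() %>%
      select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
      arrange(Number)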
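
Since the question mentions data.table, the splitting steps can probably be done there as well. The following is only a sketch under assumptions: the column names Number, Buyer.name and Items mimic what read.csv() would produce, the two orders are copied from the question's sample, and the order-level shipping, tax and discount rows are not handled here.

```r
library(data.table)

# Two sample orders copied from the question (assumed column names)
orders <- data.table(
  Number = c(1L, 2L),
  Buyer.name = c("jim", "bill"),
  Items = c(
    "product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99",
    paste0(
      "product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;",
      "product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99"
    )
  )
)

# Step 1: one row per line item (split on ";" within each order)
long <- orders[, .(item = unlist(strsplit(Items, ";", fixed = TRUE))),
               by = .(Number, Buyer.name)]

# Step 2: split each line item on "|" into five columns
fields <- c("product_name", "product_option_name", "quantity", "price", "total")
long[, (fields) := tstrsplit(item, "|", fixed = TRUE)]

# Step 3: strip the "key:" prefix from every field, then drop the raw string
long[, (fields) := lapply(.SD, function(v) sub("^[^:]*:", "", v)), .SDcols = fields]
long[, item := NULL]

print(long)
```

The fields are still character at this point; quantity, price and total would need converting with as.numeric() before totals are computed.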