R: split a column in a Big Cartel CSV file into long format in a data frame or data.table
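Big Cartel has an option to export orders to a CSV file. However, the structure is not well suited to the analysis I need to do.
Here is a subset of the columns and rows from a Big Cartel CSV order download (the other columns are not relevant to the problem at hand).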
Number, Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99
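The Items column can hold multiple "line-items", separated by semicolons (;). Each "line-item" has five attributes separated by pipes (|): product_name, product_option_name, quantity, price and total (i.e. the total for that line). There is an "Item count" column giving the number of "line-items", plus columns for the (order) total price, shipping, tax and discount. For the analysis I would like the data in the following long format, where shipping, tax and discount are also treated as 'product items'.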
Number  Buyer name  line-item  product_option_name  quantity  price  total
1       jim         Plate      Red                  1         9.99   9.99
1       jim         shipping                        1         1.99   1.99
1       jim         tax                             0         0      0
1       jim         discount                        0         0      0
2       bill        Plate      Green                1         9.99   9.99
2       bill        Plate      Blue                 1         9.99   9.99
2       bill        shipping                        1         3      3
2       bill        tax                             0         0      0
2       bill        discount                        0         0      0
3       jane        Plate      Red                  1         6.99   6.99
3       jane        Thingy                          1         9.99   9.99
3       jane        Mug        Grey                 1         10.99  10.99
3       jane        Cup        Grey                 1         9.99   9.99
3       jane        Saucer     Grey                 1         9.99   9.99
3       jane        Stopper                         1         9.99   9.99
3       jane        shipping                        1         7      7
3       jane        tax                             0         0      0
3       jane        discount                        0         0      0
4       dale        Plate      Green                1         10.99  10.99
4       dale        shipping                        1         4.99   4.99
4       dale        tax                             0         0      0
4       dale        discount                        0         -1.99  -1.99
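Using tstrsplit() from data.table and cSplit() from splitstackshape seems to be the solution, but I can't get the syntax right. I have also tried the tidyverse/dplyr functions separate/spread etc., but I can't get the output I need.
I have been googling and searching through SO questions. Some solutions (such as this one) come close, but none quite fits, because most assume a 'wide' format rather than 'long'.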
Something like this might get you what you're looking for.
library(dplyr)
library(tidyr)
library(stringr)

filepath <- "path/to/orders.csv"  # placeholder: replace with the path to your data file
df <- read.csv(filepath, stringsAsFactors = FALSE)

# One temporary column per line item, sized to the largest order
cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))

df <- df %>%
  # Split Items on ";" into the temporary columns, then stack them into rows
  separate(col = Items, into = cols, sep = ";", fill = "right") %>%
  gather("column", "details", all_of(cols), na.rm = TRUE) %>%
  select(-column) %>%
  # Split each line item on "|" (escaped, because sep is a regular expression)
  separate(col = details,
           into = c("product_name", "product_option_name", "quantity", "price", "total"),
           sep = "\\|", fill = "right") %>%
  # Strip the "key:" prefix from every attribute
  mutate(product_name = sub("^.*:", "", product_name),
         product_option_name = sub("^.*:", "", product_option_name),
         quantity = sub("^.*:", "", quantity),
         price = sub("^.*:", "", price),
         total = sub("^.*:", "", total)) %>%
  # Stack shipping, discount and tax alongside the products as extra line items
  gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
  mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
         # "Total.shipping" becomes "shipping", and so on
         line_item = ifelse(line == "product_name", item, sub("^.*\\.", "", line)),
         price = ifelse(line == "product_name", price, item),
         # Discounts count as negative amounts
         price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
         quantity = ifelse(line_item %in% c("shipping", "discount", "tax") & price == "0", 0, quantity),
         total = as.numeric(price) * as.numeric(quantity)) %>%
  # Orders with several products repeat their shipping/tax/discount rows; drop the repeats
  distinct() %>%
  select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
  arrange(Number)
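If you would rather skip the wide intermediate step, here is a rough base R sketch of the same idea: split each Items string on ";" and "|", then append shipping, tax and discount as extra rows. The helpers parse_items() and fee_rows() are names made up for this illustration, the two sample orders are inlined from the question so the snippet runs on its own, and giving every non-zero fee a quantity of 1 is my assumption rather than something the export specifies.

```r
# Two orders from the question, inlined so the sketch is self-contained.
csv_text <- 'Number,Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99'
orders <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Hypothetical helper: one Items string -> one row per line item.
parse_items <- function(items) {
  line_items <- strsplit(items, ";", fixed = TRUE)[[1]]
  rows <- lapply(line_items, function(li) {
    fields <- strsplit(li, "|", fixed = TRUE)[[1]]
    # Drop the key and the first colon: "price:9.99" -> "9.99"
    values <- sub("^[^:]*:", "", fields)
    data.frame(line_item = values[1],
               product_option_name = values[2],
               quantity = as.numeric(values[3]),
               price = as.numeric(values[4]),
               total = as.numeric(values[5]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Hypothetical helper: shipping, tax and discount of one order as line items.
fee_rows <- function(order) {
  amounts <- c(shipping = order$Total.shipping,
               tax = order$Total.tax,
               discount = -order$Total.discount)  # discounts are negative
  qty <- as.numeric(amounts != 0)                 # assumption: quantity 1 unless the fee is zero
  data.frame(line_item = names(amounts),
             product_option_name = NA_character_,
             quantity = qty,
             price = unname(amounts),
             total = unname(amounts) * qty,
             stringsAsFactors = FALSE)
}

# Build the long table order by order, then stack the pieces.
long <- do.call(rbind, lapply(seq_len(nrow(orders)), function(i) {
  order <- orders[i, ]
  lines <- rbind(parse_items(order$Items), fee_rows(order))
  data.frame(Number = order$Number, Buyer.name = order$Buyer.name, lines,
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
print(long)
```

A quick sanity check on the result is that the total column of each order sums to that order's "Total price" from the export. Note that this sketch would break if a product name itself contained ";" or "|".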