Spreading columns in R runs out of memory

Question

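I have a survey table and need to collapse the dataset into one row per respondent, but I'm running into problems using spread and group_by.

My dataset has the following format: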

country date_   user_id int_id  user_name   ext_name    q_order questions   answers
AR  2019    AR-100  XP200   jhon foo    damian, khon    1   Question1 … yes
AR  2019    AR-100  XP200   jhon foo    damian, khon    2   Question2 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    3   Question3 … no apply
AR  2019    AR-100  XP200   jhon foo    damian, khon    4   Question4 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    5   Question5 … 0
AR  2019    AR-100  XP200   jhon foo    damian, khon    6   Question6 … yes
US  2018    US-100  PP300   Peter fields    jhon voigh  1   Question1 … no
US  2018    US-100  PP300   Peter fields    jhon voigh  2   Question2 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  3   Question3 … yes apply
US  2018    US-100  PP300   Peter fields    jhon voigh  4   Question4 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  5   Question5 … 0
US  2018    US-100  PP300   Peter fields    jhon voigh  6   Question6 … no

I tried grouping the resulting dataset, but I always get 14 rows instead of 2.

Code:

data %>% 
    group_by(country=.$country  ,
             date_ = .$date_,
             medic_id=.$user_id,
             user_id= .$int_id,
             user_name= .$user_name,
             ext_name= .$ext_name,
             q_order=.$q_order
             ) %>% 
    spread(questions, answers)

The code above runs out of memory.

I even tried dcast:

data %>% 
    select(-q_order) %>% 
    dcast( ...  ~ questions, value.var = "answers")

This is what I get:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    1   2   0   1   1   1
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  0   1   1   2   1   2

But I need:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no

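Why does dcast convert the values of the answers variable to numbers? (I even tried var.values='answers'.)

My question is very similar to this one!

But I can't get it to work: it always either runs out of memory or produces numbers instead of the values of the answers variable.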

Answer 1

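Finally found the answer!

The problem (I'm new to R) was that I wanted to turn the values of certain columns into a single row, but those values are character strings, and most solutions deal with numbers rather than characters.

On the other hand, my solution worked fine with reshape on a 5-row sample, but on the real (small to medium) dataset I ran out of memory and it never finished.

For example, the following code never finishes (and yes, I also tried grouping, as I said):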

b<-reshape(data=a %>% select(-q_order) ,
           direction="wide",
           idvar = c("Country.Code","Created.Date", "user_id", "int_id", "user_name",
                     "ext_name"),
           timevar="questions" )

This solution runs in 2 seconds:

b<-dcast( a, Country.Code+Created.Date+user_id+int_id +user_name+ ext_name ~ questions,
          toString, value.var="answers")
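
A likely reason the earlier dcast call returned numbers: when more than one row maps to the same output cell and no aggregation function is supplied, reshape2's dcast falls back to length and returns counts. The sketch below reproduces this on a made-up three-row data frame (the `toy` data and its values are assumptions for illustration, not the real survey) and shows toString keeping the text instead.

```r
library(reshape2)

# Hypothetical toy data: user AR-100 has a duplicated Question1 row
toy <- data.frame(
  user_id   = c("AR-100", "AR-100", "AR-100"),
  questions = c("Question1", "Question1", "Question2"),
  answers   = c("yes", "yes", "0"),
  stringsAsFactors = FALSE
)

# No aggregation function given: dcast defaults to length(), so cells hold row counts
counts <- dcast(toy, user_id ~ questions, value.var = "answers")

# toString collapses every answer in a cell into one comma-separated string
texts <- dcast(toy, user_id ~ questions, fun.aggregate = toString,
               value.var = "answers")

print(counts)  # Question1 is a count of rows, not an answer
print(texts)   # Question1 keeps the text of both duplicate rows
```

Note that toString joins duplicates with ", ", so a cell that really has several different answers will show all of them; it may be worth checking the real data for such cells.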

Finally:

Country.Code    Created.Date    user_id int_id  user_name   ext_name    Question1 … Question2 … Question3 … Question4 … Question5 … Question6 …
AR  3/28/2019   AR-100  XP200   jhon foo    damian, khon    yes 0   no apply    0   0   yes
US  4/28/2019   US-100  PP300   Peter fields    jhon voigh  no  0   yes apply   0   0   no
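
For readers who prefer to stay in the tidyverse, the same reshaping can probably be done with tidyr's pivot_wider, which supersedes spread. This is only a sketch under assumptions: the miniature data frame `a` below is invented for illustration, and passing a single function to `values_fn` assumes tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the survey: two respondents, two questions each
a <- data.frame(
  user_id   = c("AR-100", "AR-100", "US-100", "US-100"),
  q_order   = c(1, 2, 1, 2),
  questions = c("Question1", "Question2", "Question1", "Question2"),
  answers   = c("yes", "0", "no", "0"),
  stringsAsFactors = FALSE
)

wide <- a %>%
  select(-q_order) %>%          # q_order differs on every row and would prevent collapsing
  pivot_wider(
    names_from  = questions,    # one output column per question
    values_from = answers,      # cell contents come from the answers column
    values_fn   = toString      # join duplicate answers, mirroring the dcast call
  )

print(wide)  # one row per user_id
```

Dropping q_order before widening matters here: any column that varies within a respondent becomes part of the row identity and keeps the rows from collapsing.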
