R split() function size increase issue

Question

I have the following dataset:

> head(data)
  X    UserID NPS V3 V4 V5                                   Event              V7          Element                            ElementValue 
1 1 254727216  10  0 19 10 nps.agent.14b.no other attempt was made 10/4/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
2 2 298379949   0  0 28 11 nps.agent.14b.no other attempt was made 9/30/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
3 3 254710917   0  0 20 12 nps.agent.14b.no other attempt was made 9/15/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
4 4 238919392   7  0 17  9 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
5 5 144693025  10  0 18 10 nps.agent.14b.no other attempt was made 9/17/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made
6 6 249978568   5  0 21 12 nps.agent.14b.no other attempt was made 9/18/2014 23:59 cea.element_name nps.agent.14b.no other attempt was made

I split the dataset like this:

data_splitted <- split(data,data$UserID)

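The problem is that when I try this on the whole dataset rather than on this sample, the huge increase in size exceeds my RAM. Even on the sample the growth is large: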

> format(object.size(data),units="Mb")
[1] "0.2 Mb"
> format(object.size(data_splitted),units="Mb")
[1] "45.7 Mb"

Any insight into why this happens, and any way to work around it, would be greatly appreciated.

Answer 1

Try this:

data$UserID <- as.character(data$UserID)
data_splitted <- split(data,data$UserID)

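In your case this happens because the ID is numeric and that number gets used as an index (position) in the list being created, which is obviously not what you want. Since the ID values are very large, R fills the gaps with as many empty lists as necessary (hence the large object size). Making the id a character variable avoids this.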
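Whatever the underlying cause, a quick way to check whether empty groups account for the size is to compare the number of list elements with the number of IDs that actually occur. The sketch below uses a small made-up data frame (`toy`, with invented values, not the asker's data) and assumes `UserID` is stored as a factor that still carries unused levels, which the question does not confirm. `drop = TRUE` is a documented argument of `split()` that discards groups with no rows.

```r
# Toy stand-in for the asker's data; IDs reuse the sample, the extra
# "unused" levels are invented to mimic a factor with levels that never occur.
ids <- c("254727216", "298379949", "254710917")
toy <- data.frame(
  UserID = factor(ids, levels = c(ids, paste0("unused", 1:1000))),
  NPS    = c(10, 0, 0)
)

with_empty    <- split(toy, toy$UserID)               # one element per factor level
without_empty <- split(toy, toy$UserID, drop = TRUE)  # only levels that occur

length(with_empty)       # number of groups, including empty ones
length(without_empty)    # number of groups that contain rows

# Count the zero-row data frames hiding in the first result.
n_empty <- sum(vapply(with_empty, nrow, integer(1)) == 0)
n_empty

# Compare the footprint of the two lists.
format(object.size(with_empty), units = "Kb")
format(object.size(without_empty), units = "Kb")
```

If `n_empty` is large on the real data, either `drop = TRUE` or the `as.character()` conversion above should shrink the result.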

Another way, which leaves the id variable as it is and stores one-row data frames, would be:

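data_splitted <- list()
# seq_len() is safe even if data has zero rows; a repeated UserID overwrites the earlier row
for(i in seq_len(nrow(data)))
  data_splitted[[as.character(data$UserID[i])]] <- data[i,]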

To access elements of the newly created list, you need to quote the number if you use the $ operator:

data_splitted$"144693025"
data_splitted[["144693025"]]

Another option is to prefix the numeric IDs with a character string. For example:

data$UserID <- paste0("id",data$UserID)
data_splitted <- split(data,data$UserID)

This makes accessing list items more convenient:

data_splitted$id144693025
data_splitted$id238919392

Answer 2

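If you have a lot of similar strings, use factors rather than strings. (If you don't need to process their contents, don't store them at all, or store only, say, the hostname, again as a factor. You can use grep with a regular expression and keep only the captured fields, e.g. hostname and error code, discarding everything else.)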
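To see what the factor suggestion buys, here is a rough comparison, assuming a typical 64-bit R build. The vector `msgs` is invented for illustration (the second message does not come from the question), and the exact sizes will vary by platform, so none are quoted here.

```r
# Invented example: two distinct messages repeated many times.
msgs <- rep(c("nps.agent.14b.no other attempt was made",
              "nps.agent.14b.some other message"),  # hypothetical second message
            times = 50000)

as_chr <- msgs          # character vector: one pointer per element
as_fct <- factor(msgs)  # factor: one integer code per element plus the levels

nlevels(as_fct)         # number of distinct strings actually stored

format(object.size(as_chr), units = "Kb")
format(object.size(as_fct), units = "Kb")
```

The saving comes from storing a small integer code per row instead of a pointer to a string; the distinct strings themselves are kept once, as the factor's levels.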

Next, make your splitting life easy by changing or post-processing your log file, from:

nps.agent.14b.no other attempt was made

to:

nps.agent.14b:no other attempt was made

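Now you only have to split on ':' (or '|'). Look up some best practices for log files; a lot of good material has been written on the subject. If every line is guaranteed to have one and only one hostname and one error code, you can store them as separate hostname and error-code fields.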

So your code should be as simple as:

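> s <- "nps.agent.14b:no other attempt was made"
> as.factor(strsplit(s, ':', fixed = TRUE)[[1]])
[1] nps.agent.14b             no other attempt was made
Levels: no other attempt was made nps.agent.14b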
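For a whole vector of lines rather than a single string, one possible sketch is the following. It assumes every line contains exactly one ':' separator, as in the reformatted example above; `log_lines` and the column names `hostname` and `error` are made up for illustration.

```r
# Hypothetical log lines in the colon-delimited format.
log_lines <- c("nps.agent.14b:no other attempt was made",
               "nps.agent.14b:no other attempt was made")

# fixed = TRUE treats ":" as a literal string, not a regular expression.
parts <- strsplit(log_lines, ":", fixed = TRUE)

# Fail early if any line does not split into exactly two fields.
stopifnot(all(lengths(parts) == 2))

# Store each field once per distinct value by making the columns factors.
fields <- data.frame(
  hostname = factor(vapply(parts, `[`, character(1), 1)),
  error    = factor(vapply(parts, `[`, character(1), 2))
)

str(fields)
```

The `stopifnot()` check is there because a line with a stray ':' in the message would otherwise be silently truncated to its first two pieces.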

Again, if you don't need to process 'no other attempt was made', don't store it. Or your log-file messages could abbreviate it to 'NEA'. Or, if it conveys no extra information, throw it away.

I suggest you revisit your log-file format and aggressively make it as concise and informative as possible.

memory

text-processing

r

logfile-analysis

categorical-data