How to rearrange your data into an array for a PARAFAC model from the PTAk package in R

Question

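I originally ran a PCA to reduce a large number of related measures (> 10 behaviours) to fewer variables (from the PCA I used the first and second principal components). But this is not appropriate because we have repeated measures from the same individuals over time (Budaev 2010, pg. 6: "Using multiple measures from the same individuals as independent while computing the correlation matrix is pseudoreplication and is incorrect."). Because of this, it was recommended that I use a PARAFAC model instead of PCA (available through the PTAk package in R); see Leibovici (2010).

My data are stored as a data.frame object in which each row represents one sampling record of an individual. An individual can be sampled multiple times within a year and across its lifetime.

A sample of my data (data available here):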

individual  beh1   beh2     beh3   beh4    year
11979       0      0.0333   0      0       2014
12026       0.176  0.0882   0.441  0.0882  2014
12435       0.405  0.189    0      0.243   2014
12524       0      0        1      0       2014
12625       0      0        0      0       2014
12678       0      0        0      0       2014

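To use the PTAk package, the data need to be converted into an array. The code to do this is:

my_df <- array(as.vector(as.matrix(subset_data)), c(x, y, z))

where x is the number of rows, y is the number of columns, and z is the number of arrays.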

My general question:

Which components of my data.frame should correspond to which measures in the array?

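My initial guess is that x should correspond to the number of individuals sampled (i.e., the number of rows in the original data.frame), but I am not sure what the y and z components should be.

Like this: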

my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 4, 9))

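where x is 5393 individuals, y is the number of variables (e.g., 4 behaviours), and z is the number of years (9 years).

This generates 9 arrays, with each individual's record as a row and each variable as a column (identifier, 4 behaviours, and year of sampling). In theory each array would correspond to one year of sampling, but that is currently not the case.

My question in detail: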

If this is the correct formatting for my array, how do I ensure that only one year of sampling data is included in each array (i.e., only samples from 2008 are in array 1, only 2009 in array 2, etc.)?

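Alternatively, if my formatting is incorrect, what is the correct array format for my data and question?

For example, should I instead group the data into arrays by behaviour (beh1, beh2, etc.), so that the code would look like:

my_df <- array(as.vector(as.matrix(subset_data)), c(5393, 3, 4))

Would each array then have three columns corresponding to the identifier, the behaviour value, and the year of observation? If this is the correct format, how do I ensure that the arrays are split by behaviour rather than by the identifier and/or year columns?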

Answer 1

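First, in your subset_data the variables individual and year need to be dropped (or used as row names), because they are only identifiers. Otherwise as.vector(as.matrix(subset_data)) mixes them in with the data. So use as.vector(as.matrix(subset_data[, -c(1, 6)])), where 1 and 6 are the positions of individual and year in the sample shown.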
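To make the point about the identifier columns concrete, here is a minimal sketch. It uses only the first two rows and the first two behaviours of the sample shown in the question, and it selects the behaviour columns by name rather than by position (my choice for illustration, not something the question requires).

```r
# Two rows taken from the sample in the question (beh3 and beh4 omitted for brevity)
subset_data <- data.frame(
  individual = c(11979, 12026),
  beh1       = c(0, 0.176),
  beh2       = c(0.0333, 0.0882),
  year       = c(2014, 2014)
)

# as.vector() on a data.frame does not give a numeric vector: the result is still a list
is.list(as.vector(subset_data))

# Keep only the behaviour columns, convert to a matrix, then flatten.
# The matrix is read column by column: all of beh1 first, then all of beh2.
behaviours <- c("beh1", "beh2")
values <- as.vector(as.matrix(subset_data[, behaviours]))
values
```

Selecting by name avoids having to remember which column positions hold the identifiers.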

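Then look at this small example: A <- matrix(1:6, nrow = 2, ncol = 3)

as.vector(A) is [1] 1 2 3 4 5 6

So imagine this as 2 individuals and 3 behaviours.

When A is built, the index along dim(A)[1] (2) runs faster than the index along dim(A)[2] (3), and the same holds when extending to an array.

So now suppose there are 4 years and X[,,1] is your first year A: X <- array(0, c(2,3,4)); X[,,1] <- A; X[,,2] <- A*2; X[,,3] <- A*10; X[,,4] <- A/10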

Note that this could be one way of building your my_df:

my_df[, , 1] <- as.matrix(subset_data[subset_data$year == 2014, -c(1, 6)])

and so on for the other years.
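As a rough sketch of that year-by-year filling, assuming every individual appears exactly once in every year (which, as discussed further down, is probably not true of the real data): the 2014 values come from the sample in the question, while the 2015 values are made up purely for illustration.

```r
# Toy data: 2 individuals x 2 behaviours x 2 years.
# The 2015 rows are hypothetical values, invented for this example.
subset_data <- data.frame(
  individual = rep(c(11979, 12026), times = 2),
  beh1       = c(0, 0.176, 0.1, 0.2),
  beh2       = c(0.0333, 0.0882, 0.3, 0.4),
  year       = rep(c(2014, 2015), each = 2)
)

behaviours <- c("beh1", "beh2")
ids   <- sort(unique(subset_data$individual))
years <- sort(unique(subset_data$year))

# Empty array: individual x behaviour x year
my_df <- array(
  NA_real_,
  dim = c(length(ids), length(behaviours), length(years)),
  dimnames = list(as.character(ids), behaviours, as.character(years))
)

for (k in seq_along(years)) {
  one_year <- subset_data[subset_data$year == years[k], ]
  # Put the individuals in the same order in every slice
  # (assumes one row per individual per year)
  one_year <- one_year[match(ids, one_year$individual), ]
  my_df[, , k] <- as.matrix(one_year[, behaviours])
}

my_df[, , "2014"]
```

With dimnames set, each slice can be checked by year label, which makes it easy to confirm that only one year of data ended up in each slice.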

My point is that as.vector(X) is then

1 2 3 4 5 6 2 4 6 8 10 12 ...

so the first year, then the second year, and so on...
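The toy example can be run as is to see the ordering. This sketch only restates the A and X defined above; the name v is introduced here for convenience.

```r
# 2 individuals x 3 behaviours for the first year
A <- matrix(1:6, nrow = 2, ncol = 3)

# individual x behaviour x year, with 4 years
X <- array(0, dim = c(2, 3, 4))
X[, , 1] <- A
X[, , 2] <- A * 2
X[, , 3] <- A * 10
X[, , 4] <- A / 10

# Flattening runs over individuals first, then behaviours, then years
v <- as.vector(X)
v[1:12]

# Element (i, j, k) sits at position i + 2 * (j - 1) + 6 * (k - 1)
X[2, 3, 4] == v[2 + 2 * (3 - 1) + 6 * (4 - 1)]
```

So array() only puts one year in each slice if the flattened input is already ordered year by year, with the behaviours grouped inside each year.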

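So to get back to (or in fact to start from) an individual x variable matrix, you need to permute the data: AA <- matrix(aperm(X, c(1,3,2)), nrow = 8, ncol = 3). Basically 8 is 2 individuals times 4 years, and there are 3 variables...

So if you start from that matrix AA, your array would be array(AA, dim = c(2,4,3)), i.e. individual x year x variable.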
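A quick round trip on the toy data shows that the two layouts carry the same information. The names Xp, X2 and X_back are introduced here only for the check.

```r
A <- matrix(1:6, nrow = 2, ncol = 3)
# individual x variable x year
X <- array(c(A, A * 2, A * 10, A / 10), dim = c(2, 3, 4))

# Permute to individual x year x variable
Xp <- aperm(X, c(1, 3, 2))

# Stack into a long matrix: rows are year blocks, individuals inside each block
AA <- matrix(Xp, nrow = 8, ncol = 3)
AA[3, ]  # individual 1 in year 2, all three variables

# Going from the long matrix back to a 3-way array
X2 <- array(AA, dim = c(2, 4, 3))
isTRUE(all.equal(X2, Xp))

# And back to the original individual x variable x year layout
X_back <- aperm(X2, c(1, 3, 2))
isTRUE(all.equal(X_back, X))
```

The row order of AA matters: the rows have to come in year blocks, with the individuals in the same order inside every block.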

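Therefore: AA <- as.matrix(subset_data[, -c(1, 6)])

You would then say array(AA, dim = c(nb_indi_repeated, 9, 4)) for 9 years and 4 variables, with the rows of AA sorted by year and, within each year, by individual... but 5393/9 is not a whole number, so it looks like you do not have exactly complete repeats for all individuals. So you need to select a 'best sample' of repeated individuals to define the years and the selected individuals, or estimate the missing values, or do something completely different! One such alternative is to define the repeats not by year but by the sequence of repeated measures, the next one falling in the same year or a later one...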
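One possible way to pick such a 'best sample' is sketched below, under assumptions of my own: repeated samples of an individual within a year are averaged, and only individuals observed in every year are kept. The toy values are invented, and whether averaging is sensible for your behaviours is for you to judge.

```r
# Hypothetical unbalanced data: individual 1 is sampled twice in 2014,
# individual 3 is never sampled in 2015
df <- data.frame(
  individual = c(1, 1, 2, 2, 3, 1),
  year       = c(2014, 2015, 2014, 2015, 2014, 2014),
  beh1       = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.3),
  beh2       = c(1, 2, 3, 4, 5, 3)
)
behaviours <- c("beh1", "beh2")

# For each behaviour, an individual x year table of means (NA where never sampled),
# stacked into an individual x year x variable array
X <- sapply(
  behaviours,
  function(b) tapply(df[[b]], list(df$individual, df$year), mean),
  simplify = "array"
)

# Keep only the individuals with a value for every year and every variable
complete <- apply(X, 1, function(s) !anyNA(s))
X_complete <- X[complete, , , drop = FALSE]

dim(X_complete)
dimnames(X_complete)[[1]]
```

The result is already individual x year x variable, so no further reshaping is needed. The cost is that individuals with any missing year are discarded, which may remove a large part of the 5393 records.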
