Normalization/transformation prior to PCA with Box-Cox

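Before computing a PCA, I need to normalize my data. I have a matrix whose row names give the disease group (0 is control, 1 is ulcerative colitis, 2 is Crohn's disease). The rest of the data are gene expression values.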

I tried a log transformation, but it did not normalize the data (confirmed by plotting histograms of some columns and by the Anderson-Darling test).

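Update: I am now trying a Box-Cox transformation. I am not sure how to convert my matrix of values to a linear model class before using the following (where lm would be replaced by my data). I know the lm formula must have the form response ~ terms, where terms specifies a linear predictor for the response.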

    library(MASS)  # provides boxcox()
    bc <- boxcox(Gene1 ~ 1, lambda = seq(-2, 2))  # as suggested in comments

I am not sure whether the terms variable needs to be changed to disease (after adding a disease column to the data).

    bc <- boxcox(Gene1 ~ disease, lambda = seq(-2, 2))

    # lambda with the highest profile log-likelihood
    best.lam <- bc$x[which.max(bc$y)]

There are 27 rows and 13 columns. How can I easily apply the transformation to every column in the dataset?

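First, I am not sure how to quickly turn each column into a linear model. The ?lm help page states that if the response is a matrix, a linear model is fitted to each column separately, which suggests using model.matrix before computing boxcox. However, I could not find an example of this online or in the R help.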

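Second, I am not sure how I would quickly transform the y values of each column with its corresponding lambda (perhaps with a for loop or one of the apply functions).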

Please find my new data below. The real dataset contains more than 600 genes and 190 rows. Any further help would be greatly appreciated.

     structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
     7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
     1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
     0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
     8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
     0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
     0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
     0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
     0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
     0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
     0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
     0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
     0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
     0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
     0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
     0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
     0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
     0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
     0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
     0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
     0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
     0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
     0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
     0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
     0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
     0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
     0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
     0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
     0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
     0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
     0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
     0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
     4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
     4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
     3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
     7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
     0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
     0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
     0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
     0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
     0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
     0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
     0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
     0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
     0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
     0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
     0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
     6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
     2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
     0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
     0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
     7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
     0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
     5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
     0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
     4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
     4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
     0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
     0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
     0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
     0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
     0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
     0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
     0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
     0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
     0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
     0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
     ), .Dim = c(27L, 13L), .Dimnames = list(c("2", "0", "0", "0", 
    "1", "0", "0", "1", "1", "1", "2", "0", "0", "1", "2", "2", "1", 
    "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"), c("Gene1", 
    "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", 
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))

The caret package might make this easier.

Creating the data structure:

data <- structure(c(5.54e-05, 5.58e-06, 9.74e-05, 1.33e-06, 1.29e-05, 
            7.22e-06, 0.000215899, 3.6e-06, 0.000146724, 1.53e-05, 0.000913187, 
            1.9e-06, 0.007421464, 0.000648006, 5.1e-06, 6.15e-06, 4.73e-06, 
            0.000119899, 0.000884487, 0.000850632, 0.000236607, 7.36e-06, 
            8.48e-06, 2.63e-05, 0.001368493, 1.12e-05, 0.000177568, 0.006338532, 
            0.006162866, 0.040695132, 0.013255055, 0.033086619, 0.074158811, 
            0.004967497, 0.01247423, 0.043201417, 0.011470285, 0.038447751, 
            0.018825124, 0.027701807, 0.063373762, 0.005374513, 0.048876252, 
            0.009959848, 0.004434078, 0.004176856, 0.015288913, 0.060226053, 
            0.05128922, 0.006557554, 0.017460326, 0.007684784, 0.002107577, 
            0.005773192, 0.076186393, 0.037631043, 0.052159393, 0.012179365, 
            0.047199766, 0.022458838, 0.030261613, 0.00626629, 0.028664896, 
            0.02285845, 0.02801855, 0.017681676, 0.040563592, 0.029791175, 
            0.034778056, 0.019318473, 0.011847912, 0.009614177, 0.064027542, 
            0.035334149, 0.041638955, 0.056015014, 0.03304865, 0.017660205, 
            0.030187166, 0.057919531, 0.029990489, 0.000112884, 0.000920886, 
            0.001081748, 0.000195159, 0.001678445, 0.000171612, 0.000191702, 
            0.000560035, 0.000384056, 0.000454783, 0.000723385, 0.000203897, 
            0.000973337, 0.000822171, 0.000620526, 0.000260769, 0.000214607, 
            0.002077443, 0.00065843, 0.000403672, 0.000378651, 0.000409306, 
            0.001722587, 0.000213785, 0.000176643, 0.002022878, 0.001886929, 
            0.053029236, 0.022594965, 0.011967636, 0.026851113, 0.03773798, 
            0.031356268, 0.10410326, 0.063265216, 0.018028454, 0.116038001, 
            0.00572817, 0.053635968, 0.059126941, 0.011835241, 0.004639624, 
            0.014302911, 0.082948853, 0.015202238, 0.021295431, 0.043342, 
            0.008153675, 0.015613747, 0.043289609, 0.048834321, 0.019144763, 
            0.059809871, 0.006990685, 0.04082966, 0.02986135, 0.061405171, 
            0.006142619, 0.009767602, 0.035427993, 0.03729329, 0.01309739, 
            0.00221718, 0.040211393, 0.006303841, 0.030146612, 0.032033879, 
            0.024590398, 0.077991721, 0.017215666, 0.014731147, 0.04802582, 
            0.03168714, 0.03244771, 0.032278613, 0.017301885, 0.013450667, 
            0.040207755, 0.042669615, 0.03456749, 0.034631319, 1.93e-05, 
            4.72e-06, 5.41e-05, 0, 1.91e-05, 9.33e-07, 5.98e-06, 0, 1.05e-06, 
            4.1e-07, 7.72e-05, 4.07e-07, 0.000585154, 0.000246992, 7.86e-06, 
            3.13e-06, 2.14e-06, 7.56e-06, 9.29e-05, 0.000116024, 5.51e-05, 
            7.79e-06, 6.65e-06, 2.06e-06, 0.000104342, 4.16e-06, 1.27e-05, 
            0.000197502, 0.00015135, 0.000107306, 6.54e-05, 0.000225564, 
            0.000142631, 0.000168873, 3.5e-05, 0.000365242, 0.000174254, 
            0.000339327, 8.7e-05, 0.000136679, 0.000156634, 0.000224181, 
            0.000205305, 8.87e-05, 0.000305774, 0.000133615, 0.00015118, 
            0.000107229, 0.000162579, 0.000152249, 6.88e-05, 0.000113864, 
            0.000249258, 0.00024256, 0.00079296, 0.007640951, 0.004937327, 
            0.000422361, 0.000953513, 0.000951187, 0.000671306, 0.001106406, 
            0.002606568, 0.003006867, 0.001911646, 0.00135411, 0.012461738, 
            0.000434917, 0.00237646, 0.007857561, 0.000436889, 0.00048816, 
            0.000348146, 0.000931449, 0.000323974, 0.004945321, 0.000693845, 
            0.000479572, 0.000843415, 0.001419675, 0.001547478, 8.16e-05, 
            6.63e-05, 0.000101583, 3.08e-05, 0.000147039, 5.13e-05, 0.000109479, 
            2.39e-05, 0.000225475, 4.28e-05, 0.000230785, 2.1e-05, 0.0001356, 
            0.000124173, 0.000245128, 0.000275446, 3.18e-05, 0.00017516, 
            0.000180192, 0.000246669, 0.000378708, 4.35e-05, 0.000267824, 
            7.2e-05, 7.65e-05, 8.79e-05, 0.000130026, 0.000111462, 3.17e-05, 
            0.000200096, 3.12e-06, 8.75e-05, 3.11e-06, 6.89e-06, 0.000165936, 
            5.98e-05, 0.000201355, 5.92e-06, 2.57e-05, 2.53e-05, 3.27e-05, 
            0.000137446, 0.000134402, 5.86e-07, 3.9e-05, 0.018886909, 0.050343466, 
            4.15e-05, 1.67e-05, 0.000172614, 4.95e-05, 1.27e-05, 9.85e-05, 
            4.28e-05, 0.002708402, 0.003215586, 0.00457116, 0.001713549, 
            0.024353184, 0.006660748, 0.003198887, 0.003094386, 0.004789163, 
            0.002816955, 0.021587313, 0.002084725, 0.00378062, 0.021751495, 
            0.009097143, 0.012216225, 0.001125765, 0.013043534, 0.005514773, 
            0.008323962, 0.026898764, 0.002149135, 0.008021623, 0.006673567, 
            0.005391139, 0.018578559, 0.013786297, 0.00080595, 0.001289505, 
            0.002451416, 0.000234107, 0.001694733, 0.000288175, 0.002357478, 
            0.000856129, 0.00159752, 0.000117538, 0.000166581, 0.000367288, 
            0.001039841, 0.001779528, 0.000438092, 0.001012515, 0.000529936, 
            0.003193086, 0.002562702, 0.00277401, 0.003013136, 0.001349197, 
            0.001646296, 0.001114222, 0.001207882, 0.002804949, 0.000366419
), .Dim = c(27L, 13L), .Dimnames = list(
  c("2", "0", "0", "0", "1", "0", "0", "1", "1", "1", "2", "0", "0", "1",
    "2", "2", "1", "2", "2", "2", "2", "1", "1", "2", "2", "0", "0"),
  c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8",
    "Gene9", "Gene10", "Gene11", "Gene12", "Gene13")))

And transforming your data:

library(caret)

# Estimate a Box-Cox transformation for each column
# (Box-Cox requires strictly positive values)
preProcessValues <- preProcess(data, method = "BoxCox")

# Apply the estimated transformations to the data
dataBC <- predict(preProcessValues, data)
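
If you would rather see what happens column by column, the two steps asked about in the question (estimating a lambda per column, then transforming each column with its own lambda) can be sketched with MASS alone. This is only a rough illustration under stated assumptions: `expr` is a small simulated stand-in for the real matrix, `estimate_lambda` and `box_cox` are hypothetical helper names, and the intercept-only model `y ~ 1` is the one suggested in the question. The posted data contain exact zeros, and Box-Cox needs strictly positive values, so the sketch leaves any such column untransformed. That is one possible choice, not the only one.

```r
library(MASS)  # ships with R; provides boxcox()

# Hypothetical stand-in for the expression matrix (27 samples x 3 genes).
set.seed(1)
expr <- matrix(rlnorm(27 * 3, meanlog = -6, sdlog = 1.5), nrow = 27,
               dimnames = list(NULL, c("GeneA", "GeneB", "GeneC")))
expr[c(4, 8), "GeneC"] <- 0  # mimic the exact zeros in the posted data

# Box-Cox needs strictly positive values, so flag the usable columns.
positive <- apply(expr, 2, function(y) all(y > 0))

# Lambda that maximizes the profile log-likelihood of an intercept-only model.
estimate_lambda <- function(y, grid = seq(-2, 2, by = 0.05)) {
  bc <- boxcox(y ~ 1, lambda = grid, plotit = FALSE)
  bc$x[which.max(bc$y)]
}

# Box-Cox transform for a given lambda (log in the limit lambda -> 0).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# One lambda per strictly positive column.
lambdas <- apply(expr[, positive, drop = FALSE], 2, estimate_lambda)

# Transform each column with its own lambda; columns with zeros stay as they are.
expr_bc <- expr
for (g in names(lambdas)) {
  expr_bc[, g] <- box_cox(expr[, g], lambdas[[g]])
}

# PCA on centered, unit-variance columns of the transformed matrix.
pca <- prcomp(expr_bc, center = TRUE, scale. = TRUE)
```

For columns that contain zeros, an alternative to skipping them is a transformation that accepts non-positive values, for example `method = "YeoJohnson"` in caret's `preProcess`. Whether that suits expression data is a judgment call.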