R: vectorized matrix into numeric data frame while preserving factors
I have a matrix given in the following form:
m <- as.matrix(rbind(c("State", "Murder", "Assault", "UrbanPop", "Rape", "Group"),
c("Alabama", 13.2, 236, 58, 21.2, "A"),
c("Alaska", 10.0, 263, 48, 44.5, "A"),
c("Arizona", 8.1, 294, 80, 31.0, "A"),
c("Arkansas", 8.8, 190, 50, 19.5, "A"),
c("California", 9.0, 276, 91, 40.6, "A"),
c("Colorado", 7.9, 204, 78, 38.7, "A"),
c("Connecticut", 3.3, 110, 77, 11.1, "A"),
c("Delaware", 5.9, 238, 72, 15.8, "A"),
c("Florida", 15.4, 335, 80, 31.9, "A"),
c("Georgia", 17.4, 211, 60, 25.8, "A"),
c("Hawaii", 5.3, 46, 83, 20.2, "A"),
c("Idaho", 2.6, 120, 54, 14.2, "A"),
c("Illinois", 10.4, 249, 83, 24.0, "A"),
c("Indiana", 7.2, 113, 65, 21.0, "A"),
c("Iowa", 2.2, 56, 57, 11.3, "A"),
c("Kansas", 6.0, 115, 66, 18.0, "A"),
c("Kentucky", 9.7, 109, 52, 16.3, "A"),
c("Louisiana", 15.4, 249, 66, 22.2, "A"),
c("Maine", 2.1, 83, 51, 7.8, "B"),
c("Maryland", 11.3, 300, 67, 27.8, "B"),
c("Massachusetts", 4.4, 149, 85, 16.3, "B"),
c("Michigan", 12.1, 255, 74, 35.1, "B"),
c("Minnesota", 2.7, 72, 66, 14.9, "B"),
c("Mississippi", 16.1, 259, 44, 17.1, "B"),
c("Missouri", 9.0, 178, 70, 28.2, "B"),
c("Montana", 6.0, 109, 53, 16.4, "B"),
c("Nebraska", 4.3, 102, 62, 16.5, "C"),
c("Nevada", 12.2, 252, 81, 46.0, "C"),
c("New_Hampshire", 2.1, 57, 56, 9.5, "C"),
c("New_Jersey", 7.4, 159, 89, 18.8, "C"),
c("New_Mexico", 11.4, 285, 70, 32.1, "C"),
c("New_York", 11.1, 254, 86, 26.1, "C"),
c("North_Carolina", 13.0, 337, 45, 16.1, "C"),
c("North_Dakota", 0.8, 45, 44, 7.3, "C"),
c("Ohio", 7.3, 120, 75, 21.4, "D"),
c("Oklahoma", 6.6, 151, 68, 20.0, "D"),
c("Oregon", 4.9, 159, 67, 29.3, "D"),
c("Pennsylvania", 6.3, 106, 72, 14.9, "D"),
c("Rhode_Island", 3.4, 174, 87, 8.3, "D"),
c("South_Carolina", 14.4, 279, 48, 22.5, "D"),
c("South_Dakota", 3.8, 86, 45, 12.8, "D"),
c("Tennessee", 13.2, 188, 59, 26.9, "D"),
c("Texas", 12.7, 201, 80, 25.5, "D"),
c("Utah", 3.2, 120, 80, 22.9, "D"),
c("Vermont", 2.2, 48, 32, 11.2, "D"),
c("Virginia", 8.5, 156, 63, 20.7, "D"),
c("Washington", 4.0, 145, 73, 26.2, "D"),
c("West_Virginia", 5.7, 81, 39, 9.3, "D"),
c("Wisconsin", 2.6, 53, 66, 10.8, "D"),
c("Wyoming", 6.8, 161, 60, 15.6, "D")))
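I need to convert it into a data.frame (or table) while preserving the column names and row names, keeping numbers numeric, and converting everything else (the column 'Group' in this example) into factors. (The data are not always in this format, so the code has to be generic.)
(An optional step is to remove a column by a given name, which is why a data.frame is used, since that makes it easy.)
The resulting data.frame (or table, or matrix) is then passed to the 'scale' function.
My solution consists of several steps: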
data <- m[-1,-1]
colnames(data) <- m[1,-1]
rownames(data) <- m[-1,1][m[-1,1]!='']
data <- as.data.frame(data)
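Now I have a data.frame, but it cannot be passed to the scale() function ("Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"). If I use data.matrix(data), the factors are converted to integers nicely, but all the doubles are converted to integers as well. I have been stuck on this for hours.
Thanks in advance.
Read it in as a data.frame and do the rest afterwards: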
m = data.frame(rbind.... you data here as above)
rownames(m) = m$X1
colnames(m) = c(t(m[1,]))
req.df = m[-1,-1]
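I'll move this into an answer, since it does not seem to work in a comment. You can do the following:
data <- data.frame(lapply(data.frame(m[-1,-1], stringsAsFactors = FALSE), type.convert, as.is = FALSE))
This converts every column of the matrix to the appropriate type. (In R 4.0.0 and later, type.convert needs as.is = FALSE to produce factors; older versions did this by default.)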
str(data)
# 'data.frame': 50 obs. of 5 variables:
# $ X1: num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
# $ X2: int 236 263 294 190 276 204 110 238 335 211 ...
# $ X3: int 58 48 80 50 91 78 77 72 80 60 ...
# $ X4: num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
# $ X5: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
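To see what type.convert decides for each kind of column, here is a minimal sketch on made-up character vectors (they are illustrative stand-ins, not the OP's data). It assumes only base R; as.is is passed explicitly because R 4.0.0 and later warn when it is left out.

```r
# Made-up character vectors standing in for columns of a character matrix
dbl <- type.convert(c("13.2", "10.0", "8.1"), as.is = FALSE)
int <- type.convert(c("236", "263", "294"), as.is = FALSE)
fac <- type.convert(c("A", "B", "A"), as.is = FALSE)

class(dbl)  # "numeric"
class(int)  # "integer"
class(fac)  # "factor"

# With as.is = TRUE, non-numeric strings stay character instead of becoming factors
chr <- type.convert(c("A", "B", "A"), as.is = TRUE)
class(chr)  # "character"
```

So the same call yields doubles, integers, or factors depending on what the strings look like, which is why it works without knowing the column layout in advance.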
Then you can set the column/row names as needed:
colnames(data) <- m[1,-1]
rownames(data) <- m[-1,1][m[-1,1]!='']
For scale you can do:
scale(data[-5])
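As a rough illustration of what scale returns and how the factor could be put back afterwards, here is a small sketch. The toy data frame and its values are made up, and dropping the column by name rather than by position is my own variation, not part of the answer above.

```r
# Toy data frame (made-up values) standing in for the converted data
toy <- data.frame(Murder = c(13.2, 10.0, 8.1, 8.8),
                  Assault = c(236L, 263L, 294L, 190L),
                  Group = factor(c("A", "A", "B", "B")),
                  row.names = c("s1", "s2", "s3", "s4"))

# Drop the factor column by name, then center and scale the rest
scaled <- scale(toy[names(toy) != "Group"])

is.matrix(scaled)             # scale returns a numeric matrix, not a data frame
round(colMeans(scaled), 10)   # column means are 0 after centering
apply(scaled, 2, sd)          # column standard deviations are 1 after scaling

# Put the factor back next to the scaled values
result <- data.frame(scaled, Group = toy$Group)
```

The row names survive because scale keeps the dimnames of its input.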
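Edit in response to the OP's comment.
As I have said several times already, using data.matrix on a factor is plain wrong; it will completely mess up your data. Consider the following example: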
data.matrix(data.frame(A = factor(c("A", "B")),
B = factor(10:11),
C = factor(c("22-11-2014", "23-11-2014"))))
# A B C
# [1,] 1 1 1
# [2,] 2 2 2
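data.matrix returned the same result for these completely different values.
Now, back to your real data: if you want to avoid running scale on factors and you don't know a priori which columns are factors, you can simply create an index that identifies the numeric columns and then run scale only on those, e.g.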
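The reason is that a factor stores integer level codes, and those codes are what a direct numeric conversion sees. A minimal sketch with a made-up factor:

```r
f <- factor(c("10", "11", "10"))

codes  <- as.integer(f)                # the internal level codes
values <- as.numeric(as.character(f))  # the numbers the labels actually spell out

codes   # 1 2 1
values  # 10 11 10
```

Going through as.character first is the usual way to recover the original numbers from a factor whose labels are numeric.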
indx <- sapply(data, is.numeric)
scale(data[indx])
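Putting the pieces together, the whole pipeline could be wrapped roughly as follows. This is only a sketch: matrix_to_df and scale_numeric are hypothetical helper names, the toy matrix is made up, and it assumes the first row holds the column names and the first column holds the row names, as in the question.

```r
# Hypothetical helper: character matrix with a header row and a name column
# becomes a typed data frame; `drop` optionally removes columns by name.
matrix_to_df <- function(m, drop = NULL) {
  df <- as.data.frame(m[-1, -1, drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, -1]
  rownames(df) <- m[-1, 1]
  # numbers become numeric/integer, everything else becomes a factor
  df[] <- lapply(df, type.convert, as.is = FALSE)
  df[setdiff(names(df), drop)]
}

# Hypothetical helper: scale only the numeric columns
scale_numeric <- function(df) {
  indx <- vapply(df, is.numeric, logical(1))
  scale(df[indx])
}

# Made-up miniature of the matrix in the question
toy <- rbind(c("State", "Murder", "Assault", "Group"),
             c("Alabama", 13.2, 236, "A"),
             c("Alaska", 10.0, 263, "A"),
             c("Maine", 2.1, 83, "B"))

df <- matrix_to_df(toy, drop = "Assault")
str(df)
scale_numeric(df)
```

vapply is used instead of sapply only so that the index is guaranteed to be a logical vector even for a data frame with no columns.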
Below is a quick attempt that keeps both numeric and factor types.
# convert into data frame
df <- as.data.frame(m[2:nrow(m), 2:ncol(m)], stringsAsFactors = FALSE)
# set names
names(df) <- m[1, 2:ncol(m)]
rownames(df) <- m[2:nrow(m), 1]
# convert types into numeric or factor
df[] <- lapply(df, function(x) if(is.na(as.numeric(x[1]))) as.factor(x) else as.numeric(x))
str(df)
'data.frame': 50 obs. of 5 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : num 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: num 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
$ Group : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
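One caveat with deciding from x[1] alone: a column whose first entry happens to look numeric is treated as numeric throughout, and the as.numeric probe emits a coercion warning for text columns. A possible variation that checks the whole column is sketched below; to_num_or_factor is a hypothetical name and the test vectors are made up.

```r
# Hypothetical helper: numeric only if every non-missing entry parses as a number
to_num_or_factor <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  # a new NA after conversion means some entry was not a number
  if (all(is.na(num) == is.na(x))) num else as.factor(x)
}

mixed <- c("12", "n/a", "7")  # first entry looks numeric, second does not

class(to_num_or_factor(mixed))           # "factor"
class(to_num_or_factor(c("1.5", "2")))   # "numeric"
class(to_num_or_factor(c("1.5", NA)))    # "numeric"
```

It can be dropped into the lapply call in place of the anonymous function.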