Normalizing columns in mixed numeric/non-numeric DataFrame with tidyverse (dplyr)?
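I often need to normalize the columns of DataFrames that mix numeric and non-numeric columns. Sometimes I know the names of the numeric columns, sometimes I don't.
I've tried approaches that seem perfectly logical to me in terms of tidy evaluation. Most of them don't work. I've only found one that does.
To better understand tidy evaluation, could someone explain why each of the following does or doesn't work?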
library(tidyverse)
df = data.frame(
A=runif(10, 1, 10),
B=runif(10, 1, 10),
C=rep(0, 10),
D=LETTERS[1:10]
)
df
#> A B C D
#> 1 2.157171 1.434351 0 A
#> 2 7.746638 6.987983 0 B
#> 3 7.861337 1.528145 0 C
#> 4 8.657990 4.101441 0 D
#> 5 8.307844 5.809815 0 E
#> 6 1.376084 9.202047 0 F
#> 7 7.197999 5.532681 0 G
#> 8 1.878676 1.012917 0 H
#> 9 2.231955 4.572273 0 I
#> 10 4.340488 2.640728 0 J
print("Does normalize columns, but can't handle col of 0s")
#> [1] "Does normalize columns, but can't handle col of 0s"
test = df %>% mutate_if(is.numeric, ~./sum(.))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 1 1 NaN
print("Virtually the same as above, but tries to handle col of 0s, but doesn't work")
#> [1] "Virtually the same as above, but tries to handle col of 0s, but doesn't work"
test = df %>% mutate_if(is.numeric, ~ifelse(sum(.)>0, ./sum(.), 0))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 0.4167949 0.3349536 0.0000000
print("Does normalize columns, but can't handle col of 0s")
#> [1] "Does normalize columns, but can't handle col of 0s"
test = df %>% mutate_if(is.numeric, function(x) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 1 1 NaN
print("Virtually the same as above, but tries to handle col of 0s, but doesn't work")
#> [1] "Virtually the same as above, but tries to handle col of 0s, but doesn't work"
test = df %>% mutate_if(is.numeric, function(x) ifelse(sum(x)>0, x/sum(x), 0))
test %>% select_if(is.numeric) %>% colSums()
#> A B C
#> 0.4167949 0.3349536 0.0000000
print("Strange error I don't understand")
#> [1] "Strange error I don't understand"
test = df %>% mutate_if(is.numeric, ~apply(., 2, function(x) x/sum(x)))
#> Error in apply(., 2, function(x) x/sum(x)): dim(X) must have a positive length
print("THIS DOES WORK! Why?")
#> [1] "THIS DOES WORK! Why?"
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
#> A B
#> 1 1
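Created on 2019-10-29 by the reprex package (v0.3.0)
EDIT!!!
ACK! Just found a big problem.
In the last example, the one that "works", the column of 0s is dropped. I don't understand this at all. I want to keep that column, just not normalize it.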
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
> test
# A B D
# 1 0.15571120 0.12033237 A
# 2 0.10561824 0.11198394 B
# 3 0.06041408 0.12068372 C
# 4 0.16785724 0.06241538 D
# 5 0.03112945 0.02559354 E
# 6 0.02791520 0.06363215 F
# 7 0.17132200 0.16625761 G
# 8 0.06641540 0.14038458 H
# 9 0.04015548 0.12420858 I
# 10 0.17346171 0.06450813 J
EDIT 2
I realized I needed to include an else.
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) {x/sum(x)}else{0})
> test
# A B C D
# 1 0.15571120 0.12033237 0 A
# 2 0.10561824 0.11198394 0 B
# 3 0.06041408 0.12068372 0 C
# 4 0.16785724 0.06241538 0 D
# 5 0.03112945 0.02559354 0 E
# 6 0.02791520 0.06363215 0 F
# 7 0.17132200 0.16625761 0 G
# 8 0.06641540 0.14038458 0 H
# 9 0.04015548 0.12420858 0 I
# 10 0.17346171 0.06450813 0 J
numeric_columns =
df %>%
select_if(is.numeric) %>%
colnames()
test = df %>% mutate_at(numeric_columns, function(x) if (sum(x) > 0) x/sum(x) else 0)
> test
# A B C D
# 1 0.15571120 0.12033237 0 A
# 2 0.10561824 0.11198394 0 B
# 3 0.06041408 0.12068372 0 C
# 4 0.16785724 0.06241538 0 D
# 5 0.03112945 0.02559354 0 E
# 6 0.02791520 0.06363215 0 F
# 7 0.17132200 0.16625761 0 G
# 8 0.06641540 0.14038458 0 H
# 9 0.04015548 0.12420858 0 I
# 10 0.17346171 0.06450813 0 J
First problem
test = df %>% mutate_if(is.numeric, ~./sum(.))
test %>% select_if(is.numeric) %>% colSums(na.rm = TRUE)
test = df %>% mutate_if(is.numeric, function(x) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
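You can handle this by specifying na.rm = TRUE, so that the NaN values are ignored when the columns are summed.
They occur because you are dividing by 0.
The second syntax behaves identically. mutate_if applies the operation to each numeric column, so for the third column (C) it returns NaN because sum(C) is 0.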
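To make the source of the NaN concrete, here is a minimal base R sketch. The vector name `zeros` is only illustrative and stands in for column C from the question:

```r
# A column of zeros, like column C in the question
zeros <- rep(0, 10)

# sum(zeros) is 0, so every element becomes 0/0, which R evaluates to NaN
normalized <- zeros / sum(zeros)
all(is.nan(normalized))

# Summing NaN values gives NaN unless they are removed first
sum(normalized)
sum(normalized, na.rm = TRUE)
```

Note that na.rm = TRUE only changes how the sum is reported; the column itself still holds NaN.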
Second problem
test = df %>% mutate_if(is.numeric, function(x){ifelse(x > 0, x/sum(x), rep(0, length(x)))})
test %>% select_if(is.numeric) %>% colSums()
test = df %>% mutate_if(is.numeric, function(x) ifelse(sum(x)>0, x/sum(x), 0))
test %>% select_if(is.numeric) %>% colSums()
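ifelse returns a value with the same shape as its test argument. In your case the test is 'sum(x) > 0', which has length 1, so only the first value of x/sum(x) is returned; mutate_if then recycles that single value down the whole column. See: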
https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/ifelse
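A small base R sketch of the difference between a length-1 test and an element-wise test. The vector `x` and the names `short` and `full` are made up for illustration:

```r
x <- c(1, 2, 3, 4)

# sum(x) > 0 is a single TRUE, so ifelse() returns a single value:
# the first element of x / sum(x)
short <- ifelse(sum(x) > 0, x / sum(x), 0)
length(short)

# An element-wise test has the same length as x, so the result does too
full <- ifelse(x > 0, x / sum(x), 0)
length(full)
sum(full)
```

This is why the column sums in the question come out as ten times the first normalized value rather than 1.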
Third problem
test = df %>% mutate_if(is.numeric, ~apply(., 2, function(x) x/sum(x)))
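This one is tricky: mutate_if passes each column to the function as a vector, and you then call apply on it. A vector has no dim attribute, whereas apply only works on objects that have dimensions, such as a matrix or a data frame.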
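The error can be reproduced without dplyr. In this sketch the objects `v` and `m` are hypothetical examples, not data from the question:

```r
# A plain vector has no dimensions, so apply() cannot pick a margin
v <- c(1, 2, 3)
dim(v)
failed <- inherits(try(apply(v, 2, sum), silent = TRUE), "try-error")
failed

# A matrix has dimensions, so apply() over columns (margin 2) works
m <- cbind(A = c(1, 2, 3), B = c(2, 2, 4))
dim(m)
normalized <- apply(m, 2, function(x) x / sum(x))
colSums(normalized)
```

Inside mutate_if the function already receives one column at a time, so the inner apply is unnecessary.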
A good answer
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0) x/sum(x))
test %>% select_if(is.numeric) %>% colSums()
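This syntax is indeed valid, because if does not need to return an object of a particular size. Note, however, that when the condition is false and there is no else, if returns NULL, which is why column C disappears from the result.
You can also use ifelse, but with a vectorized condition: for non-negative data, as long as at least one element is greater than 0, the sum is non-zero.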
test = df %>% mutate_if(is.numeric, function(x){ifelse(x > 0, x/sum(x), rep(0, length(x)))})
test %>% select_if(is.numeric) %>% colSums()
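I hope this helps you understand what is happening when you get these errors. The solution is not unique.
EDIT 1:
The reason is that you only return something when the sum is strictly greater than 0. When it is not, you have to specify what to return. For example: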
test = df %>% mutate_if(is.numeric, function(x) if(sum(x)>0){x/sum(x)}else{0})
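The same effect can be seen in base R, where assigning NULL to a column also removes it. The helper names `normalize_or_null` and `normalize_or_zero` and the data frame `d` are invented for this sketch:

```r
# Without else, a false condition makes if() return NULL
normalize_or_null <- function(x) if (sum(x) > 0) x / sum(x)
is.null(normalize_or_null(rep(0, 3)))

# With else, the function always returns a value
normalize_or_zero <- function(x) if (sum(x) > 0) x / sum(x) else 0

d <- data.frame(A = c(1, 3), C = c(0, 0))

dropped <- d
dropped$C <- normalize_or_null(dropped$C)  # assigning NULL removes the column
names(dropped)

kept <- d
kept$C <- normalize_or_zero(kept$C)        # the single 0 is recycled over the column
names(kept)
```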
@Rémi Coulaud has already explained very well why things do or don't work. Another way to handle this could be (updated per @42-'s comment):
df %>%
mutate_if(~ is.numeric(.) && sum(.) != 0, ~ ./sum(.))
A B C D
1 0.15735803 0.12131787 0 A
2 0.08098114 0.10229536 0 B
3 0.06108911 0.09802935 0 C
4 0.13152492 0.15719599 0 D
5 0.10684839 0.10477812 0 E
6 0.14204157 0.10385447 0 F
7 0.09731823 0.11015997 0 G
8 0.15532621 0.10458007 0 H
9 0.02579446 0.05748756 0 I
10 0.04171793 0.04030124 0 J
And then:
df %>%
mutate_if(~ is.numeric(.) && sum(.) != 0, ~ ./sum(.)) %>%
select_if(is.numeric) %>%
colSums()
A B C
1 1 0
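For readers on dplyr 1.0.0 or later, where the scoped verbs such as mutate_if are superseded, the same idea might be written with across() and where(). This is only a sketch: the `normalize` helper and the small data frame are illustrative assumptions, not part of the original answers.

```r
library(dplyr)

# Small stand-in for the data frame in the question
df <- data.frame(A = c(1, 3), B = c(2, 2), C = c(0, 0), D = c("a", "b"))

# Hypothetical helper: divide by the column total, or leave the column
# untouched when the total is 0
normalize <- function(x) {
  total <- sum(x)
  if (total != 0) x / total else x
}

result <- df %>% mutate(across(where(is.numeric), normalize))
colSums(result[c("A", "B", "C")])
```

Keeping the zero check inside the helper means every numeric column is returned, so nothing is dropped.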