Converting a dataframe from factors to numeric creates all NAs
I have a very wide data frame containing the following column classes:
character factor labelled numeric
6 1 945 2
where labelled comes from the haven package (Stata import) and is used as a factor. See some sample data below:
matchcode S001 S002 S003 S003A S004 S006 S007 S007_01 S009 S009A S010 S011 S012 S013 S013B S016 S017 S017A S018 S018A S019 S019A S020 S021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl+lbl>
1 "JPN 198~ 2 1 392 392 NA 494 494 3920120494 JP JP NA NA NA NA NA NA 1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981 39201211981
2 "JPN 198~ 2 1 392 392 NA 115 115 3920120115 JP JP NA NA NA NA NA NA 0.6789805 0.6789805 0.5639373 0.5639373 0.8459059 0.8459059 1981 39201211981
3 "JPN 198~ 2 1 392 392 NA 949 949 3920120949 JP JP NA NA NA NA NA NA 1.0897217 1.0897217 0.9050845 0.9050845 1.3576267 1.3576267 1981 39201211981
4 "MEX 198~ 2 1 484 484 NA 112 1315 4840120111 MX MX NA NA NA NA NA NA 0.7965188 0.7965188 0.4335976 0.4335976 0.6503964 0.6503964 1981 48401211981
5 "MEX 198~ 2 1 484 484 NA 1042 2238 4840121034 MX MX NA NA NA NA NA NA 1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981 48401211981
6 "MEX 198~ 2 1 484 484 NA 1315 2510 4840121306 MX MX NA NA NA NA NA NA 1.1378840 1.1378840 0.6194252 0.6194252 0.9291378 0.9291378 1981 48401211981
7 "HUN 198~ 2 1 348 348 NA 250 3291 3480120250 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
8 "HUN 198~ 2 1 348 348 NA 943 3984 3480120943 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
9 "HUN 198~ 2 1 348 348 NA 726 3767 3480120726 HU HU NA NA NA NA NA NA 1.0635516 1.0635516 0.7264696 0.7264696 1.0897045 1.0897045 1982 34801211982
10 "AUS 198~ 2 1 36 36 NA 342 4847 360120342 AU AU NA NA NA NA NA NA 0.9616138 0.9616138 0.7830731 0.7830731 1.1746096 1.1746096 1981 3601211981
I converted the negative values in the dataset to NA (which is what they represent):
df[df < 0] <- NA
df<- df[,colMeans(is.na(df)) <= 0.999]
I want to convert all factors to numeric (so that I can later take the mean of each variable) as follows:
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor)
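This worked initially. However, after replacing all negative values with NA, it no longer does, and everything becomes NA. It seems the function has trouble dealing with NAs? If so, how can I handle that?
The idea is to eventually summarise (take the mean of) each variable per country-year: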
cols = sapply(WVS, is.numeric)
cols = names(cols)[cols]
dfclevel= df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
In the end I tried changing the order of the steps to get around the NAs:
df <- as.data.frame(df)
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
df[] = lapply(df, as.numeric.factor)
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
df[df < 0] <- NA
df <- df[,colMeans(is.na(df)) <= 0.999]
df <- data.table(df)
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
But then I get:
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.frame`(df, , lapply(.SD, mean, na.rm = TRUE), :
unused arguments (.SDcols = cols, by = matchcode)
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
Error in `[.data.table`(df, , lapply(.SD, mean, na.rm = TRUE), :
Some items of .SDcols are not column names (or are NA)
I tried it without .SDcols = cols, and then I get:
> df <- as.data.frame(df)
> as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
> df[] = lapply(df, as.numeric.factor)
Error in `[<-.data.frame`(`*tmp*`, , value = list(matchcode = c(NA_real_, :
replacement element 6 has 717 rows, need 720
In addition: Warning message:
In FUN(X[[i]], ...) : NAs introduced by coercion
> df <- data.table(df)
> dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), by=matchcode]
Error in gmean(S009, na.rm = TRUE) :
Type 'character' not supported by GForce mean (gmean) na.rm=TRUE. Either add the prefix base::mean(.) or turn off GForce optimization using options(datatable.optimize=1)
I have been struggling with this for a while and would really appreciate your help.
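Sticking with the OP's last approach: the function used for the conversion needs to be replaced by one that, although less efficient, can handle NAs, namely: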
as.numeric(as.character(x))
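To illustrate the difference between the two conversions, here is a minimal sketch on toy vectors (the names by_levels, by_character, f, n and s are hypothetical and not from the original data). It suggests that the levels-based version only works on true factors: for numeric or character input, levels(x) is NULL, so the lookup indexes an empty vector and returns NA, and zero values are dropped from the index altogether, which may explain the "717 rows, need 720" error.

```r
# Converter from the question: looks values up in the factor levels.
by_levels <- function(x) as.numeric(levels(x))[x]
# Converter from the answer: goes through the character representation.
by_character <- function(x) as.numeric(as.character(x))

f <- factor(c("10", "20", NA, "10"))  # a real factor
n <- c(2, 1, NA, 392)                 # a plain numeric column
s <- c("JP", "JP")                    # a plain character column

by_levels(f)     # 10 20 NA 10
by_character(f)  # 10 20 NA 10

# levels(n) is NULL, so this indexes numeric(0): every result is NA.
by_levels(n)     # NA NA NA NA
by_character(n)  # 2 1 NA 392

# Character input gives NA with the levels-based converter as well.
by_levels(s)     # NA NA

# A zero index is silently dropped, so the result is shorter than the input.
length(by_levels(c(0, 1, 2)))  # 2
```

Note that as.numeric(as.character(x)) still turns genuinely non-numeric strings such as "JP" into NA (with a coercion warning), so character columns are better left out of the conversion.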
The code then becomes:
df <- as.data.frame(df)
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
df[] = lapply(df, as.numeric.factor)
df[df < 0] <- NA
df <- df[,colMeans(is.na(df)) <= 0.999]
df <- data.table(df)
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
dfclevel = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=matchcode]
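One caveat with converting every column: a character key such as matchcode would itself be coerced to NA and could then be dropped by the NA filter. As a hedged alternative, the sketch below converts only factor columns and recodes negatives only in numeric columns, leaving the grouping key untouched. The toy data and the names dt, factor_cols and num_cols are assumptions made for illustration. Columns of class haven_labelled are not covered here and may need their labels stripped first (for instance with haven::zap_labels()).

```r
library(data.table)

# Toy stand-in for the real data (hypothetical values).
df <- data.frame(
  matchcode = c("JPN 1981", "JPN 1981", "MEX 1981", "MEX 1981"),
  S001 = factor(c("2", "2", "-1", "2")),
  S017 = c(1.5, -4, 0.5, 1.5),
  S009 = c("JP", "JP", "MX", "MX"),
  stringsAsFactors = FALSE
)
dt <- as.data.table(df)

# Convert factor columns only; character and numeric columns are left alone.
factor_cols <- names(dt)[sapply(dt, is.factor)]
dt[, (factor_cols) := lapply(.SD, function(x) as.numeric(as.character(x))),
   .SDcols = factor_cols]

# Recode negative values to NA, in numeric columns only.
num_cols <- names(dt)[sapply(dt, is.numeric)]
dt[, (num_cols) := lapply(.SD, function(x) replace(x, which(x < 0), NA)),
   .SDcols = num_cols]

# Mean of every numeric column per matchcode.
dfclevel <- dt[, lapply(.SD, mean, na.rm = TRUE),
               by = matchcode, .SDcols = num_cols]
print(dfclevel)
```

Because num_cols is computed after the conversion and on the same table that is aggregated, every name passed to .SDcols is guaranteed to exist, which avoids the "Some items of .SDcols are not column names" error from the question.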