How to apply a function to multiple columns to create multiple new columns in R?
I have this list of sequences, aqi_range, and a data frame, df:
aqi_range = list(0:50,51:100,101:250)
df
   PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max
1       85.6        3      264       75.7         3       240
2      105.         6      243       76.4         3       191
3       95.8       19      287       48.4         8       134
4       85.5       50      166       64.8        32       103
5       55.9       24      117       46.7        19        77
6       37.5        6      116       31.3         3        87
7       26          5       69       15.5         3        49
8       82.3       34      169       49.6        25       120
9      170         68      272      133          67       201
10     254        189      323      226         173       269
I have created these two very simple functions, and I want to apply them to this data frame to compute the AQI (Air Quality Index) for each pollutant.
#a = column from a dataframe **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
min_max_diff <- function(a,b){
  for (i in b){
    if (a %in% i){
      min_val = min(i)
      max_val = max(i)
      return (max_val - min_val)
    }
  }
}
#a = column from a dataframe **PM10_mean, PM2.5_mean**
#b = list of sequences defined above
c_low <- function(a,b){
  for (i in b){
    if (a %in% i){
      min_val = min(i)
      return(min_val)
    }
  }
}
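Basically, the first function, "min_max_diff", takes a value from the column df$PM10_mean / df$PM2.5_mean, looks it up in the list "aqi_range", and returns a value (the difference between the minimum and the maximum of the sequence that contains it). Similarly, the second function, "c_low", just returns the minimum of that sequence.
I want to apply this operation (the formula defined below) to the PM10_mean column to create a new column, PM10_AQI: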
df$PM10_AQI = min_max_diff(df$PM10_mean,aqi_range) / (df$PM10_max - df$PM10_min) / * (df$PM10_mean - df$PM10_min) + c_low(df$PM10_mean,aqi_range)
I hope I have explained it clearly.
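If your question is simply how to compute a given transformation on several columns of a data frame, you can write a for loop that builds the name of each variable involved in the transformation with string transformation functions (sub() is useful in this case) and refers to the data frame columns with the [ notation (as opposed to the $ notation, because the [ notation accepts strings to specify columns).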
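As a minimal sketch of those two building blocks, the snippet below derives one column name from another with sub() and then uses the resulting strings with the [ notation. The data frame df_demo and the variables col_mean, col_min and spread are made up for illustration only.

```r
# Toy data frame; the column names follow the naming pattern used in the question
df_demo <- data.frame(PM10_mean = c(85.6, 105.0),
                      PM10_min  = c(3, 6))

col_mean <- "PM10_mean"
col_min  <- sub("_mean$", "_min", col_mean)   # builds the string "PM10_min"

# `[` accepts a column name stored in a variable
spread <- df_demo[, col_mean] - df_demo[, col_min]

# `$` does not evaluate the variable: it looks for a column named "col_mean"
is.null(df_demo$col_mean)   # TRUE, no such column exists
```

This is why the loop-based approach relies on [ rather than $: the column names are only known as strings at run time.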
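Below I show an example of such code, using a small sample dataset with 3 observations:
(Note that I changed the definition of the AQI range values: I now define only the break points where the range changes, assuming they are all integers. I also merged your functions min_max_diff() and c_low() into a single function that returns the minimum and the maximum of the AQI range in which each value is found, again assuming that AQI values are integers.)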
# Definition of the AQI ranges (which are assumed to be based on integer values)
# Note that if the number of AQI ranges is k, the number of breaks is k+1
# Each break value defines the minimum of the range
# The maximum of each range is computed as the "minimum of the NEXT range" - 1
# (again this assumes integer values in AQI ranges)
# The values (e.g. PM10_mean) whose AQI range is searched for are assumed
# to NOT be larger than or equal to the largest break value.
aqi_range_breaks = c(0, 51, 101, 251)
# Example data (top 3 rows of the data frame you provided)
df = data.frame(PM10_mean=c(85.6, 105.0, 95.8),
                PM10_min=c(3, 6, 19),
                PM10_max=c(264, 243, 287),
                PM2.5_mean=c(75.7, 76.4, 48.4),
                PM2.5_min=c(3, 3, 8),
                PM2.5_max=c(240, 191, 134))
# Function that returns the minimum and maximum AQI values
# of the AQI range where the given values are found
# `values`: array of values that are searched for in the AQI ranges
#   defined by the second parameter.
# `aqi_range_breaks`: breaks defining the minimum values of each AQI range
#   plus one last value defining a value never attained by `values`.
# (all values in this parameter defining the AQI ranges are assumed integer values)
find_aqi_range_min_max <- function(values, aqi_range_breaks){
  aqi_range_groups = findInterval(values, aqi_range_breaks)
  return( list(min=aqi_range_breaks[aqi_range_groups],
               max=aqi_range_breaks[aqi_range_groups + 1] - 1))
}
# Run the variable transformation on the selected `_mean` columns
vars_mean = c("PM10_mean", "PM2.5_mean")
for (vmean in vars_mean) {
  vmin = sub("_mean$", "_min", vmean)
  vmax = sub("_mean$", "_max", vmean)
  vaqi = sub("_mean$", "_AQI", vmean)
  aqi_range_min_max = find_aqi_range_min_max(df[,vmean], aqi_range_breaks)
  df[,vaqi] = (aqi_range_min_max$max - aqi_range_min_max$min) /
              (df[,vmax] - df[,vmin]) / (df[,vmean] - df[,vmin]) +
              aqi_range_min_max$min
}
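Note how the findInterval() function is used to find the range in which each element of an array of values falls. This is the key to making the transformation work on whole data frame columns.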
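To make the findInterval() step more concrete, here is a small sketch that applies it to a handful of values and then indexes the breaks vector with the result. The variable names breaks, values, groups, range_min and range_max are illustrative only.

```r
breaks <- c(0, 51, 101, 251)      # same break points as aqi_range_breaks
values <- c(85.6, 105.0, 48.4)    # a few mean concentrations

# For each value, the index of the interval [breaks[i], breaks[i+1]) it falls in
groups <- findInterval(values, breaks)   # 2 3 1

range_min <- breaks[groups]              # 51 101 0
range_max <- breaks[groups + 1] - 1      # 100 250 50
```

Because findInterval() is vectorized, no explicit loop over the rows is needed, unlike the `a %in% i` test in the original functions, which only works for a single value at a time.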
The expected output of this process is:
  PM10_mean PM10_min PM10_max PM2.5_mean PM2.5_min PM2.5_max  PM10_AQI    PM2.5_AQI
1      85.6        3      264       75.7         3       240  51.00227 51.002843893
2     105.0        6      243       76.4         3       191 101.00635 51.003550930
3      95.8       19      287       48.4         8       134  51.00238  0.009822411
Please check the formula for computing the AQI, because it contains a syntax error (look for / *, which I replaced with / in the formula in my code).
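If the `/ *` in the question was actually meant to be a multiplication (this is only an assumption; the question does not say so), the formula becomes a linear interpolation within the AQI range. A minimal sketch for the PM10 columns, with variable names chosen here for illustration:

```r
aqi_range_breaks <- c(0, 51, 101, 251)
pm10_mean <- c(85.6, 105.0, 95.8)
pm10_min  <- c(3, 6, 19)
pm10_max  <- c(264, 243, 287)

# AQI range (lower and upper bound) for each mean value
grp    <- findInterval(pm10_mean, aqi_range_breaks)
aqi_lo <- aqi_range_breaks[grp]
aqi_hi <- aqi_range_breaks[grp + 1] - 1

# ASSUMPTION: `*` instead of the `/` used in the loop above
pm10_aqi_mult <- (aqi_hi - aqi_lo) / (pm10_max - pm10_min) *
                 (pm10_mean - pm10_min) + aqi_lo
```

With this variant the results differ from the output table above, so use whichever operator matches the definition of the index you are working from.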
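Note the use of $ in the regular expression passed to sub() to match the string "_mean": it ensures that "_mean" is replaced only when it occurs at the end of the variable name.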
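A quick way to see the effect of the $ anchor is to compare the anchored and unanchored patterns on a name where "_mean" is not at the end. The column name "PM10_mean_daily" is hypothetical and does not appear in the data above.

```r
anchored_end   <- sub("_mean$", "_min", "PM10_mean")        # "PM10_min"
anchored_mid   <- sub("_mean$", "_min", "PM10_mean_daily")  # unchanged, no match at the end
unanchored_mid <- sub("_mean",  "_min", "PM10_mean_daily")  # "PM10_min_daily"
```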