Data Wrangling for Modeling in R
I have a dataset (the original prints as # A tibble: 33,478 x 12) in the form shown in the sample below. Here is part of the data:
dput(head(canals2, n=10))
structure(list(Site = c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16),
`Sample Date` = c("2/11/2004", "2/11/2004", "2/11/2004",
"2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004",
"2/11/2004", "2/11/2004"), `Analysis code` = c("NH3", "NH3",
"NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl a", "NH3",
"NH3"), Analysis = c("Ammonia-Nitrogen", "Ammonia-Nitrogen",
"Ammonia-Nitrogen", "Chlorophyll a", "Chlorophyll a", "Chlorophyll a",
"Ammonia-Nitrogen", "Chlorophyll a", "Ammonia-Nitrogen",
"Ammonia-Nitrogen"), Result = c(0.068, 0.07, 0.014, 1.31,
1.39, 1.95, 0.247, 1.46, 0.113, 0.17), Units = c("mg/L",
"mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3",
"mg/L", "mg/L")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
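I would like to fit a linear model (for example, with the lm() function) to predict "Chlorophyll a" from "Ammonia-Nitrogen". lm() takes column names as input in its formula, but this dataset is laid out very differently. I need to use the values in the Result column for each analysis, but I can't seem to find a good way to organize the data.
So far I have tried splitting the data by analysis, with the aim of creating a new data frame for each analysis and then renaming Result to the name of the analysis held in that data frame. Here is the function I used (I ran it on the main dataset, which is why it contains more analysis names):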
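analysis_list = unique(canals$Analysis)
> analysis_list
 [1] "Ammonia-Nitrogen"        "Chlorophyll a"           "Fecal Coliform"
 [4] "Conductivity"            "Copper"                  "Dissolved Oxygen"
 [7] "E. coli"                 "Enterococci"             "Nitrite + Nitrate"
[10] "Orthophosphate"          "pH"                      "Salinity"
[13] "Temperature"             "Total Kjeldahl Nitrogen" "Total Nitrogen"
[16] "Total Phosphorus"        "Turbidity"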
split_analyses <- function() {
  canals_byAnalysis <- vector(mode = "list", length = 0)
  for (i in 1:17) {
    analysis <- analysis_list[i]
    # keep only the rows for this analysis and drop the code column
    updated_analysis <- canals %>%
      subset(Analysis == analysis,
             select = -c(`Analysis code`))
    canals_byAnalysis[[i]] <- updated_analysis
  }
}
split_analyses()
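Unfortunately, this did not work as expected, and I ran into a lot of problems merging the tables I created. I have tried other approaches as well, but I am getting nowhere. Would anyone be willing to offer some advice?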
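If I understand correctly, it sounds like you are trying to restructure your data into a form suitable for modeling. I think pivot_wider (from tidyr) will get you what you want. Here is what I did:
First, here is your data as a data frame: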
library(dplyr)  # provides %>%
library(tidyr)  # provides pivot_wider()

Site <- c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16)
Sample_Date <- c("2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004",
                 "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004",
                 "2/11/2004", "2/11/2004")
Analysis_code <- c("NH3", "NH3", "NH3", "Chl a", "Chl a", "Chl a", "NH3",
                   "Chl a", "NH3", "NH3")
Analysis <- c("Ammonia-Nitrogen", "Ammonia-Nitrogen", "Ammonia-Nitrogen",
              "Chlorophyll a", "Chlorophyll a", "Chlorophyll a", "Ammonia-Nitrogen",
              "Chlorophyll a", "Ammonia-Nitrogen", "Ammonia-Nitrogen")
Results <- c(0.068, 0.07, 0.014, 1.31, 1.39, 1.95, 0.247, 1.46, 0.113, 0.17)
Units <- c("mg/L", "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3",
           "mg/L", "mg/L")

# combine the vectors into one data frame and show the first rows
df <- data.frame(Site, Sample_Date, Analysis_code, Analysis, Results, Units)
head(df, 5)
  Site Sample_Date Analysis_code         Analysis Results Units
1    1   2/11/2004           NH3 Ammonia-Nitrogen   0.068  mg/L
2    2   2/11/2004           NH3 Ammonia-Nitrogen   0.070  mg/L
3    4   2/11/2004           NH3 Ammonia-Nitrogen   0.014  mg/L
4   11   2/11/2004         Chl a    Chlorophyll a   1.310 mg/m3
5   10   2/11/2004         Chl a    Chlorophyll a   1.390 mg/m3
Next, we apply pivot_wider to spread the Analysis variable. This leaves you with one column per Analysis type, each holding its respective Results values.
# spread the Analysis variable
new_df <- df %>%
  pivot_wider(names_from = "Analysis", values_from = "Results")
   Site Sample_Date Analysis_code Units `Ammonia-Nitrogen` `Chlorophyll a`
  <dbl> <chr>       <chr>         <chr>              <dbl>           <dbl>
1     1 2/11/2004   NH3           mg/L               0.068           NA
2     2 2/11/2004   NH3           mg/L               0.07            NA
3     4 2/11/2004   NH3           mg/L               0.014           NA
4    11 2/11/2004   Chl a         mg/m3             NA                1.31
5    10 2/11/2004   Chl a         mg/m3             NA                1.39
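One caveat with the output above: Analysis_code and Units stay in the table as identifying columns, so each analysis lands on its own row with NA in the other column, and lm() would find no complete rows to fit. A minimal sketch of one way around this is to pass id_cols to pivot_wider. It assumes each Site and Sample_Date pair has at most one result per analysis. The toy data frame and its values are hypothetical, made up only so the example runs; in your real data the columns are called `Sample Date` and Result.

```r
library(tidyr)

# Hypothetical long-format sample: every site has BOTH analyses on the same date.
# The values are invented for illustration and are not from the canals dataset.
toy <- data.frame(
  Site        = rep(c(1, 2, 4, 11), each = 2),
  Sample_Date = "2/11/2004",
  Analysis    = rep(c("Ammonia-Nitrogen", "Chlorophyll a"), times = 4),
  Results     = c(0.068, 1.20, 0.070, 1.31, 0.014, 0.95, 0.247, 1.95),
  Units       = rep(c("mg/L", "mg/m3"), times = 4)
)

# id_cols restricts the row identifiers to Site and Sample_Date, so the
# per-analysis Units column no longer keeps the two analyses on separate rows.
wide <- pivot_wider(
  toy,
  id_cols     = c(Site, Sample_Date),
  names_from  = Analysis,
  values_from = Results
)

# Column names with spaces or hyphens need backticks inside the formula.
fit <- lm(`Chlorophyll a` ~ `Ammonia-Nitrogen`, data = wide)
print(coef(fit))
```

If the full dataset has repeated measurements for the same site, date, and analysis, pivot_wider will warn and return list-columns; supplying something like values_fn = mean is one option, though whether averaging is appropriate depends on your sampling design.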