Reshape data in R?
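I have mean daily data for different sites, as shown in figure 1 in this folder.
However, I want to arrange the data as shown in figure 2 in the same folder.
I used this code to reshape the data, but the final values (reshpae_stage_R.csv) do not match the original values.
When I ran the code a second time, I got this error: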
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA.January’
Can you help me figure out why the final values do not match the original values?
Thanks in advance.
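The error text itself comes from `stats::reshape()`: in long mode it builds row names by pasting the `idvar` value onto a time label, so rows whose id is missing or repeated all end up with the same name (here `NA.January`). The sketch below reproduces the message on a toy data frame. The columns `id`, `January` and `February` are hypothetical and stand in for the real station data, and the call is only a guess at what the original code did.

```r
# Toy data: the id column is NA in both rows (hypothetical, for illustration only)
d <- data.frame(id = c(NA, NA), January = c(1.2, 3.4), February = c(5.6, 7.8))

# Capture the error message instead of stopping the script
msg <- tryCatch(
  suppressWarnings(
    reshape(d, direction = "long",
            varying = c("January", "February"), v.names = "value",
            timevar = "Month", times = c("January", "February"),
            idvar = "id")
  ),
  error = function(e) conditionMessage(e)
)
print(msg)
```

If this is what happened, checking `anyNA()` and `anyDuplicated()` on the id column before reshaping should show which rows collide.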
Update:
Thanks to @aelwan for spotting the bug. The updated code is below:
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
                             regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"),
                                  measure.vars = dateVars[-1],
                                  variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site,
                                   value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
                             as.Date(paste(Day, Month, Year, sep = "-"),
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
  reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
                 variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site)) +
  geom_line() + theme_bw() + facet_wrap(~ Site, scales = "free_y")
This results in the following figure:
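Comparing the updated code with the original answer, the substantive change appears to be how the site-year labels are expanded: `rep(..., each = 31)` instead of `rep.int(..., 31)`. A small sketch of the difference, using three hypothetical site-year indices and a block length of 2 rather than 31:

```r
idx <- 1:3                      # one index per site-year table (hypothetical)

cycled  <- rep.int(idx, 2)      # cycles through the whole vector: 1 2 3 1 2 3
blocked <- rep(idx, each = 2)   # repeats each element in place:   1 1 2 2 3 3

print(cycled)
print(blocked)
```

Because the code assumes 31 consecutive day rows per table, only the `each` form lines the labels up with the rows of the table they belong to. With the cycling form the values stay the same but get attached to the wrong site and year.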
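Original answer:
This example requires a fair amount of data-munging skill. You basically have to notice the repeating pattern in the data: the measurements are organized as day x month tables, one table per site-year.
Recipe:
Here is how to create the desired dataset: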
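1. Remove the redundant rows and columns from the data.
2. Use pattern matching (grep) to extract the rows that identify each table's year and site.
3. From those longer strings, use regular expressions (regexpr and regmatches) to extract the year and the site name.
4. Find the starting row index of the table for each site-year combination, and assign the site and year just extracted to all rows that belong to that site and year.
5. Now you can reshape the data into whatever shape you want. In the code below, the row identifiers are day, month and year, and the columns are sites.
6. Do some cleanup, and you are good to go.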
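Steps 3 and 4 can be sketched on their own as follows. The header strings below are made up for illustration (the real wording in stage.csv may differ), and the 31 rows per table is the same assumption the answer's code makes.

```r
# Hypothetical header lines; the real file's wording may differ
hdr <- c("Daily means Year 1990 Stage (mm) at Kumeti at Te Rehunga",
         "Daily means Year 1991 Stage (mm) at Oruakeretaki at S.H.2 Napier")

# Step 3: lookbehind patterns keep only what follows "Year " / "Stage (mm) at "
year <- regmatches(hdr, regexpr("(?<=Year\\s)[0-9]+", hdr, perl = TRUE))
site <- regmatches(hdr, regexpr("(?<=Stage\\s\\(mm\\)\\sat\\s)[A-Za-z\\s0-9.]+",
                                hdr, perl = TRUE))
siteYear <- data.frame(Site = site, Year = year)

# Step 4: one label row per day row, assuming 31 day rows per table
siteYearLong <- siteYear[rep(seq_len(nrow(siteYear)), each = 31), ]
print(siteYear)
```

Note that `regmatches()` silently drops elements that do not match, so it is worth checking that `length(year)` and `length(site)` both equal the number of header rows before binding the labels to the data.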
Code:
Here is the code for the recipe above:
library(ggplot2)
library(reshape2)
# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)
# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars
# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear, 1, drop = FALSE]
# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]
# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1],
                             regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))
# add the site and years
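# repeat each site-year label 31 times, once per day row of its table
# (rep.int(x, 31) would cycle through all site-years instead and mislabel the rows)
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]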
dfStageFinal = cbind(dfStage, dfSiteYearLong)
# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"),
                                  measure.vars = dateVars[-1],
                                  variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site, value.var = "value")
# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)
# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide,
                             as.Date(paste(Day, Month, Year, sep = "-"),
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]
# plot the values over time
dfStageFinalLong =
  reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
                 variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site)) +
  geom_line() + theme_bw() + facet_wrap(~ Site, scales = "free_y")
Output:
The output looks like this:
> head(dfStageFinalWide)
Day Month Year Kumeti at Te Rehunga Makakahi at Hamua Makuri at Tuscan Hills Manawatu at Hopelands Manawatu at Upper Gorge Manawatu at Weber Road Mangahao at Ballance
1 1 Jan 1990 454 NA 700 5133 NA NA NA
2 1 Jan 1991 1002 3643 1416 50 3597 1836 18160
3 1 Jan 1992 3490 34239 8922 3049 1221 417 NA
4 1 Jan 1993 404 NA 396 3408 NA 272 NA
5 1 Jan 1994 NA NA 3189 795 NA 2321 1889
6 1 Jan 1995 16548 1923 69862 4808 NA 6169 94
Mangapapa at Troup Rd Mangatainoka at Larsons Road Mangatainoka at Pahiatua Town Bridge Mangatainoka at Tararua Park Mangatoro at Mangahei Road Oruakeretaki at S.H.2 Napier
1 9406 2767 NA NA 6838 2831
2 4985 2479 823 1078 76 105
3 478 3665 1415 210 394 8247
4 6394 1298 NA 2668 3837 1878
5 14051 3561 NA 2645 807 NA
6 NA 1057 7029 4497 NA NA
Raparapawai at Jackson Rd Tamaki at Stephensons Tiraumea at Ngaturi
1 5189 50444 17951
2 345 416 3025
3 1364 5713 1710
4 3457 28078 8670
5 199 NA 292
6 NA NA 22774
There is also a figure that combines them all.
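The "remove the infeasible dates" step relies on `as.Date()` returning `NA` for day-month combinations that do not exist, which is how the padding cells of a 31-row table (such as 30 February) get dropped. A minimal sketch on a toy grid; setting the `LC_TIME` locale to `"C"` is an added assumption so that `%b` reads English month abbreviations:

```r
# %b parses abbreviated month names in the current locale; "C" gives English names
invisible(Sys.setlocale("LC_TIME", "C"))

# Toy day x month cells for February 1990 (not a leap year)
grid <- data.frame(Day = c(28, 29, 30, 31), Month = "Feb", Year = 1990)
grid$Date <- as.Date(paste(grid$Day, grid$Month, grid$Year, sep = "-"),
                     format = "%d-%b-%Y")

# Impossible dates come back as NA, so they can be filtered out
valid <- grid[!is.na(grid$Date), ]
print(valid)
```

Filtering on the date rather than on the measurement keeps genuinely missing readings (an `NA` value on a real date) in the data.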