Reshape data in R?

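I have mean daily data for different sites, as shown in figure 1 in this folder.

However, I would like to arrange the data in the layout shown in figure 2 in the same folder.

I used this code to reshape the data, but the final values (reshpae_stage_R.csv) do not match the original values.

When I ran the code a second time, I got this error: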

Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA.January’ 

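Could you help me work out why the final values do not match the original values?

Thanks in advance

Update:

Thanks to @aelwan for spotting the bug. The updated code is as follows: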

library(ggplot2)
library(reshape2)

# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)

# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars

# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear,  1, drop = FALSE]

# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]

# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1], 
           regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))

# add the site and years
dfSiteYearLong = dfSiteYear[rep(1:dim(dfSiteYear)[1], each = 31), c("Site", "Year")]
dfStageFinal = cbind(dfStage, dfSiteYearLong)

# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), 
                                  measure.vars = dateVars[-1],
                        variable.name = "Month")
dfStageFinalWide = reshape2::dcast(dfStageFinalLong, Day + Month + Year ~ Site, 
                                   value.var = "value")

# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)

# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide, 
                             as.Date(paste(Day, Month, Year, sep = "-"), 
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]

# plot the values over time
dfStageFinalLong = 
  reshape2::melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
       variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
  geom_line() + theme_bw() + facet_wrap(~ Site, scales = "free_y")

The updated code produces the following figure:


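Original answer:

This example calls for a fair amount of data wrangling. Essentially, you have to notice the repeating pattern in the data: the measurements are organized as day x month tables, one table per site-year.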
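
One consequence of that layout is worth spelling out: every table has 31 day rows under all twelve month columns, so cells such as 31 April exist even though they are not real dates. Here is a minimal sketch of how those can be detected. The date strings are made up, and it assumes a locale with English month abbreviations, since %b is locale dependent:

```r
# Made-up strings in the same shape as paste(Day, Month, Year, sep = "-")
x <- c("28-Feb-1990", "29-Feb-1990", "31-Apr-1990", "29-Feb-1992")

# With an explicit format, as.Date() gives NA for combinations
# that are not real calendar dates
d <- as.Date(x, format = "%d-%b-%Y")

infeasible <- is.na(d)
x[infeasible]   # the strings that did not parse
d[!infeasible]  # the dates that survive the filter
```

Filtering on is.na(Date) removes only those impossible day-month cells. Real dates with missing measurements stay in the table as NA values.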


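Recipe:

Here is how to create the desired data set:
1. Remove the redundant rows and columns from the data.
2. Use pattern matching (grep) to extract the rows that identify each table's year and site.
3. From those longer strings, extract the year and the site name using regular expressions (regexpr, regmatches).
4. Find the starting row index of the table for each site-year combination, and assign the site-year names just extracted to all rows that belong to that site and year.
5. Now you can reshape the data into whatever shape you want. In the code below, the row identifiers are day, month and year, and the columns are the sites.
6. A bit of cleanup, and you are good to go.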
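
Steps 2 and 3 can be sketched on a couple of made-up header lines. The header wording below is an assumption for illustration only (the real file may be laid out differently); the two patterns are the ones this recipe uses.

```r
# Hypothetical first-column values: two table headers mixed with day rows
rows <- c(
  "  Daily means  Year 1990  Stage (mm) at Manawatu at Hopelands",
  "1", "2",
  "Daily means  Year 1991  Stage (mm) at Oruakeretaki at S.H.2 Napier",
  "1"
)

# Step 2: flag the rows that announce a new site-year table
is_header <- grepl("^Daily means", trimws(rows))
headers <- rows[is_header]

# Step 3: a lookbehind keeps only the text that follows the marker
year <- regmatches(headers,
                   regexpr("(?<=Year\\s)([0-9]+)", headers, perl = TRUE))
site <- regmatches(headers,
                   regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)",
                           headers, perl = TRUE))

data.frame(Site = site, Year = year)
```

One thing to watch: regmatches() silently drops elements that do not match, so if a header is worded differently the result will be shorter than the number of headers. Comparing length(site) with sum(is_header) is a cheap check.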


Code:

Here is the code for the recipe above:

library(ggplot2)
library(reshape2)

# read in the data
dfStage = read.csv("reshapeR/Data/stage.csv", header = FALSE, stringsAsFactors = FALSE)

# remove the rows which are min, max, mean & redundant columns
condMMM = stringr::str_trim(dfStage[, 1]) %in% c("Min", "Max", "Mean", "Day")
dfStage = dfStage[!condMMM, 1:13]
dateVars = c("Day", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
colnames(dfStage) = dateVars

# get indices & names of year site combinations
condlSiteYear = grepl("^Daily means", stringr::str_trim(dfStage[, 1]))
condiSiteYear = grep("^Daily means", stringr::str_trim(dfStage[, 1]))
dfSiteYear = dfStage[condlSiteYear,  1, drop = FALSE]

# remove site-year rows from data
dfStage = dfStage[!condlSiteYear, ]

# get the list of sites and years
dfSiteYear$Year = regmatches(dfSiteYear[, 1], regexpr("(?<=Year\\s)([0-9]+)", dfSiteYear[, 1], perl = TRUE))
dfSiteYear$Site = regmatches(dfSiteYear[, 1], 
           regexpr("(?<=(Stage\\s\\(mm\\)\\sat\\s))([A-Za-z\\s0-9\\.]+)", dfSiteYear[, 1], perl = TRUE))

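# add the site and years
# note: rep.int(x, 31) cycles through all site-years 31 times rather than
# repeating each one 31 times; the updated code above uses rep(..., each = 31)
dfSiteYearLong = dfSiteYear[rep.int(1:dim(dfSiteYear)[1], 31), c("Site", "Year")]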
dfStageFinal = cbind(dfStage, dfSiteYearLong)

# reshape
dfStageFinalLong = reshape2::melt(dfStageFinal, id.vars = c("Day", "Site", "Year"), measure.vars = dateVars[-1],
                        variable.name = "Month")
dfStageFinalWide = dcast(dfStageFinalLong, Day + Month + Year ~ Site, value.var = "value")

# cleanup
dfStageFinalWide[, -c(1:3)] = lapply(dfStageFinalWide[, -c(1:3)], as.numeric)

# create a date variable
dfStageFinalWide$Date = with(dfStageFinalWide, 
                             as.Date(paste(Day, Month, Year, sep = "-"), 
                                     format = "%d-%b-%Y"))
# remove the infeasible dates
dfStageFinalWide = dfStageFinalWide[!is.na(dfStageFinalWide$Date), ]
dfStageFinalWide = dfStageFinalWide[order(dfStageFinalWide$Date), ]

# plot the values over time
dfStageFinalLong = 
  melt(dfStageFinalWide, id.vars = "Date", measure.vars = unique(dfSiteYear$Site),
       variable.name = "Site")
ggplot(dfStageFinalLong, aes(x = Date, y = value, color = Site))+
  geom_line() + theme_bw() + facet_wrap(~ Site, scales = "free_y")
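
The row index in the "add the site and years" step decides which label lands on which data row, so it is worth checking on a toy case. Here is a small sketch with invented labels and 3 rows per table instead of 31:

```r
# Invented site-year labels, one row per table
labels <- data.frame(Site = c("A", "B"), Year = c("1990", "1990"),
                     stringsAsFactors = FALSE)
n_days <- 3  # stands in for the 31 day rows of each table

# Cycles through the tables: 1 2 1 2 1 2
idx_cycled <- rep.int(seq_len(nrow(labels)), n_days)

# Repeats each table in a block: 1 1 1 2 2 2
idx_blocked <- rep(seq_len(nrow(labels)), each = n_days)

labels[idx_cycled, "Site"]
labels[idx_blocked, "Site"]
```

Because the day rows are stacked table by table, the blocked index (each =) is the one that lines up with them. With the cycled index, cbind() still runs without complaint but attaches the wrong site and year to most rows, which is the kind of mismatch with the original values described in the question.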

Output:

The output looks like this:

> head(dfStageFinalWide)
  Day Month Year Kumeti at Te Rehunga Makakahi at Hamua Makuri at Tuscan Hills Manawatu at Hopelands Manawatu at Upper Gorge Manawatu at Weber Road Mangahao at Ballance
1   1   Jan 1990                  454                NA                    700                  5133                      NA                     NA                   NA
2   1   Jan 1991                 1002              3643                   1416                    50                    3597                   1836                18160
3   1   Jan 1992                 3490             34239                   8922                  3049                    1221                    417                   NA
4   1   Jan 1993                  404                NA                    396                  3408                      NA                    272                   NA
5   1   Jan 1994                   NA                NA                   3189                   795                      NA                   2321                 1889
6   1   Jan 1995                16548              1923                  69862                  4808                      NA                   6169                   94
  Mangapapa at Troup Rd Mangatainoka at Larsons Road Mangatainoka at Pahiatua Town Bridge Mangatainoka at Tararua Park Mangatoro at Mangahei Road Oruakeretaki at S.H.2 Napier
1                  9406                         2767                                   NA                           NA                       6838                         2831
2                  4985                         2479                                  823                         1078                         76                          105
3                   478                         3665                                 1415                          210                        394                         8247
4                  6394                         1298                                   NA                         2668                       3837                         1878
5                 14051                         3561                                   NA                         2645                        807                           NA
6                    NA                         1057                                 7029                         4497                         NA                           NA
  Raparapawai at Jackson Rd Tamaki at Stephensons Tiraumea at Ngaturi
1                      5189                 50444               17951
2                       345                   416                3025
3                      1364                  5713                1710
4                      3457                 28078                8670
5                       199                    NA                 292
6                        NA                    NA               22774

There is also a plot that brings them all together.