In R, average row values until a specific condition is hit, then restart, with output in a new column

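I'm working with GPS data and trying to figure out how to average the 11th to 15th fixes for latitude and longitude. I've seen solutions in similar questions for averaging every n rows. The problem is that occasionally the satellites bomb out and the fixes stop at 13 or 14, so in those cases I only want to average 3 or 4 values instead of 5. What I'm looking for is the average of latitude and longitude starting where the number in series is 11 and continuing until the number in series drops again (or for as long as it keeps increasing? I need it to include the last group, which never drops back to a lower number). I started by deleting all rows where the number in series is outside the 11-15 range I want. For an example dummy dataset, that leaves me with: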

      Date      Time     Long       Lat     NoInSeries
12  17/11/2014 22:09:17 115.9508 -31.82850    11
13  17/11/2014 22:09:18 115.9508 -31.82846    12
14  17/11/2014 22:09:19 115.9513 -31.82864    13
15  17/11/2014 22:09:21 115.9511 -31.82863    14
26  18/11/2014 00:07:14 115.9509 -31.82829    11
27  18/11/2014 00:07:15 115.9509 -31.82829    12
28  18/11/2014 00:07:16 115.9509 -31.82830    13
29  18/11/2014 00:07:17 115.9509 -31.82830    14
30  18/11/2014 00:07:18 115.9509 -31.82831    15
56  18/11/2014 10:00:24 115.9513 -31.82670    11
57  18/11/2014 10:00:25 115.9514 -31.82670    12
58  18/11/2014 10:00:26 115.9514 -31.82669    13
59  18/11/2014 10:00:27 115.9514 -31.82668    14
60  18/11/2014 10:00:28 115.9514 -31.82668    15

The output I want looks like this, where the first group averages 4 values (11-14) and the next two average 5 values (11-15):

     Date      Time     Long       Lat     NoInSeries  AvgLong     Avg Lat
12  17/11/2014 22:09:17 115.9508 -31.82850    11       115.9510   -31.82856
13  17/11/2014 22:09:18 115.9508 -31.82846    12          NA          NA
14  17/11/2014 22:09:19 115.9513 -31.82864    13          NA          NA
15  17/11/2014 22:09:21 115.9511 -31.82863    14          NA          NA
26  18/11/2014 00:07:14 115.9509 -31.82829    11       115.9509   -31.82830
27  18/11/2014 00:07:15 115.9509 -31.82829    12          NA          NA
28  18/11/2014 00:07:16 115.9509 -31.82830    13          NA          NA
29  18/11/2014 00:07:17 115.9509 -31.82830    14          NA          NA
30  18/11/2014 00:07:18 115.9509 -31.82831    15          NA          NA
56  18/11/2014 10:00:24 115.9513 -31.82670    11       115.9514   -31.82669
57  18/11/2014 10:00:25 115.9514 -31.82670    12          NA          NA
58  18/11/2014 10:00:26 115.9514 -31.82669    13          NA          NA
59  18/11/2014 10:00:27 115.9514 -31.82668    14          NA          NA
60  18/11/2014 10:00:28 115.9514 -31.82668    15          NA          NA

I would then go through and delete all rows where AvgLong is NA, so my final output would contain only the averages, on the rows where the number in series is 11.

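I really don't know where to start with the code... The examples I've found all deal with averaging an exact number of rows, not a variable number.

For example: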

c( tapply( x, (row(x)-1)%/%5, mean ) )

Or:

idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)

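Also, the answers I've seen generally output the averages as a new data frame... Ultimately, I also want to do the averaging under a condition: if the schedule is 'Multifix', I want to average from 11 up to however high it goes (up to 15), whereas if the schedule is 'Continuous', I want to average from 181 up to however high each series goes. Something like this: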

if (import.list$Schedule == 'Multifix') {
  # ...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
  # ...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}

Or maybe I have an if/else statement that defines a variable, and then use that variable in the function that does the averaging?

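...but I figure that if the output creates a new data frame, this condition would complicate things, which is why my goal is just to add the values to new columns "AvgLong" and "AvgLat". Thanks for any help!!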

# The dput output below shows the data I worked with, based on your question.

dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L, 
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014", 
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18", 
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17", 
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509, 
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514, 
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864, 
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831, 
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L, 
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))

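# get.counter returns the length of each run of ascending values in a column; a run ends at the row where the next value is not larger than the current one.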

get.counter <- function(x){
  a1 = x
  counter = 0
  a2 = c()
  for( i in 1:length(a1)){  
    if(i < length(a1)){
      if(a1[i+1] > a1[i]){
        counter = counter + 1
      }else{
        counter = counter + 1
        a2 = c(a2, counter)
        counter = 0
      }
    }else{
      counter = counter + 1
      a2 = c(a2, counter)
    }
  }
  return(a2)
}
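
The same run lengths can probably be obtained without an explicit loop. The sketch below is an assumed vectorised equivalent, not part of the answer's code; `run_lengths` is a hypothetical name, and the input is assumed to be non-empty and free of NA values. It marks the last row of each run (where the next value is not larger) and takes the distances between those positions:

```r
# Hypothetical helper: lengths of consecutive runs of strictly increasing values.
# Assumes x is a numeric vector with at least one element and no NA values.
run_lengths <- function(x) {
  # TRUE at the last element of each run; the final element always ends a run
  run_end <- c(diff(x) <= 0, TRUE)
  # distance between successive run ends = length of each run
  diff(c(0, which(run_end)))
}

no_in_series <- c(11, 12, 13, 14, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15)
run_lengths(no_in_series)
# 4 5 5
```

For the sample NoInSeries column this gives 4, 5 and 5, the sizes of the three groups of fixes.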

# avg.seg.col returns a data frame containing the segment-wise mean of a column. df1 is the input data frame, colvar is the column name (e.g. Long or Lat), and get_counter is the output of get.counter.

avg.seg.col <- function(df1, colvar, get_counter){ 

  long <- c()

  start = 1

  for(i in cumsum(get_counter)){
    end = i
    b1 = subset(df1, select = colvar)[start:end,]

    mean_b1 = mean(b1)

    long = c(long, mean_b1, rep(NA, (length(b1)-1)))

    start = end+1
  }
  return(data.frame(long, stringsAsFactors = FALSE))
}

# Read the data from a text file with read.table. Make sure the file exists in the current working directory, which can be set with setwd("path of current working directory").

df1 <- read.table(file = "file1.txt", 
                  header = TRUE, 
                  sep = "\t", 
                  stringsAsFactors = FALSE)

# Apply get.counter to the vector df1$NoInSeries

get_counter <- get.counter(df1$NoInSeries)

# Apply avg.seg.col to the Long column

AvgLong <- avg.seg.col(df1, "Long", get_counter)

# Apply avg.seg.col to the Lat column

AvgLat <- avg.seg.col(df1, "Lat", get_counter)

# Combine the data frames column-wise

df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))

# Assign the column names

colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")

Output:

     print(df2)
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA        NA
3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA        NA
4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA        NA
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA        NA
7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA        NA
8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA        NA
9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA        NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA        NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA        NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA        NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA        NA

# After removing the rows with NA, the output looks like this

df2[!is.na(df2$AvgLong), ]
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669

It seems that aggregate does most of the work:

> aggregate(df1[ ,c("ID", "Long","Lat")], list( (df1$ID-1) %/% 5), mean)
  Group.1   ID     Long       Lat
1       2 13.5 115.9510 -31.82856
2       5 28.0 115.9509 -31.82830
3      11 58.0 115.9514 -31.82669

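The ID variable has to be shifted by 1 so that the integer division (%/%) yields the groups you want. If you want something that stays aligned with the original data, that is what the ave function is designed to provide: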

> df1$aveLong <- ave( df1$Long, (df1$ID-1) %/% 5, 
          FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1$aveLLat <- ave( df1$Lat, (df1$ID-1) %/% 5, 
          FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1
   ID       Date     Time     Long       Lat NoInSeries  aveLong
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510
2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA
3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA
4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509
6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA
7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA
8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA
9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514
11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA
     aveLLat
1  -31.82856
2         NA
3         NA
4         NA
5  -31.82830
6         NA
7         NA
8         NA
9         NA
10 -31.82669
11        NA
12        NA
13        NA
14        NA
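
Note that the grouping `(df1$ID-1) %/% 5` works here because the row IDs in this sample happen to fall into separate blocks of five. As a sketch, under the assumption that a new series starts wherever NoInSeries drops, the same ave pattern could be driven by a group index built from NoInSeries instead. The data frame `gps`, the column `grp` and the helper `first_mean` are illustrative names, and only the first two series of the sample data are used:

```r
# Shortened copy of the sample data (first two series only)
gps <- data.frame(
  Long = c(115.9508, 115.9508, 115.9513, 115.9511,
           115.9509, 115.9509, 115.9509, 115.9509, 115.9509),
  Lat = c(-31.82850, -31.82846, -31.82864, -31.82863,
          -31.82829, -31.82829, -31.82830, -31.82830, -31.82831),
  NoInSeries = c(11L, 12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L)
)

# Group index: increases by one each time NoInSeries drops
gps$grp <- cumsum(c(0, diff(gps$NoInSeries) < 0))

# Mean on the first row of each group, NA on the remaining rows
first_mean <- function(v) c(mean(v), rep(NA, length(v) - 1))

gps$AvgLong <- ave(gps$Long, gps$grp, FUN = first_mean)
gps$AvgLat  <- ave(gps$Lat,  gps$grp, FUN = first_mean)

# Keep only the rows that carry an average
gps[!is.na(gps$AvgLong), ]
```

This keeps the averages in new columns of the original data frame, so it does not depend on every group having exactly five rows.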

You can use cumsum, diff, aggregate and merge.

x
##          Date     Time     Long       Lat NoInSeries SeriesNo
## 1  17/11/2014 22:09:17 115.9508 -31.82850         11        0
## 2  17/11/2014 22:09:18 115.9508 -31.82846         12        0
## 3  17/11/2014 22:09:19 115.9513 -31.82864         13        0
## 4  17/11/2014 22:09:21 115.9511 -31.82863         14        0
## 5  18/11/2014 00:07:14 115.9509 -31.82829         11        1
## 6  18/11/2014 00:07:15 115.9509 -31.82829         12        1
## 7  18/11/2014 00:07:16 115.9509 -31.82830         13        1
## 8  18/11/2014 00:07:17 115.9509 -31.82830         14        1
## 9  18/11/2014 00:07:18 115.9509 -31.82831         15        1
## 10 18/11/2014 10:00:24 115.9513 -31.82670         11        2
## 11 18/11/2014 10:00:25 115.9514 -31.82670         12        2
## 12 18/11/2014 10:00:26 115.9514 -31.82669         13        2
## 13 18/11/2014 10:00:27 115.9514 -31.82668         14        2
## 14 18/11/2014 10:00:28 115.9514 -31.82668         15        2

cumsum(c(0, diff(x$NoInSeries) < 0)) gives you a new column that increments every time the diff of NoInSeries is negative.

# Define a new variable which increments after every drop in NoInSeries
x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))
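
To see why this yields a series index, it may help to look at the intermediate values on a short vector. A minimal sketch (the vector `n` is made up for illustration, with the first series stopping early at 13):

```r
# Illustrative vector: two series, the first one stopping early at 13
n <- c(11, 12, 13, 11, 12, 13, 14, 15)

step <- diff(n)                  # change from each row to the next: 1 1 -2 1 1 1 1
dropped <- step < 0              # TRUE only where the counter fell back
series <- cumsum(c(0, dropped))  # leading 0 so the first row belongs to series 0

series
# 0 0 0 1 1 1 1 1
```

The last series needs no special handling: it simply keeps the index it was given at its first row.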

Now you can aggregate using the new SeriesNo:

# Breakdown ...  First aggregate Long, Lat by Series No with Function mean
aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
##   SeriesNo     Long       Lat
## 1        0 115.9510 -31.82856
## 2        1 115.9509 -31.82830
## 3        2 115.9514 -31.82669



# merge it back with original data with only rows where NoInSeries = 11

# Final Desired Result in one line
merge(x[x$NoInSeries == 11, c("Date", "Time", "SeriesNo")], aggregate(cbind(Long, 
    Lat) ~ SeriesNo, data = x, FUN = mean))
##   SeriesNo       Date     Time     Long       Lat
## 1        0 17/11/2014 22:09:17 115.9510 -31.82856
## 2        1 18/11/2014 00:07:14 115.9509 -31.82830
## 3        2 18/11/2014 10:00:24 115.9514 -31.82669
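
The same steps might also cover the Multifix/Continuous condition from the question by making the first series number a parameter. This is only a sketch: `avg_by_series` and `first_no` are hypothetical names, the input is assumed to have the columns Date, Time, Long, Lat and NoInSeries in time order with at least one row left after filtering, and the start value for the 'Continuous' schedule is deliberately left for the caller to supply. The first row of the example data (NoInSeries 10) is made up to show the filtering.

```r
# Hypothetical wrapper around the cumsum/diff/aggregate/merge steps.
# first_no: lowest NoInSeries to include (11 for 'Multifix' in the question).
avg_by_series <- function(x, first_no) {
  x <- x[x$NoInSeries >= first_no, ]                  # drop fixes before the start
  x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))  # new index after every drop
  means <- aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
  # take Date/Time from the first remaining row of each series
  firsts <- x[!duplicated(x$SeriesNo), c("SeriesNo", "Date", "Time")]
  merge(firsts, means, by = "SeriesNo")
}

x <- data.frame(
  Date = c("17/11/2014", "17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014"),
  Time = c("22:09:16", "22:09:17", "22:09:18", "00:07:14", "00:07:15"),
  Long = c(115.9500, 115.9508, 115.9508, 115.9509, 115.9509),
  Lat  = c(-31.82800, -31.82850, -31.82846, -31.82829, -31.82829),
  NoInSeries = c(10L, 11L, 12L, 11L, 12L),
  stringsAsFactors = FALSE
)

schedule <- "Multifix"  # stands in for import.list$Schedule
first_no <- switch(schedule,
                   Multifix = 11,
                   stop("no start value defined for schedule: ", schedule))
avg_by_series(x, first_no)
```

Taking the first row of each series with `!duplicated` rather than `NoInSeries == 11` is a design choice that keeps the merge working for any start value.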

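I've read that for loops are inefficient for iterative operations, which is why I like Chinmay's use of cumsum and diff. I don't have enough reputation to comment on @Chinmay Patil's elegant answer, so here is a slightly different approach.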

df$group <- 0     #Create a dummy grouping variable

for(i in 2:length(df$NoInSeries)) {        #Starting on row 2 to the end
  #Check if the series resets (True = 1, False = 0)
  check <- df[i-1, "NoInSeries"] > df[i, "NoInSeries"]  
  df[i, "group"] <- df[i-1, "group"] + check    #Add check value to previous row
}     #This yields a number for each series

require(plyr)
ddply(df, .(group), summarise, 
    Date= min(Date), Time=min(Time), Long=mean(Long), Lat= mean(Lat))

#  group       Date     Time     Long       Lat
#1     0 17/11/2014 22:09:17 115.9510 -31.82856
#2     1 18/11/2014 00:07:14 115.9509 -31.82830
#3     2 18/11/2014 10:00:24 115.9514 -31.82669

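You can report the Lat/Lon against the first time (min, as above), the last time (max), or the mean time (mean). However, ddply sometimes runs into problems when I have POSIXct dates/times in the data frame.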
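
One thing to watch in the summary above: if Date and Time are character columns, as in the dput output, min compares them as text rather than as times. A sketch of a base R alternative, assuming the rows are already in time order and the timestamps can be treated as UTC (the names `when`, `first_row` and `res` are illustrative, and only part of the sample data is used):

```r
df <- data.frame(
  Date = c("17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014"),
  Time = c("22:09:17", "22:09:18", "00:07:14", "00:07:15", "00:07:16"),
  Long = c(115.9508, 115.9508, 115.9509, 115.9509, 115.9509),
  Lat  = c(-31.82850, -31.82846, -31.82829, -31.82829, -31.82830),
  NoInSeries = c(11L, 12L, 11L, 12L, 13L),
  stringsAsFactors = FALSE
)

# Combine the day/month/year date and the time into one POSIXct column
df$when <- as.POSIXct(paste(df$Date, df$Time),
                      format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Same grouping idea as above, without the loop
df$group <- cumsum(c(0, diff(df$NoInSeries) < 0))

# First row of each group carries the earliest time when rows are time-ordered
first_row <- !duplicated(df$group)

res <- data.frame(
  group = df$group[first_row],
  when  = df$when[first_row],
  Long  = as.vector(tapply(df$Long, df$group, mean)),
  Lat   = as.vector(tapply(df$Lat, df$group, mean))
)
res
```

Because the time is picked by position instead of being summarised, the POSIXct column never passes through ddply.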