In R, average row values until a specific condition is hit, then restart, with output in a new column
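I'm working with GPS data and trying to figure out how to average the 11th-15th fixes for both latitude and longitude. I've seen solutions in similar questions for how to average every n rows. The problem is that the satellites occasionally drop out and the fixes stop at 13 or 14. In those cases I only want to average 3 or 4 values instead of 5. So I'm looking to average latitude and longitude starting from where the number in series is 11 until the number in series drops down again (or for as long as it keeps increasing? I need it to include the last set, which won't drop down to a lower number again). I started by deleting all the rows where the number in series isn't in my desired 11-15 range. For an example dummy dataset, this leaves me with: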
Date Time Long Lat NoInSeries
12 17/11/2014 22:09:17 115.9508 -31.82850 11
13 17/11/2014 22:09:18 115.9508 -31.82846 12
14 17/11/2014 22:09:19 115.9513 -31.82864 13
15 17/11/2014 22:09:21 115.9511 -31.82863 14
26 18/11/2014 00:07:14 115.9509 -31.82829 11
27 18/11/2014 00:07:15 115.9509 -31.82829 12
28 18/11/2014 00:07:16 115.9509 -31.82830 13
29 18/11/2014 00:07:17 115.9509 -31.82830 14
30 18/11/2014 00:07:18 115.9509 -31.82831 15
56 18/11/2014 10:00:24 115.9513 -31.82670 11
57 18/11/2014 10:00:25 115.9514 -31.82670 12
58 18/11/2014 10:00:26 115.9514 -31.82669 13
59 18/11/2014 10:00:27 115.9514 -31.82668 14
60 18/11/2014 10:00:28 115.9514 -31.82668 15
The output I want looks like this, where the first average is over 4 fixes (11-14) and the next two are over 5 fixes (11-15):
Date Time Long Lat NoInSeries AvgLong AvgLat
12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
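I would then go through and delete all rows where AvgLong is NA, so my final output would contain only the averages, on the rows where number in series = 11.
I honestly don't know where to start with the code... the examples I have found all discuss averaging an exact number of rows rather than a variable number.
For example: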
c( tapply( x, (row(x)-1)%/%5, mean ) )
Or:
idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)
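Also, the answers I have seen generally output the averages as a new data frame... Eventually I also want to do the averaging under a condition: if the schedule is 'Multifix', I want to average from 11 up to however high it goes (up to 15), whereas if the schedule is 'Continuous', I want to average from 181 up to however high each series goes. Something like this: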
if(import.list$Schedule=='Multifix'){
...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}
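Or maybe I could have an if/else statement that defines a variable, and then use that variable in the function that does the averaging?
...But I imagine this condition complicates things if the output creates a new data frame, which is why I was aiming to just add the values to new columns, "AvgLong" and "AvgLat". Thanks for any help!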
# The dput() output below shows the data I worked with, based on your question.
dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L,
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014",
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18",
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17",
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509,
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514,
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864,
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831,
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L,
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))
# get.counter returns the length of each run of increasing values; a run ends at the row where the next value stops increasing.
get.counter <- function(x){
a1 = x
counter = 0
a2 = c()
for( i in 1:length(a1)){
if(i < length(a1)){
if(a1[i+1] > a1[i]){
counter = counter + 1
}else{
counter = counter + 1
a2 = c(a2, counter)
counter = 0
}
}else{
counter = counter + 1
a2 = c(a2, counter)
}
}
return(a2)
}
# avg.seg.col returns a data frame holding the segment-wise mean of one column. df1 is the input data frame, colvar is the column name (e.g. "Long" or "Lat"), and get_counter is the output of get.counter.
avg.seg.col <- function(df1, colvar, get_counter){
long <- c()
start = 1
for(i in cumsum(get_counter)){
end = i
b1 = subset(df1, select = colvar)[start:end,]
mean_b1 = mean(b1)
long = c(long, mean_b1, rep(NA, (length(b1)-1)))
start = end+1
}
return(data.frame(long, stringsAsFactors = FALSE))
}
# Read the data from a text file with read.table. Make sure the file exists in the current working directory, which can be set with setwd("path of current working directory").
df1 <- read.table(file = "file1.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# Apply get.counter to the vector df1$NoInSeries
get_counter <- get.counter(df1$NoInSeries)
# Apply avg.seg.col to the Long column
AvgLong <- avg.seg.col(df1, "Long", get_counter)
# Apply avg.seg.col to the Lat column
AvgLat <- avg.seg.col(df1, "Lat", get_counter)
# Combine the data frames column-wise
df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))
# Assign column names
colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")
Output:
print(df2)
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
2 13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
3 14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
4 15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
6 27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
7 28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
8 29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
9 30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
# After removing the rows with NA, the output looks like this
df2[!is.na(df2$AvgLong), ]
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
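As a side note, the run lengths that get.counter builds with a loop can probably also be obtained without one. The sketch below is only an illustration, not part of the answer above: run_lengths is a hypothetical helper name, and it assumes, like get.counter, that a run ends wherever the next value is not greater than the current one.

```r
# Hypothetical vectorised counterpart to get.counter (illustrative name).
run_lengths <- function(x) {
  # Positions where a run ends: the next value is not larger,
  # plus the final element, which always closes the last run.
  ends <- c(which(diff(x) <= 0), length(x))
  # Differences between consecutive end positions are the run lengths.
  diff(c(0, ends))
}

lens <- run_lengths(c(11:14, 11:15, 11:15))
print(lens)
```

For the NoInSeries column of the example data this should give 4, 5 and 5, which is what cumsum(get_counter) relies on in avg.seg.col.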
It seems that aggregate does most of the work:
> aggregate(df1[ ,c("ID", "Long","Lat")], list( (df1$ID-1) %/% 5), mean)
Group.1 ID Long Lat
1 2 13.5 115.9510 -31.82856
2 5 28.0 115.9509 -31.82830
3 11 58.0 115.9514 -31.82669
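I needed to shift the ID variable by 1 so that the integer division gives the groups you wanted. If you want something that stays aligned with the original data, that is what the ave function is designed to provide: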
> df1$aveLong <- ave( df1$Long, (df1$ID-1) %/% 5,
FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1$aveLLat <- ave( df1$Lat, (df1$ID-1) %/% 5,
FUN=function(x) c( mean(x), rep(NA, length(x)-1) ) )
> df1
ID Date Time Long Lat NoInSeries aveLong
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510
2 13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA
3 14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA
4 15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509
6 27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA
7 28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA
8 29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA
9 30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514
11 57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA
aveLLat
1 -31.82856
2 NA
3 NA
4 NA
5 -31.82830
6 NA
7 NA
8 NA
9 NA
10 -31.82669
11 NA
12 NA
13 NA
14 NA
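One caveat with the grouping above: (ID-1) %/% 5 works because of how the ID values happen to fall in this example. A hedged variant, assuming the rows are in time order and that a new series starts whenever NoInSeries drops, is to pass ave a grouping built from NoInSeries itself. The names grp and first_only below are illustrative, and the data is a shortened copy of the example.

```r
# Shortened copy of the example data (first two series only).
df1 <- data.frame(
  Long = c(115.9508, 115.9508, 115.9513, 115.9511,
           115.9509, 115.9509, 115.9509, 115.9509, 115.9509),
  Lat = c(-31.82850, -31.82846, -31.82864, -31.82863,
          -31.82829, -31.82829, -31.82830, -31.82830, -31.82831),
  NoInSeries = c(11:14, 11:15)
)

# Group id that increases by one each time NoInSeries drops.
grp <- cumsum(c(0, diff(df1$NoInSeries) < 0))

# Put the group mean on the first row of each group and NA elsewhere.
first_only <- function(v) c(mean(v), rep(NA, length(v) - 1))

df1$AvgLong <- ave(df1$Long, grp, FUN = first_only)
df1$AvgLat <- ave(df1$Lat, grp, FUN = first_only)
print(df1)
```

This keeps the averages in new columns of the original data frame, so rows with NA in AvgLong can be dropped afterwards.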
You can use cumsum, diff, aggregate and merge:
x
## Date Time Long Lat NoInSeries SeriesNo
## 1 17/11/2014 22:09:17 115.9508 -31.82850 11 0
## 2 17/11/2014 22:09:18 115.9508 -31.82846 12 0
## 3 17/11/2014 22:09:19 115.9513 -31.82864 13 0
## 4 17/11/2014 22:09:21 115.9511 -31.82863 14 0
## 5 18/11/2014 00:07:14 115.9509 -31.82829 11 1
## 6 18/11/2014 00:07:15 115.9509 -31.82829 12 1
## 7 18/11/2014 00:07:16 115.9509 -31.82830 13 1
## 8 18/11/2014 00:07:17 115.9509 -31.82830 14 1
## 9 18/11/2014 00:07:18 115.9509 -31.82831 15 1
## 10 18/11/2014 10:00:24 115.9513 -31.82670 11 2
## 11 18/11/2014 10:00:25 115.9514 -31.82670 12 2
## 12 18/11/2014 10:00:26 115.9514 -31.82669 13 2
## 13 18/11/2014 10:00:27 115.9514 -31.82668 14 2
## 14 18/11/2014 10:00:28 115.9514 -31.82668 15 2
cumsum(c(0, diff(x$NoInSeries) < 0)) gives you a new column that increments every time the diff of NoInSeries is negative.
# Define a new variable which increments after every drop in NoInSeries
x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))
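To see why this works, it may help to look at the intermediate steps on a small made-up vector (the values of s below are illustrative, not taken from the data):

```r
# Toy series numbers: three runs, the first and last cut short.
s <- c(11, 12, 13, 11, 12, 13, 14, 15, 11, 12)

d <- diff(s)                # change from each element to the next
drops <- d < 0              # TRUE where the series number falls
grp <- cumsum(c(0, drops))  # running count of drops = group id

print(d)      # 1 1 -2 1 1 1 1 -4 1
print(drops)  # FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
print(grp)    # 0 0 0 1 1 1 1 1 2 2
```

The leading 0 is there because diff returns one element fewer than its input, so the first row needs a starting group of its own.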
Now aggregate using the new SeriesNo column:
# Breakdown ... First aggregate Long, Lat by Series No with Function mean
aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
## SeriesNo Long Lat
## 1 0 115.9510 -31.82856
## 2 1 115.9509 -31.82830
## 3 2 115.9514 -31.82669
# merge it back with original data with only rows where NoInSeries = 11
# Final Desired Result in one line
merge(x[x$NoInSeries == 11, c("Date", "Time", "SeriesNo")], aggregate(cbind(Long,
Lat) ~ SeriesNo, data = x, FUN = mean))
## SeriesNo Date Time Long Lat
## 1 0 17/11/2014 22:09:17 115.9510 -31.82856
## 2 1 18/11/2014 00:07:14 115.9509 -31.82830
## 3 2 18/11/2014 10:00:24 115.9514 -31.82669
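For the 'Multifix' / 'Continuous' condition mentioned in the question, one possible approach is to choose the starting series number first and pass it to a small function. This is only a sketch under assumptions: average_series, start and schedule are hypothetical names, the data is a shortened copy of the example, and the 241 used for 'Continuous' is a placeholder, since the question mentions both 181 and 241.

```r
# Hypothetical helper: average Long/Lat per series, from `start` upwards.
average_series <- function(x, start) {
  x <- x[x$NoInSeries >= start, ]
  stopifnot(nrow(x) > 0)  # nothing to average otherwise
  # New series whenever NoInSeries drops.
  x$SeriesNo <- cumsum(c(0, diff(x$NoInSeries) < 0))
  means <- aggregate(cbind(Long, Lat) ~ SeriesNo, data = x, FUN = mean)
  names(means)[2:3] <- c("AvgLong", "AvgLat")
  # Keep Date/Time of the first row in each series.
  firsts <- x[!duplicated(x$SeriesNo), c("SeriesNo", "Date", "Time")]
  merge(firsts, means)
}

x <- data.frame(
  Date = c(rep("17/11/2014", 4), rep("18/11/2014", 5)),
  Time = c("22:09:17", "22:09:18", "22:09:19", "22:09:21",
           "00:07:14", "00:07:15", "00:07:16", "00:07:17", "00:07:18"),
  Long = c(115.9508, 115.9508, 115.9513, 115.9511,
           115.9509, 115.9509, 115.9509, 115.9509, 115.9509),
  Lat = c(-31.82850, -31.82846, -31.82864, -31.82863,
          -31.82829, -31.82829, -31.82830, -31.82830, -31.82831),
  NoInSeries = c(11:14, 11:15),
  stringsAsFactors = FALSE
)

schedule <- "Multifix"  # assumed to be a single string
start <- if (schedule == "Multifix") 11 else 241  # 241 is a placeholder
res <- average_series(x, start)
print(res)
```

Because the if/else only picks a number, the averaging code itself stays the same for both schedules.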
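I have read that for loops are not necessary for this kind of iterative operation, which is why I like Chinmay's use of cumsum and diff. I don't have enough reputation to comment on @Chinmay Patil's elegant answer, so here is a slightly different approach.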
df$group <- 0 #Create a dummy grouping variable
for(i in 2:length(df$NoInSeries)) { #Starting on row 2 to the end
#Check if the series resets (True = 1, False = 0)
check <- df[i-1, "NoInSeries"] > df[i, "NoInSeries"]
df[i, "group"] <- df[i-1, "group"] + check #Add check value to previous row
} #This yields a number for each series
require(plyr)
ddply(df, .(group), summarise,
Date= min(Date), Time=min(Time), Long=mean(Long), Lat= mean(Lat))
# group Date Time Long Lat
#1 0 17/11/2014 22:09:17 115.9510 -31.82856
#2 1 18/11/2014 00:07:14 115.9509 -31.82830
#3 2 18/11/2014 10:00:24 115.9514 -31.82669
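You can report the Lat/Lon against the first time (min, as above), the last time (max), or the average time (mean). However, I have sometimes run into problems with ddply when the data frame contains POSIXct dates/times.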
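If POSIXct columns are a concern, a base R sketch along these lines might sidestep ddply entirely. It is only an illustration: the DateTime column and the tz = "UTC" setting are assumptions (the question does not say which time zone the fixes use), and the data is a shortened copy of the example. Parsing the timestamps also avoids taking min of Date as a character string, which compares text rather than dates.

```r
df <- data.frame(
  Date = c(rep("17/11/2014", 4), rep("18/11/2014", 5)),
  Time = c("22:09:17", "22:09:18", "22:09:19", "22:09:21",
           "00:07:14", "00:07:15", "00:07:16", "00:07:17", "00:07:18"),
  Long = c(115.9508, 115.9508, 115.9513, 115.9511,
           115.9509, 115.9509, 115.9509, 115.9509, 115.9509),
  Lat = c(-31.82850, -31.82846, -31.82864, -31.82863,
          -31.82829, -31.82829, -31.82830, -31.82830, -31.82831),
  NoInSeries = c(11:14, 11:15),
  stringsAsFactors = FALSE
)

# Parse Date and Time into one POSIXct column (time zone is assumed).
df$DateTime <- as.POSIXct(paste(df$Date, df$Time),
                          format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Same grouping idea as above: new group whenever NoInSeries drops.
df$group <- cumsum(c(0, diff(df$NoInSeries) < 0))

# First row of each group keeps its POSIXct timestamp untouched.
out <- df[!duplicated(df$group), c("group", "DateTime")]
# tapply returns the means in group order, matching the rows of `out`.
out$Long <- as.vector(tapply(df$Long, df$group, mean))
out$Lat <- as.vector(tapply(df$Lat, df$group, mean))
print(out)
```

Taking the first row per group reports the first time in each series; the last time could be picked with duplicated(df$group, fromLast = TRUE) instead.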