通过重复值和何时有断点创建具有条件的新列

Question

我的数据是大约40只动物（ids）通过遥测定位，我已经规定了3个区域。第一个是AR，这里是繁殖区，AM迁徙，AA是饲养区。所有动物的第一个位置是 AR。但有时动物还处于繁殖期（在AR），但可以出去AM几次，然后又回到AR。只有当动物只有 AM 时它们才开始迁徙，直到到达觅食区 AA。所以，它们从AR出发，然后开始迁徙AM，然后到达觅食区AA。

我正在尝试创建一个新的专栏，其中包含一些我还不知道如何操作的条件，例如我有这个数据框

id     area   
2304   AR
2304   AR
2304   AR
2304   AM  #this AM for example, can repeat until 20 times and then came back to AR
2304   AM
2304   AR
2304   AR
2304   AR
2304   AM
2304   AM
2304   AM
2304   AM
2304   ...
2304   AM
2304   AM
2304   AM
2304   AA
2304   AA
2304   ...
2304   AA

所以，当有 AR x 次，之后有一次或直到 20 AM 回来有 AR，我想要一个新的 AR 专栏。到有 AM x 次并且只有 AM，没有回到 AR 的那一刻，我想要带有 AM 的新专栏。像这样：

AA 没问题，AA = AA 总是

我预料到了：

id    area    fixed_area
2304   AR      AR
2304   AR      AR
2304   AR      AR
2304   AM      AR  #this AM for example, can repeat until 20 times and then came back to AR
2304   AM      AR
2304   AR      AR
2304   AR      AR
2304   AR      AR
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   ...     ...
2304   AM      AM
2304   AM      AM
2304   AM      AM
2304   AA      AA
2304   AA      AA
2304   ...    ...
2304   AA      AA

我试过这个：

但是 AA 不见了，也许问题是因为需要对每只动物进行这种分离 (id)

> table(df$area)

   AA    AM    AR 
31460 39101 28820 

class(df$area)
[1] "character"
> idx <- with(rle(as.character(df$area)), rep(seq_along(lengths), lengths))
> df$fixed_area <- with(df, replace(area, idx < max(idx[area == 'AM']), 'AR'))
> table(df$fixed_area)

   AM    AR 
  145 99236 
>

在此之后我输入了数据框，但是我的数据框有超过 90.000 行，所以我只复制了头部值

> dput(head(df))
structure(list(DeployID = c("111868_16", "111868_16", "111868_16", 
"111868_16", "111868_16", "111868_16"), Start = structure(c(1477323868, 
1477323946, 1477324002, 1477324044, 1477324260, 1477324480), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), End = structure(c(1477323944, 1477324000, 
1477324042, 1477324170, 1477324458, 1477324542), class = c("POSIXct", 
"POSIXt"), tzone = "GMT"), What = structure(c(1L, 1L, 1L, 1L, 
1L, 1L), .Label = c("Dive", "Message", "Surface"), class = "factor"), 
    Shape = structure(c(2L, 4L, 3L, 2L, 2L, 2L), .Label = c("", 
    "Square", "U", "V"), class = "factor"), DepthMean = c(14.5, 
    16.5, 13, 14.5, 11, 12.5), DurationMean = c(76, 54, 40, 126, 
    198, 62), DepthMin = c(14.5, 16.5, 13, 14.5, 11, 12.5), DepthMax = c(14.5, 
    16.5, 13, 14.5, 11, 12.5), depth_range = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("shallow", "deep"), class = c("ordered", 
    "factor")), MidTime = structure(c(1477323906, 1477323973, 
    1477324022, 1477324107, 1477324359, 1477324511), class = c("POSIXct", 
    "POSIXt"), tzone = "GMT"), year = c(2016, 2016, 2016, 2016, 
    2016, 2016), id = c("111868_16", "111868_16", "111868_16", 
    "111868_16", "111868_16", "111868_16"), segmentid = c("111868_16", 
    "111868_16", "111868_16", "111868_16", "111868_16", "111868_16"
    ), mu.x = c(-4446545.25191192, -4446557.10576816, -4446565.77504969, 
    -4446580.81370994, -4446625.40007808, -4446652.29459533), 
    mu.y = c(-2305423.86124176, -2305461.88537725, -2305489.69364377, 
    -2305537.93137917, -2305680.93056743, -2305767.17264774), 
    lon = c(-39.9439956132156, -39.944102098218, -39.944179975699, 
    -39.9443150702825, -39.9447155964422, -39.9449571940013), 
    lat = c(-20.3985940756941, -20.3989161274532, -20.3991516537744, 
    -20.3995602097098, -20.4007713539709, -20.4015017842338), 
    lq_closest_filt = c(7L, 7L, 7L, 7L, 7L, 7L), dt_closest_filt = c(0.0516666666666667, 
    0.0702777777777778, 0.0838888888888889, 0.1075, 0.1775, 0.219722222222222
    ), dist_closest_filt = c(0.103680210832692, 0.141026573116106, 
    0.168339162761167, 0.215717097671267, 0.356168027785347, 
    0.440874049523752), rel.angle = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_), speed = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), depth_bin = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("(0,50]", "(50,100]", "(100,150]", 
    "(150,200]", "(200,250]", "(250,300]", "(300,350]", "(350,400]", 
    "(400,450]", "(450,500]", "(500,550]", "(550,600]", "(600,650]", 
    "(650,700]"), class = "factor"), bat = structure(list(depth = c(-59L, 
    -59L, -59L, -59L, -59L, -59L)), row.names = c(NA, 6L), class = "data.frame"), 
    area = c("AR", "AR", "AR", "AR", "AR", "AR")), row.names = c(NA, 
6L), class = "data.frame")

有人知道如何解决这个问题吗？谢谢！

Answer 1

听起来您可能需要一些规则来决定哪些行 AM 变为 AR。

如果连续AM个数<20
如果下面的目的地是不是AA

一种方法是使用 rle 添加与这两个规则相关的列。对于重复序列中连续值的数量，一列将具有 lengths。另一列将包含 "next" 区域。这将决定目的地是回到繁殖区，还是继续到觅食区。

最后，您可以使用条件语句，将 AM 到 AR 的那些行更改为满足这些条件：

当前 area 是 AM
下一个area之后是不是AA
重复值的个数小于20

代码如下：

df_rle <- rle(df$area)
df2 <- cbind(df, next_area = with(df_rle, rep(c(values[-1], NA), lengths)),
                 count = with(df_rle, rep(lengths, lengths)))
df2$area <- ifelse(with(df2, area == "AM" & next_area != "AA" & count < 20),
                   "AR", df2$area)

通过重复值和何时有断点创建具有条件的新列

Create new column with condition by repeat values and when have break points

r

breakpoints

repeat

conditional-breakpoint

conditional-statements