
Create group based on fuzzy criteria


Name   Start_Date   End_Date
A      2015-01-01   2019-12-29
A      2017-03-25   NA
A      2019-10-17   NA
A      2012-04-16   2015-01-09
A      2002-06-01   2006-02-01
A      2005-12-24   NA
B      2018-01-23   NA

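I would like to create a column such that if two observations have the same Name, and one observation's Start_Date is within ±1 year of the other observation's End_Date, they are classified as being in the same group.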


Name   Start_Date   End_Date    Wanted
A      2015-01-01   2019-12-29  1
A      2017-03-25   NA          NA
A      2019-10-17   NA          1
A      2012-04-16   2015-01-09  1
A      2002-06-01   2006-02-01  2
A      2005-12-24   NA          2
B      2018-01-23   NA          NA

I would prefer a solution with data.table but any solutions at all are much appreciated!

ADDED: Row-by-row explanation

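  1. Start date is 8 days earlier than the end date of row 4 (< 1 year). It is in the same group as row 4.
  2. Start date is more than 2 years away from the end date of row 1, so it is not in the same group as row 1. The same holds for rows 4 and 5, so it is not in the same group as those two either.
  3. Start date is about 2 months earlier than the end date of row 1 (< 1 year). It is in the same group as row 1.
  4. See row 1.
  5. See below.
  6. Start date is just over 1 month earlier than the end date of row 5 (< 1 year). It is in the same group as row 5.
  7. No other row has the Name B to compare against, so it is on its own.

Hence, rows 1, 3, and 4 are in the same group. Rows 5 and 6 are in the same group. Rows 2 and 7 are ungrouped.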
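To make the pairwise criterion concrete, here is a minimal sketch in base R. The helper name within_one_year is made up for illustration, and it assumes "1 year" means one calendar year on either side of the End_Date.

```r
# Hypothetical helper (not part of any package): TRUE when 'start' falls
# within one calendar year of the single date 'end', in either direction.
within_one_year <- function(start, end) {
  lower <- seq(end, by = "-1 year", length.out = 2)[2]
  upper <- seq(end, by = "1 year", length.out = 2)[2]
  start >= lower & start <= upper
}

# Start dates of rows 1 and 2, compared against the End_Date of row 4.
starts <- as.Date(c("2015-01-01", "2017-03-25"))
row4_end <- as.Date("2015-01-09")
close_to_row4 <- within_one_year(starts, row4_end)
print(close_to_row4)
```

Under that assumption, row 1 counts as close to row 4 while row 2 does not.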

EDIT: I updated my code to keep the Wanted category consistent when an observation does not match any other observation.


Here is a solution with data.table, as preferred:

I would prefer a solution with data.table but any solutions at all are much appreciated!

While dplyr and fuzzyjoin might appear more elegant, they might also prove less efficient with sufficiently large datasets.

Credit goes to ThomasIsCoding for beating me to the punch on another question, with an answer that harnesses igraph to index networks in graphs. Here, the networks are the separate "chains" (Wanted groups) composed of "links" (data.frame rows), which are joined by their "closeness" (between their Start_Dates and End_Dates). Such an approach seemed necessary to model the transitive relationship ℛ requested here

I am trying to create the chain of "close" links so that I can map A's movements over time.

while also taking care to preserve the symmetry of ℛ (see the further reading).

Per your request

So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa.


...I would want another column that indicates that [flag].

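I have also included a Flag column, which flags every row whose Start_Date is matched by the End_Dates of at least flag_at other rows, or vice versa.


Using your sample data.frame, reproduced here as my_data_frame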

# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")

we apply data.table and igraph (among other packages) as follows:


# ...
# Code to generate your data.frame 'my_data_frame'.
# ...

# Load magrittr for the %>% pipe used throughout.
library(magrittr)

# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()

# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)

# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2

# BEGIN: Start Indexing the Groups. #

# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    #   NOTE: This keeps the graph clean by trimming extraneous loops, and it
    #   prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]

# PAUSE: Count the Matches. #

# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  # Extract the count column.
  .$Matches

# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  # Extract the count column.
  .$Matches

# RESUME: Continue Indexing the Groups. #

# Resume indexing:
output <- output %>%
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  # Pair each ID with its group index.
  igraph::membership() %>%
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  # STEP 3: Match the Original Rows to their "Chains" #
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]

# View results.
output



   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

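Keep in mind that the Flags are all FALSE simply because your data lacks any Start_Date matching (at least) two End_Dates, and likewise any End_Date matching (at least) two Start_Dates.

Hypothetically, if we lowered flag_at to 1, then output would Flag every row with even a single match (in either direction):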

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1  TRUE
4:    A 2012-04-16 2015-01-09      1  TRUE
5:    A 2002-06-01 2006-02-01      2  TRUE
6:    A 2005-12-24       <NA>      2  TRUE
7:    B 2018-01-23       <NA>     NA FALSE


Because some data.table operations modify by reference (or "in place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes

   Name Start_Date   End_Date ID    End_Low   End_High
1:    A 2015-01-01 2019-12-29  1 2018-12-29 2020-12-29
2:    A 2017-03-25       <NA>  2       <NA>       <NA>
3:    A 2019-10-17       <NA>  3       <NA>       <NA>
4:    A 2012-04-16 2015-01-09  4 2014-01-09 2016-01-09
5:    A 2002-06-01 2006-02-01  5 2005-02-01 2007-02-01
6:    A 2005-12-24       <NA>  6       <NA>       <NA>
7:    B 2018-01-23       <NA>  7       <NA>       <NA>

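which differs in structure from the my_data_frame it was originally copied from.

Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.

As it stands, however, you must take care when modifying the workflow, because the version of my_data_table available before Step 1 cannot be recovered afterwards.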
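If you need the pre-Step 1 version later, one option is to snapshot it with data.table::copy() beforehand. The sketch below uses a small made-up table (the names toy and snapshot are illustrative only) rather than your data.

```r
library(data.table)

# Small made-up table, standing in for 'my_data_table'.
toy <- data.table(Name = c("A", "B"), Value = c(1, 2))

# Take a deep copy BEFORE any by-reference modification.
snapshot <- copy(toy)

# ':=' adds a column to 'toy' in place; 'snapshot' is unaffected.
toy[, ID := .I]

print(names(toy))
print(names(snapshot))
```

Only toy gains the ID column, so snapshot can still be used wherever the original structure is needed.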


Although JOINing in data.table is explicitly directional, with a "right" side and a "left" side, this model manages to preserve the relational symmetry you describe here

if...[either] one's 'Start_Date' is +- 1 year within the other observation's 'End_Date', they are classified as being in the same group.

by using an undirected graph.

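When the JOIN relates row 1 (having a Start_Date of 2015-01-01) to row 4 (having an End_Date of 2015-01-09), we gather that the Start_Date of x (row 1) is "close enough" to (within 1 year of) the End_Date of y (row 4). So we say mathematically that x ℛ y, or

x "is in the same group as" y.

But the converse will not necessarily appear in the JOINed data, because the Start_Date of y might not land so conveniently near the End_Date of x. That is, the JOINed data will not necessarily indicate that y ℛ x, or

y "is in the same group as" x.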

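In the latter case, a strictly directed graph ("digraph") would not capture the common membership of x and y in the same group. You can observe this jarring difference by setting directed = TRUE in the first line of Step 2

  igraph::graph_from_data_frame(directed = TRUE) %>%

and by setting mode = "strong" in the next line

  igraph::components(mode = "strong") %>%

which splits the matched rows into separate groups: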


   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      4 FALSE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      3 FALSE
4:    A 2012-04-16 2015-01-09      5 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      1 FALSE
7:    B 2018-01-23       <NA>     NA FALSE

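By contrast, the rows can be properly grouped by using an undirected graph (directed = FALSE), or by using more lenient criteria (mode = "weak"). Either approach effectively simulates the presence of y ℛ x whenever x ℛ y appears in the JOINed data.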
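The difference can be seen on a small edge list. This sketch assumes only the three pairings that Step 1 yields for your sample data (1 to 4, 3 to 1, and 6 to 5); the object names are illustrative.

```r
# Pairings (from = x, to = y) assumed from the sample data.
pairings <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))

directed_graph <- igraph::graph_from_data_frame(pairings, directed = TRUE)

# Strong connectivity requires a path in BOTH directions between two IDs,
# so every ID ends up alone in its own component.
n_strong <- igraph::components(directed_graph, mode = "strong")$no

# Weak connectivity ignores direction, like an undirected graph would.
n_weak <- igraph::components(directed_graph, mode = "weak")$no

print(c(strong = n_strong, weak = n_weak))
```

The five matched IDs form five strong components but only two weak ones, which correspond to the two Wanted groups.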

This symmetric property is particularly important when modeling the behavior you describe here:

...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)...

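In such a situation, you want the model to recognize that any two rows y and z must be in the same group (y ℛ z), so long as their End_Dates match the same Start_Date of another row x: x ℛ y and x ℛ z.

So suppose we know that x ℛ y and x ℛ z. Because our model preserves symmetry, we can also say from x ℛ y that y ℛ x. Since we now know that y ℛ x and x ℛ z, transitivity implies that y ℛ z. Thus, our model recognizes that y ℛ z whenever x ℛ y and x ℛ z! Similar logic suffices for the "vice versa".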
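As a minimal sketch of that argument, the snippet below builds an undirected graph from two hypothetical pairings, labeled x, y, and z purely for illustration, and checks that y and z share a group even though they were never paired directly.

```r
# Hypothetical pairings: x R y and x R z (no direct pairing of y with z).
pairings <- data.frame(from = c("x", "x"), to = c("y", "z"))

undirected_graph <- igraph::graph_from_data_frame(pairings, directed = FALSE)

# Group index for each vertex, labeled explicitly by vertex name.
group_index <- igraph::components(undirected_graph)$membership
names(group_index) <- igraph::V(undirected_graph)$name

# y and z are linked only through x, yet they land in the same group.
same_group <- group_index[["y"]] == group_index[["z"]]
print(same_group)
```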


my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))

This appends an 8th row to my_data_frame before the workflow runs:

    Name Start_Date   End_Date
  1    A 2015-01-01 2019-12-29
# ⋮    ⋮      ⋮           ⋮
  4    A 2012-04-16 2015-01-09
# ⋮    ⋮      ⋮           ⋮
  8    A 2010-01-01 2015-01-05

Row 8 serves as our z, where x is row 1 and y is row 4, as before. Indeed, output correctly classifies y and z as belonging to the same group 1: y ℛ z.

   Name Start_Date   End_Date Wanted  Flag
1:    A 2015-01-01 2019-12-29      1  TRUE
2:    A 2017-03-25       <NA>     NA FALSE
3:    A 2019-10-17       <NA>      1 FALSE
4:    A 2012-04-16 2015-01-09      1 FALSE
5:    A 2002-06-01 2006-02-01      2 FALSE
6:    A 2005-12-24       <NA>      2 FALSE
7:    B 2018-01-23       <NA>     NA FALSE
8:    A 2010-01-01 2015-01-05      1 FALSE

Likewise, output correctly Flags the 1st row, whose Start_Date is now matched by two End_Dates: those in the 4th and 8th rows.
