Create group based on fuzzy criteria
I have a data frame that looks like this:
Name Start_Date End_Date
A 2015-01-01 2019-12-29
A 2017-03-25 NA
A 2019-10-17 NA
A 2012-04-16 2015-01-09
A 2002-06-01 2006-02-01
A 2005-12-24 NA
B 2018-01-23 NA
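I want to create a column such that, if two observations have the same Name and one observation's Start_Date is within ±1 year of the other observation's End_Date, they are classified as being in the same group.
Desired output: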
Name Start_Date End_Date Wanted
A 2015-01-01 2019-12-29 1
A 2017-03-25 NA NA
A 2019-10-17 NA 1
A 2012-04-16 2015-01-09 1
A 2002-06-01 2006-02-01 2
A 2005-12-24 NA 2
B 2018-01-23 NA NA
I would prefer a solution with data.table but any solutions at all are much appreciated!
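Added: row-by-row explanation
Rows:
- Row 1: its Start_Date is 8 days earlier than the End_Date of row 4 (< 1 year), so it is in the same group as row 4.
- Row 2: its Start_Date is more than 2 years away from the End_Date of row 1, so it is not in the same group as row 1. The same holds for rows 4 and 5, so it is not in the same group as those two either.
- Row 3: its Start_Date is about 2 months earlier than the End_Date of row 1 (< 1 year), so it is in the same group as row 1.
- Row 4: see row 1.
- Row 5: see row 6.
- Row 6: its Start_Date is about 5 weeks earlier than the End_Date of row 5 (< 1 year), so it is in the same group as row 5.
- Row 7: there is no other row with Name B to compare against, so it is left ungrouped.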
Therefore, rows 1, 3, and 4 are in the same group. Rows 5 and 6 are in the same group. Rows 2 and 7 are not grouped.
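The gaps quoted above can be double-checked with a few lines of base R. This is only a sketch: the vectors start and end are re-typed from the sample table, gap and is_close are hypothetical helper names, and 365 days is used as a rough stand-in for "1 year".

```r
# Start and end dates for the six "A" rows, re-typed from the sample table.
start <- as.Date(c("2015-01-01", "2017-03-25", "2019-10-17",
                   "2012-04-16", "2002-06-01", "2005-12-24"))
end   <- as.Date(c("2019-12-29", NA, NA,
                   "2015-01-09", "2006-02-01", NA))

# gap[i, j]: days from the Start_Date of row i to the End_Date of row j.
gap <- outer(seq_along(start), seq_along(end),
             function(i, j) as.numeric(end[j] - start[i]))

# Approximate "within 1 year" as "within 365 days" (an assumption).
is_close <- !is.na(gap) & abs(gap) <= 365

# List the (start row, end row) pairs that count as close.
which(is_close, arr.ind = TRUE)
```

Under that approximation, only three pairs qualify: row 1 with row 4, row 3 with row 1, and row 6 with row 5.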
Edit: I updated my example so that the Wanted category is consistent when an observation does not match any other observation.
Approach
Here is a solution with data.table, as you preferred:
I would prefer a solution with data.table but any solutions at all are much appreciated!
While dplyr and fuzzyjoin might look more elegant, they might also prove less efficient on sufficiently large datasets.
Credit goes to ThomasIsCoding for beating me to the punch with an answer that harnesses igraph to index networks in graphs. Here, the networks are the separate "chains" (Wanted groups) composed of "links" (data.frame rows), which are joined by their "closeness" (between their Start_Dates and End_Dates). Such an approach seemed necessary to model the transitive relationship ℛ requested here
I am trying to create the chain of "close" links so that I can map A's movements over time.
while also taking care to preserve the symmetry of ℛ (see Further Reading).
Per that same request
So I would ideally like to flag situations where one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31) and vice versa.
and your statement that
...I would want another column that indicates that [flag].
I have also included a Flag column, which flags every row whose Start_Date is matched by the End_Dates of at least flag_at other rows, or vice versa.
Solution
Using your sample data.frame, reproduced here as my_data_frame
# Generate dataset as data.frame.
my_data_frame <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "B"),
                                Start_Date = structure(c(16436, 17250, 18186, 15446, 11839, 13141, 17554),
                                                       class = "Date"),
                                End_Date = structure(c(18259, NA, NA, 16444, 13180, NA, NA),
                                                     class = "Date")),
                           row.names = c(NA, -7L),
                           class = "data.frame")
we apply data.table and igraph (among other packages) as follows:
library(tidyverse)
library(data.table)
library(lubridate)
library(igraph)
# ...
# Code to generate your data.frame 'my_data_frame'.
# ...
# Treat dataset as a data.table.
my_data_table <- my_data_frame %>% data.table::as.data.table()
# Define the tolerance threshold as a (lubridate) "period": 1 year.
tolerance <- lubridate::years(1)
# Set the minimum number of matches for a row to be flagged: 2.
flag_at <- 2
#####################################
# BEGIN: Start Indexing the Groups. #
#####################################
# Begin indexing the "chain" (group) to which each "link" (row) belongs:
output <- my_data_table %>%
  ########################################################
  # STEP 1: Link the Rows That Are "Close" to Each Other #
  ########################################################
  # Prepare data.table for JOIN, by adding appropriate helper columns.
  .[, `:=`(# Uniquely identify each row (by row number).
           ID = .I,
           # Boundary columns for tolerance threshold.
           End_Low = End_Date - tolerance,
           End_High = End_Date + tolerance)] %>%
  # JOIN rows to each other, to obtain pairings.
  .[my_data_table,
    # Clearly describe the relation R: x R y whenever the 'Start_Date' of x is
    # close enough to (within the boundary columns for) the 'End_Date' of y.
    .(x.ID = i.ID, x.Name = i.Name, x.Start_Date = i.Start_Date, x.End_Date = i.End_Date,
      y.End_Low = x.End_Low, y.End_High = x.End_High, y.ID = x.ID, y.Name = x.Name),
    # JOIN criteria:
    on = .(# Only pair rows having the same name.
           Name,
           # Only pair rows whose start and end dates are within the tolerance
           # threshold of each other.
           End_Low <= Start_Date,
           End_High >= Start_Date),
    # Make it an OUTER JOIN, to include those rows without a match.
    nomatch = NA] %>%
  # Prepare pairings for network analysis.
  .[# Ensure no row is reflexively paired with itself.
    # NOTE: This keeps the graph clean by trimming extraneous loops, and it
    # prevents an "orphan" row from contributing to its own tally of matches.
    !(x.ID == y.ID) %in% TRUE,
    # !(x.ID == y.ID) %in% TRUE,
    # Simplify the dataset to only the pairings (by ID) of linked rows.
    .(from = x.ID, to = y.ID)]
#############################
# PAUSE: Count the Matches. #
#############################
# Count how many times each row has its 'End_Date' matched by a 'Start_Date'.
my_data_table$End_Matched <- output %>%
  # Include again the missing IDs for y that were never matched by the JOIN.
  .[my_data_table[, .(ID)], on = .(to = ID)] %>%
  # For each row y, count every other row x where x R y.
  .[, .(Matches = sum(!is.na(from))), by = to] %>%
  # Extract the count column.
  .$Matches
# Count how many times each row has its 'Start_Date' matched by an 'End_Date'.
my_data_table$Start_Matched <- output %>%
  # For each row x, count every other row y where x R y.
  .[, .(Matches = sum(!is.na(to))), by = from] %>%
  # Extract the count column.
  .$Matches
#########################################
# RESUME: Continue Indexing the Groups. #
#########################################
# Resume indexing:
output <- output %>%
  # Ignore nonmatches (NAs) which are annoying to process into a graph.
  .[from != to, ] %>%
  ###############################################################
  # STEP 2: Index the Separate "Chains" Formed By Those "Links" #
  ###############################################################
  # Convert pairings (by ID) of linked rows into an undirected graph.
  igraph::graph_from_data_frame(directed = FALSE) %>%
  # Find all groups (subgraphs) of transitively linked IDs.
  igraph::components() %>%
  # Pair each ID with its group index.
  igraph::membership() %>%
  # Tabulate those pairings...
  utils::stack() %>% utils::type.convert(as.is = TRUE) %>%
  # ...in a properly named data.table.
  data.table::as.data.table() %>% .[, .(ID = ind, Group_Index = values)] %>%
  #####################################################
  # STEP 3: Match the Original Rows to their "Chains" #
  #####################################################
  # LEFT JOIN (on ID) to match each original row to its group index (if any).
  .[my_data_table, on = .(ID)] %>%
  # Transform output into final form.
  .[# Sort into original order.
    order(ID),
    .(# Select existing columns.
      Name, Start_Date, End_Date,
      # Rename column having the group indices.
      Wanted = Group_Index,
      # Calculate column(s) to flag rows with sufficient matches.
      Flag = (Start_Matched >= flag_at) | (End_Matched >= flag_at))]
# View results.
output
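For readers new to non-equi joins, the core of Step 1 can be seen in isolation. The following is a minimal sketch rather than the workflow itself: the tables ends, starts, and pairs are hypothetical names, the dates are re-typed from the sample data, and a fixed 365-day window stands in for the lubridate period (an assumption that differs slightly from a calendar year).

```r
library(data.table)

# Rows that have an End_Date, with a +/- 365 day window around it.
# (365 days is a rough stand-in for the lubridate period used above.)
ends <- data.table(ID = c(1L, 4L, 5L), Name = "A",
                   End_Date = as.Date(c("2019-12-29", "2015-01-09", "2006-02-01")))
ends[, `:=`(End_Low = End_Date - 365, End_High = End_Date + 365)]

# Every "A" row contributes a Start_Date.
starts <- data.table(ID = 1:6, Name = "A",
                     Start_Date = as.Date(c("2015-01-01", "2017-03-25", "2019-10-17",
                                            "2012-04-16", "2002-06-01", "2005-12-24")))

# Non-equi join: keep (from, to) whenever Start_Date falls inside the window.
pairs <- ends[starts,
              .(from = i.ID, to = x.ID),
              on = .(Name, End_Low <= Start_Date, End_High >= Start_Date),
              nomatch = NULL]
pairs
```

Here nomatch = NULL turns the join into an inner join, so only the three matched pairings remain; the workflow above uses nomatch = NA instead so that unmatched rows are kept for counting.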
Results
The resulting output is the following data.table:
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
Keep in mind that the Flags are all FALSE, because your data lacks any Start_Date that matches (at least) two End_Dates, as well as any End_Date that matches (at least) two Start_Dates.
Hypothetically, if we lowered flag_at to 1, then output would Flag every row with even a single match (in either direction):
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 TRUE
4: A 2012-04-16 2015-01-09 1 TRUE
5: A 2002-06-01 2006-02-01 2 TRUE
6: A 2005-12-24 <NA> 2 TRUE
7: B 2018-01-23 <NA> NA FALSE
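The effect of flag_at can be reproduced in isolation with base R. This is a simplified sketch, not the workflow's own counting code: the vectors from and to hold the three matched pairings worked out for the sample data, and start_matched, end_matched, and flag are hypothetical names.

```r
# Matched pairings from the sample data: the Start_Date of row `from`
# is close to the End_Date of row `to`.
from <- c(1L, 3L, 6L)
to   <- c(4L, 1L, 5L)
n_rows <- 7L

# Per-row tallies of matches in each direction.
start_matched <- tabulate(from, nbins = n_rows)
end_matched   <- tabulate(to,   nbins = n_rows)

# A row is flagged when either tally reaches the threshold.
flag <- function(flag_at) (start_matched >= flag_at) | (end_matched >= flag_at)

flag(2)  # no row has two matches in either direction
flag(1)  # rows 1, 3, 4, 5, and 6 have at least one match
```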
Warning
Because certain data.table operations modify by reference (or "in place"), the value of my_data_table changes throughout the workflow. After Step 1, my_data_table becomes
Name Start_Date End_Date ID End_Low End_High
1: A 2015-01-01 2019-12-29 1 2018-12-29 2020-12-29
2: A 2017-03-25 <NA> 2 <NA> <NA>
3: A 2019-10-17 <NA> 3 <NA> <NA>
4: A 2012-04-16 2015-01-09 4 2014-01-09 2016-01-09
5: A 2002-06-01 2006-02-01 5 2005-02-01 2007-02-01
6: A 2005-12-24 <NA> 6 <NA> <NA>
7: B 2018-01-23 <NA> 7 <NA> <NA>
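which differs in structure from the my_data_frame it was originally copied from.
Since dplyr (among other packages) assigns by value rather than by reference, a dplyr solution would sidestep this issue entirely.
As it is, however, you must take care when modifying the workflow, because the version of my_data_table available before Step 1 cannot be recovered afterwards.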
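One way to keep the earlier version around is to take an explicit copy before any in-place edits run. The sketch below uses data.table::copy() on a small stand-in table; the names original, alias, and snapshot are hypothetical and not part of the workflow above.

```r
library(data.table)

# A small stand-in for my_data_table (hypothetical example data).
original <- data.table(Name = c("A", "B"), Value = 1:2)

# Plain assignment does not copy: both names point at the same table.
alias <- original
# copy() takes a deep copy that later in-place edits cannot touch.
snapshot <- copy(original)

# Add a column by reference, as Step 1 does with `:=`.
original[, ID := .I]

names(original)  # includes "ID"
names(alias)     # also includes "ID"
names(snapshot)  # still only "Name" and "Value"
```

In the workflow above, the equivalent would be to call copy(my_data_table) right after as.data.table() and keep the result under a separate name.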
Further Reading
Although the JOINing of data.tables is explicitly directional, with a "right" side and a "left" side, this model manages to preserve the relational symmetry you described here
if...[either] one's 'Start_Date' is +- 1 year within the other observation's 'End_Date', they are classified as being in the same group.
by using an undirected graph.
When the JOIN relates row 1 (having a Start_Date of 2015-01-01) to row 4 (having an End_Date of 2015-01-09), we gather that the Start_Date of x is "close enough" to (within 1 year of) the End_Date of y. So we say mathematically that x ℛ y, or
x "is in the same group as" y.
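But the converse y ℛ x will not necessarily appear in the JOINed data, because the Start_Date of y might not land so conveniently near the End_Date of x. That is, the JOINed data will not necessarily indicate that
y "is in the same group as" x.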
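In the latter case, a strict directed graph ("digraph") would not capture the common membership of x and y in the same group. You can observe this jarring difference by setting directed = TRUE in the first line of Step 2
igraph::graph_from_data_frame(directed = TRUE) %>%
and setting mode = "strong" in the next line
igraph::components(mode = "strong") %>%
to yield these disassociated results: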
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 4 FALSE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 3 FALSE
4: A 2012-04-16 2015-01-09 5 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 1 FALSE
7: B 2018-01-23 <NA> NA FALSE
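By contrast, one can group the rows properly by using an undirected graph (directed = FALSE), or by using a more lenient criterion (mode = "weak"). Either of these approaches will effectively simulate the presence of y ℛ x whenever x ℛ y appears in the JOINed data.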
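The contrast between the two modes can be seen on the pairings alone. The following is a small sketch under the assumption that the sample data yields exactly the directed pairings 1 -> 4, 3 -> 1, and 6 -> 5; the names edges and g are hypothetical.

```r
library(igraph)

# Directed pairings x -> y found in the sample data (row IDs):
# 1 -> 4, 3 -> 1, 6 -> 5.
edges <- data.frame(from = c(1, 3, 6), to = c(4, 1, 5))
g <- graph_from_data_frame(edges, directed = TRUE)

# Strong components need a path in both directions, so every row is alone.
components(g, mode = "strong")$no

# Weak components ignore direction: {1, 3, 4} and {5, 6}.
components(g, mode = "weak")$no
```

The five strong components correspond to the five distinct Wanted values in the disassociated table, while the two weak components correspond to groups 1 and 2 in the original results.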
This symmetric property is especially important when modeling the behavior you described here:
...one observation's start date (2016-01-01) is being "fuzzily grouped" with two different end dates (2015-01-02, and 2016-12-31)...
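In such a situation, you want the model to recognize that any two rows y and z must be in the same group (y ℛ z), whenever their End_Dates are matched by the same Start_Date of some other row x: x ℛ y and x ℛ z.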
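So suppose we know that x ℛ y and x ℛ z. Because our model preserves symmetry, we can also say y ℛ x from x ℛ y. Since we now know y ℛ x and x ℛ z, transitivity implies y ℛ z. Therefore, our model recognizes that y ℛ z whenever x ℛ y and x ℛ z! Similar logic suffices for the "vice versa" case.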
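The same argument can be sketched with a three-vertex graph. This is an illustration only: the vertex labels "x", "y", and "z" are placeholders for rows, and edges, g, and groups are hypothetical names.

```r
library(igraph)

# x R y and x R z: the Start_Date of x matches the End_Dates of y and z.
edges <- data.frame(from = c("x", "x"), to = c("y", "z"))

# Dropping direction supplies symmetry; components() supplies transitivity.
g <- graph_from_data_frame(edges, directed = FALSE)
groups <- components(g)$membership

# y and z land in the same group even though no edge joins them directly.
groups[["y"]] == groups[["z"]]
```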
We can verify this result by using
my_data_frame <- my_data_frame %>%
  rbind(list(Name = "A",
             Start_Date = as.Date("2010-01-01"),
             End_Date = as.Date("2015-01-05")))
to append an 8th row to my_data_frame prior to the workflow:
Name Start_Date End_Date
1 A 2015-01-01 2019-12-29
# ⋮ ⋮ ⋮ ⋮
4 A 2012-04-16 2015-01-09
# ⋮ ⋮ ⋮ ⋮
8 A 2010-01-01 2015-01-05
Row 8 serves as our z, where x is row 1 and y is row 4, as before. Sure enough, output properly classifies y and z as belonging to the same group 1: y ℛ z.
Name Start_Date End_Date Wanted Flag
1: A 2015-01-01 2019-12-29 1 TRUE
2: A 2017-03-25 <NA> NA FALSE
3: A 2019-10-17 <NA> 1 FALSE
4: A 2012-04-16 2015-01-09 1 FALSE
5: A 2002-06-01 2006-02-01 2 FALSE
6: A 2005-12-24 <NA> 2 FALSE
7: B 2018-01-23 <NA> NA FALSE
8: A 2010-01-01 2015-01-05 1 FALSE
Likewise, output properly Flags row 1, whose Start_Date is now matched by two End_Dates: in row 4 and in row 8.
Cheers!