R - fuzzy join on nearest integer only
Say I start with this dataset, in this silly layout:
originalDF <- data.frame(
Index = 1:14,
Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height", "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
Value = c("Sara", "115", "17", "Bob", "158", "22", "72", "Irv", "210", "42", "68", "Fred", "155", "65")
)
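I want to reshape it so that each detail row carries its name. Basically, I want to match the Weight, Age, and Height rows with the Name row above them. Splitting the data is easy with dplyr: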
library(dplyr)

# Name rows and detail rows, split on the Field column
namesDF <- originalDF %>%
  filter(Field == "Name")
detailsDF <- originalDF %>%
  filter(Field != "Name")
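From here, using the index (row number) seems like the best approach: match each row in detailsDF with the entry in namesDF that has the closest index without going over. I used the fuzzyjoin package and joined with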
library(fuzzyjoin)
fuzzy_left_join(detailsDF, namesDF, by = "Index", match_fun = list(`>`))
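This sort of works, but it also joins each row in detailsDF with every row in namesDF that has a smaller index. For example, the row at Index 5 matches both Sara (Index 1) and Bob (Index 4).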
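I came up with a solution that uses the distance to the next index to filter out the extra rows, but I'd like to avoid that: the actual source file will have over 200k rows, and the temporary data frame with the extra rows would be too large to fit in memory. Is there anything I can do here? Thanks!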
You can use rep(), repeating each name over the gap to the next Name row:
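# Row positions of the Name rows
x = which(originalDF$Field == "Name")
# Repeat each name once for every row up to the next Name row
originalDF$Name = rep(originalDF$Value[x], times = diff(c(x, NROW(originalDF)+1)))
# Drop the Name rows and keep the Name, Field, and Value columns
NewDF = originalDF[originalDF$Field != 'Name', c(4,2,3)]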
# Name Field Value
# 2 Sara Weight 115
# 3 Sara Age 17
# 5 Bob Weight 158
# 6 Bob Age 22
# 7 Bob Height 72
# 9 Irv Weight 210
# 10 Irv Age 42
# 11 Irv Height 68
# 13 Fred Age 155
# 14 Fred Height 65
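To make the `times` argument above easier to follow: the gaps between consecutive Name positions are the run lengths, that is, how many rows belong to each name (the Name row itself included). The base R sketch below shows just that step; the variable names `name_pos` and `run_lengths` are illustrative and not part of the answer.

```r
Field <- c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
           "Name", "Weight", "Age", "Height", "Name", "Age", "Height")

# Positions of the Name rows: 1, 4, 8, 12
name_pos <- which(Field == "Name")

# Append one-past-the-end (15) so the last block also gets a length,
# then take successive differences: 3, 4, 4, 3
run_lengths <- diff(c(name_pos, length(Field) + 1))

print(name_pos)
print(run_lengths)
```

Because the run lengths add up to the number of rows, rep() returns exactly one name per row of the original data frame.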
I'd suggest approaching it differently: keep track of the most recent "Name" value at each point. fill() from the tidyr package is useful for this.
library(dplyr)
library(tidyr)
originalDF %>%
mutate(Name = ifelse(Field == "Name", as.character(Value), NA)) %>%
fill(Name) %>%
filter(Field != "Name")
Output:
Index Field Value Name
1 2 Weight 115 Sara
2 3 Age 17 Sara
3 5 Weight 158 Bob
4 6 Age 22 Bob
5 7 Height 72 Bob
6 9 Weight 210 Irv
7 10 Age 42 Irv
8 11 Height 68 Irv
9 13 Age 155 Fred
10 14 Height 65 Fred
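The as.character() call above matters when Value is a factor, which is what data.frame() produced by default for character columns before R 4.0.0. The small base R sketch below uses made-up three-element vectors (not the data from the question) to show what ifelse() returns with and without the conversion.

```r
Field <- c("Name", "Weight", "Name")
Value <- factor(c("Sara", "115", "Bob"))

# Without conversion, ifelse() drops the factor class and returns the
# underlying integer codes instead of the labels.
codes <- ifelse(Field == "Name", Value, NA)

# With as.character(), the labels survive and non-Name rows become NA.
chars <- ifelse(Field == "Name", as.character(Value), NA)

print(codes)
print(chars)
```

If Value is already a character column, as.character() is a harmless no-op, so the pipeline above works either way.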
However, if you do want to stick with the fuzzyjoin approach, you can use group_by() and slice() on the result, taking the last row for each value of Index.x.
fuzzy_left_join(detailsDF, namesDF, by = "Index", match_fun = list(`>`)) %>%
group_by(Index.x) %>%
slice(n()) %>%
ungroup()
Output:
# A tibble: 10 x 6
Index.x Field.x Value.x Index.y Field.y Value.y
<int> <fct> <fct> <int> <fct> <fct>
1 2 Weight 115 1 Name Sara
2 3 Age 17 1 Name Sara
3 5 Weight 158 4 Name Bob
4 6 Age 22 4 Name Bob
5 7 Height 72 4 Name Bob
6 9 Weight 210 8 Name Irv
7 10 Age 42 8 Name Irv
8 11 Height 68 8 Name Irv
9 13 Age 155 12 Name Fred
10 14 Height 65 12 Name Fred
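For a sense of scale, the number of rows the `>` join produces before slice() can be counted without running the join. The sketch below is a swapped-in alternative rather than part of this answer: it uses base R's findInterval(), which for a sorted vector returns the position of the last element not exceeding each query value, so it gives the nearest preceding name directly. It assumes Index is sorted in ascending order and that the first row is a Name row; `pos` and `intermediate_rows` are illustrative names.

```r
originalDF <- data.frame(
  Index = 1:14,
  Field = c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
            "Name", "Weight", "Age", "Height", "Name", "Age", "Height"),
  Value = c("Sara", "115", "17", "Bob", "158", "22", "72",
            "Irv", "210", "42", "68", "Fred", "155", "65"),
  stringsAsFactors = FALSE
)

is_name   <- originalDF$Field == "Name"
namesDF   <- originalDF[is_name, ]
detailsDF <- originalDF[!is_name, ]

# For each detail row: how many name rows have an Index <= its Index.
# With sorted indexes this is also the position of the nearest name above.
pos <- findInterval(detailsDF$Index, namesDF$Index)

# The `>` join pairs each detail row with every earlier name row, so its
# intermediate result has sum(pos) rows (25 here, versus 10 detail rows).
intermediate_rows <- sum(pos)

# Direct lookup, with no intermediate table (assumption: every pos >= 1).
detailsDF$Name <- namesDF$Value[pos]

print(intermediate_rows)
print(detailsDF[, c("Name", "Field", "Value")])
```

Each detail row is joined to every name that precedes it, so the intermediate result grows much faster than the input, which is consistent with the memory concern raised in the question.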
You can group by cumsum(Field == "Name"). With dplyr...
library(dplyr)
originalDF %>%
group_by(Name = Value[Field == "Name"][cumsum(Field == "Name")]) %>%
slice(-1) %>% select(c("Name", "Field", "Value"))
# A tibble: 10 x 3
# Groups: Name [4]
Name Field Value
<fct> <fct> <fct>
1 Bob Weight 158
2 Bob Age 22
3 Bob Height 72
4 Fred Age 155
5 Fred Height 65
6 Irv Weight 210
7 Irv Age 42
8 Irv Height 68
9 Sara Weight 115
10 Sara Age 17
With data.table...
library(data.table)
data.table(originalDF)[,
.SD[-1],
by=.(Name = Value[Field == "Name"][cumsum(Field == "Name")]), .SDcols=c("Field", "Value")]
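Both versions above look up the name values through a running block counter. The base R sketch below shows that counter on its own, with illustrative variable names (`grp`, `result`). One caveat, offered as an observation rather than something the answer addresses: grouping by the name itself would merge two people who share a name, whereas the counter stays distinct for every block.

```r
Field <- c("Name", "Weight", "Age", "Name", "Weight", "Age", "Height",
           "Name", "Weight", "Age", "Height", "Name", "Age", "Height")
Value <- c("Sara", "115", "17", "Bob", "158", "22", "72",
           "Irv", "210", "42", "68", "Fred", "155", "65")

# Running count of Name rows seen so far: 1 1 1 2 2 2 2 3 3 3 3 4 4 4
grp <- cumsum(Field == "Name")

# Use the counter to pick the matching name for every row
Name <- Value[Field == "Name"][grp]

# Keep only the detail rows, in their original order
keep <- Field != "Name"
result <- data.frame(
  Group = grp[keep],
  Name  = Name[keep],
  Field = Field[keep],
  Value = Value[keep],
  stringsAsFactors = FALSE
)
print(result)
```

Unlike the grouped dplyr output above, which comes back ordered by Name, this keeps the rows in their original order.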