Handling enormous nested datasets in R / tidyr
- I have data files from simulations of word recognition by a neural network.
- The network has a lexicon of 1000 words.
- The output has 30,000 nodes -- 30 copies of each word in different alignments.
- I run a separate simulation with each word as the input (1000 simulations). The output is a 100-step time series like the one below (2 example inputs are shown here (ark, bark), with 4 Copies per word, only 4 words tracked, and only 5 time steps shown).

[Edit: May 3, 2021. The dataset now includes realistic conditions that the previous solutions could not handle. I apologize for changing the data, but I did not see a better way to clarify the gaps in the previously suggested solutions.]
library(tidyverse)  # needed for %>% and arrange() below

xf = read.table(header=T, sep=",", text="
Input,Time,Word,Copy1,Copy2,Copy3,Copy30
ark,10,ark,-0.1,-0.1,-0.1,-0.1
ark,20,ark,0.0,0.5,0.55,0.01
ark,30,ark,0.01,0.1,0.2,0.05
ark,40,ark,0.02,0.3,0.5,0.1
ark,50,ark,0.01,0.2,0.4,-0.1
ark,10,ad,-0.1,-0.1,-0.1,-0.1
ark,20,ad,0.0,0.01,0.02,0.01
ark,30,ad,0.01,0.03,0.1,0.04
ark,40,ad,0.02,0.12,0.15,0.04
ark,50,ad,0.01,0.01,0.05,0.02
ark,10,bark,-0.1,-0.1,-0.1,-0.1
ark,20,bark,0.02,0.12,0.1,0.01
ark,30,bark,0.03,0.15,0.12,0.02
ark,40,bark,0.02,0.22,0.1,0.03
ark,50,bark,0.01,0.1,0.05,0.02
ark,10,bar,-0.1,-0.1,-0.1,-0.1
ark,20,bar,0.01,0.1,0.02,-0.05
ark,30,bar,0.01,0.12,0.03,0
ark,40,bar,0.02,0.15,0.03,0.01
ark,50,bar,0.01,0.05,0.02,0.01
bark,10,ark,-0.1,-0.1,-0.1,-0.1
bark,20,ark,0.0,0.04,0.05,0.01
bark,30,ark,0.01,0.08,0.1,0.05
bark,40,ark,0.02,0.05,0.2,0.1
bark,50,ark,0.01,0.01,0.3,-0.1
bark,10,ad,-0.1,-0.1,-0.1,-0.1
bark,20,ad,0.0,0.01,0.01,0.01
bark,30,ad,0.01,0.02,0.05,0.04
bark,40,ad,0.02,0.03,0.06,0.04
bark,50,ad,0.01,0.02,0.01,0.02
bark,10,bark,-0.1,-0.1,-0.1,-0.1
bark,20,bark,0.02,0.15,0.1,0.01
bark,30,bark,0.03,0.3,0.12,0.02
bark,40,bark,0.02,0.7,0.1,0.03
bark,50,bark,0.01,0.7,0.05,0.02
bark,10,bar,-0.1,-0.1,-0.1,-0.1
bark,20,bar,0.01,0.13,0.04,-0.05
bark,30,bar,0.01,0.25,0.06,0
bark,40,bar,0.02,0.4,0.08,0.01
bark,50,bar,0.01,0.35,0.01,0.01
") %>% arrange(Input,Word,Time)
I want to reduce these data in two ways.
(1) For each Input x Word combination, select one Copy of the word based on the maximum value over the entire time series, and
(2) based on the maxima of the retained copies (1 per Input x Word), reduce to the 'topX' words per Input.
My original question was unclear and became unwieldy. @DanChaltiel provided a partial answer using pivot_longer that came very close to a complete solution, but I failed to explain the first reduction clearly. So I broke that part out separately, and @akrun extended @DanChaltiel's solution as follows, solving the first part (updated May 3, 2021 to reflect a fix to the solution):
library(tidyverse)
# Reduce data to one Copy of each Input x Word combination
# based on maxima for entire time series, no matter what
# Time those maxima occur. Using pivot_longer was due to
# answer from @DanChaltiel, but getting it to work on
# Input x Word maxima over the whole time series (rather
# than maxima of Input x Word x Time) was due to @akrun
# for
xf2 <- xf %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'Value') %>%
  group_by(Input, Time, Word) %>%
  arrange(Value) %>%
  # within each Input x Time x Word group, keep the row holding the
  # largest value (the last positive one, or the last row if none is positive)
  slice(if(all(Value <= 0)) n()
        else tail(which(Value > 0), 1)) %>%
  group_by(Input, Word) %>%
  # label every row of the Input x Word series with the Copy that
  # attains the maximum over the whole time series
  mutate(copy_name = copy_name[which.max(Value)]) %>%
  ungroup
print((xf2 %>% arrange(Input, Word)), n = nrow(xf2)) # print all rows
# A tibble: 40 x 5
# Input Time Word copy_name Value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 10 ad Copy3 -0.1
# 2 ark 20 ad Copy3 0.02
# 3 ark 30 ad Copy3 0.1
# 4 ark 40 ad Copy3 0.15
# 5 ark 50 ad Copy3 0.05
# 6 ark 10 ark Copy3 -0.1
# 7 ark 20 ark Copy3 0.55
# 8 ark 30 ark Copy3 0.2
# 9 ark 40 ark Copy3 0.5
# 10 ark 50 ark Copy3 0.4
# 11 ark 10 bar Copy2 -0.1
# 12 ark 20 bar Copy2 0.1
# 13 ark 30 bar Copy2 0.12
# 14 ark 40 bar Copy2 0.15
# 15 ark 50 bar Copy2 0.05
# 16 ark 10 bark Copy2 -0.1
# 17 ark 20 bark Copy2 0.12
# 18 ark 30 bark Copy2 0.15
# 19 ark 40 bark Copy2 0.22
# 20 ark 50 bark Copy2 0.1
# 21 bark 10 ad Copy3 -0.1
# 22 bark 20 ad Copy3 0.01
# 23 bark 30 ad Copy3 0.05
# 24 bark 40 ad Copy3 0.06
# 25 bark 50 ad Copy3 0.02
# 26 bark 10 ark Copy3 -0.1
# 27 bark 20 ark Copy3 0.05
# 28 bark 30 ark Copy3 0.1
# 29 bark 40 ark Copy3 0.2
# 30 bark 50 ark Copy3 0.3
# 31 bark 10 bar Copy2 -0.1
# 32 bark 20 bar Copy2 0.13
# 33 bark 30 bar Copy2 0.25
# 34 bark 40 bar Copy2 0.4
# 35 bark 50 bar Copy2 0.35
# 36 bark 10 bark Copy2 -0.1
# 37 bark 20 bark Copy2 0.15
# 38 bark 30 bark Copy2 0.3
# 39 bark 40 bark Copy2 0.7
# 40 bark 50 bark Copy2 0.7
So this successfully reduces the data to a single Copy per Input x Word combination, based on the maximum over the full time series (Time 1..100).
The second challenge is to reduce the data to the top X words per Input.
The approach @AnilGoyal suggested worked for the simpler sample data, but only because of an accidental coincidence between the number of time steps included and the value of topX.
What I have been able to do so far, based on @AnilGoyal's example, is identify the topX words per Input based on each Input's maxima. Here are 2 examples, finding the top 3 and the top 2:
topX = 3
xftop3 <- xf2 %>% group_by(Input, Word) %>%
slice_max(Value, with_ties=FALSE) %>%
arrange(desc(Value)) %>%
group_by(Input) %>%
filter(1:n() <= topX) %>%
arrange(Input, Value)
xftop3
# A tibble: 6 x 5
# Groups: Input [2]
# Input Time Word copy_name Value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 40 ad Copy3 0.15
# 2 ark 40 bark Copy2 0.22
# 3 ark 20 ark Copy3 0.55
# 4 bark 50 ark Copy3 0.3
# 5 bark 40 bar Copy2 0.4
# 6 bark 40 bark Copy2 0.7
topX = 2
xftop2 <- xf2 %>% group_by(Input, Word) %>%
slice_max(Value, with_ties=FALSE) %>%
arrange(desc(Value)) %>%
group_by(Input) %>%
filter(1:n() <= topX) %>%
arrange(Input, Value)
xftop2
# A tibble: 4 x 5
# Groups: Input [2]
# Input Time Word copy_name Value
# <fct> <int> <fct> <chr> <dbl>
# 1 ark 40 bark Copy2 0.22
# 2 ark 20 ark Copy3 0.55
# 3 bark 40 bar Copy2 0.4
# 4 bark 40 bark Copy2 0.7
What I don't know how to do is to then use that tibble to reduce the dataset to only those Input x Word combinations at every time step. The desired output for the sample data with topX = 2 would be:
# A tibble: 20 x 5
Input Time Word copy_name Value
<fct> <int> <fct> <chr> <dbl>
1 ark 10 ark Copy3 -0.1
2 ark 20 ark Copy3 0.55
3 ark 30 ark Copy3 0.2
4 ark 40 ark Copy3 0.5
5 ark 50 ark Copy3 0.4
6 ark 10 bark Copy2 -0.1
7 ark 20 bark Copy2 0.12
8 ark 30 bark Copy2 0.15
9 ark 40 bark Copy2 0.22
10 ark 50 bark Copy2 0.1
11 bark 10 bar Copy2 -0.1
12 bark 20 bar Copy2 0.13
13 bark 30 bar Copy2 0.25
14 bark 40 bar Copy2 0.4
15 bark 50 bar Copy2 0.35
16 bark 10 bark Copy2 -0.1
17 bark 20 bark Copy2 0.15
18 bark 30 bark Copy2 0.3
19 bark 40 bark Copy2 0.7
20 bark 50 bark Copy2 0.7
Any suggestions would be greatly appreciated.
This answer has been rewritten (twice); see the edit log for the record.
Your problem has two steps:

- find the maxima
- select the rows associated with those maxima
Finding the maxima is a simple filtering problem. However, you may want to select the Input x Word pairs based on the mean/median rather than on a maximum occurring at a single Time. That would be a summarizing problem (dplyr::summarise()).
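If you did want that mean/median-based variant, a minimal sketch (my illustration, not code from the original answer) could average each Input x Word pair's Value over the whole series and keep the topX pairs per Input; it assumes the long-format xf2 from the question:

topX <- 2
xf2 %>%
  group_by(Input, Word) %>%
  # hypothetical criterion: mean activation across all time steps
  summarise(mean_value = mean(Value), .groups = "drop_last") %>%
  # still grouped by Input, so this keeps topX words per Input
  slice_max(mean_value, n = topX)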
Once you have the pairs, you just have to select the right rows. There are probably many ways to do that, but I chose to use right_join().
For pedagogical purposes I chose to keep these steps separate, but you could obviously merge them into a single pipeline (a merged sketch follows the output below).
topX = 2
xf2bis = xf2 %>%
group_by(Input, Word) %>%
filter(rank(Value, ties.method="first") == n()) %>%
group_by(Input) %>%
filter(rank(Value, ties.method="first") > n() - topX) %>%
select(Input, Word)
xf2bis
#> # A tibble: 4 x 2
#> # Groups: Input [2]
#> Input Word
#> <chr> <chr>
#> 1 ark ark
#> 2 ark bark
#> 3 bark bar
#> 4 bark bark
xftop2 = xf2 %>%
right_join(xf2bis, by=c("Input", "Word"))
xftop2
#> # A tibble: 20 x 5
#> Input Time Word copy_name Value
#> <chr> <int> <chr> <chr> <dbl>
#> 1 ark 10 ark Copy3 -0.1
#> 2 ark 10 bark Copy2 -0.1
#> 3 ark 20 ark Copy3 0.55
#> 4 ark 20 bark Copy2 0.12
#> 5 ark 30 ark Copy3 0.2
#> 6 ark 30 bark Copy2 0.15
#> 7 ark 40 ark Copy3 0.5
#> 8 ark 40 bark Copy2 0.22
#> 9 ark 50 ark Copy3 0.4
#> 10 ark 50 bark Copy2 0.1
#> 11 bark 10 bar Copy2 -0.1
#> 12 bark 10 bark Copy2 -0.1
#> 13 bark 20 bar Copy2 0.13
#> 14 bark 20 bark Copy2 0.15
#> 15 bark 30 bar Copy2 0.25
#> 16 bark 30 bark Copy2 0.3
#> 17 bark 40 bar Copy2 0.4
#> 18 bark 40 bark Copy2 0.7
#> 19 bark 50 bar Copy2 0.35
#> 20 bark 50 bark Copy2 0.7
Created on 2021-05-04 by the reprex package (v2.0.0)
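As noted above, the two steps can also be collapsed into a single pipeline. A sketch of that merged version (my combination of the steps shown above, using the same xf2 and topX):

topX <- 2
xftop2 <- xf2 %>%
  group_by(Input, Word) %>%
  filter(rank(Value, ties.method = "first") == n()) %>%        # each pair's maximum row
  group_by(Input) %>%
  filter(rank(Value, ties.method = "first") > n() - topX) %>%  # topX pairs per Input
  select(Input, Word) %>%
  right_join(x = xf2, y = ., by = c("Input", "Word"))          # recover all time steps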
So far, my understanding is that you need semi_join, which filters the data (on the left) down to the row combinations available in the right-hand data argument.
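For intuition, a toy illustration of those semantics (vals and keys are made up for this example): only left-hand rows with a match on the right survive, and no columns are added.

vals <- tibble(id = c(1, 1, 2, 3), v = c(10, 11, 12, 13))
keys <- tibble(id = c(1, 3))
semi_join(vals, keys, by = "id")
#> # A tibble: 3 x 2
#>      id     v
#>   <dbl> <dbl>
#> 1     1    10
#> 2     1    11
#> 3     3    13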
topX = 2
semi_join(xf2, xf2 %>% group_by(Input, Word) %>%
slice_max(Value, with_ties=FALSE) %>%
group_by(Input) %>%
slice_max(Value, n= topX, with_ties = FALSE) %>%
select(Input, Word), by = c('Input', 'Word'))
# A tibble: 20 x 5
Input Time Word copy_name Value
<chr> <int> <chr> <chr> <dbl>
1 ark 10 ark Copy3 -0.1
2 ark 10 bark Copy2 -0.1
3 ark 20 ark Copy3 0.55
4 ark 20 bark Copy2 0.12
5 ark 30 ark Copy3 0.2
6 ark 30 bark Copy2 0.15
7 ark 40 ark Copy3 0.5
8 ark 40 bark Copy2 0.22
9 ark 50 ark Copy3 0.4
10 ark 50 bark Copy2 0.1
11 bark 10 bar Copy2 -0.1
12 bark 10 bark Copy2 -0.1
13 bark 20 bar Copy2 0.13
14 bark 20 bark Copy2 0.15
15 bark 30 bar Copy2 0.25
16 bark 30 bark Copy2 0.3
17 bark 40 bar Copy2 0.4
18 bark 40 bark Copy2 0.7
19 bark 50 bar Copy2 0.35
20 bark 50 bark Copy2 0.7
Your second part would be solved by this code:
topX <- 2L
xf2 %>% semi_join(xf2 %>% group_by(Input) %>%
slice_max(Value, n= topX) %>% select(Input, Word), by = c("Input", "Word"))
# A tibble: 12 x 5
Input Time Word copy_name Value
<chr> <int> <chr> <chr> <dbl>
1 ark 1 ark Copy3 0
2 ark 1 bark Copy2 0
3 ark 50 ark Copy3 0.05
4 ark 50 bark Copy2 0.06
5 ark 100 ark Copy3 0.55
6 ark 100 bark Copy2 0.2
7 bark 1 bar Copy2 0
8 bark 1 bark Copy2 0
9 bark 50 bar Copy2 0.7
10 bark 50 bark Copy2 0.75
11 bark 100 bar Copy2 0.4
12 bark 100 bark Copy2 0.6
I think you can use this semi_join approach for the first part as well. Please check:
xf %>%
pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'Value') %>%
semi_join(xf %>% pivot_longer(cols = starts_with('Copy'), names_to = 'copy_name',
values_to = 'Value') %>%
group_by(Input, Word) %>%
slice_max(Value) %>%
select(Input, Word, copy_name),
by = c('Input', 'Word', 'copy_name')) -> xf2