Count specific characters in one column by the categories in another column, iterating over frequency bins
I have a huge data frame df1; a heavily simplified version of it consists of 3 columns, "Words", "Frequency" and "Letters":
Words Frequency Letters
flower/tree 0.15 a(0.1)
tree 0.67 a(0.4)
planet 0.85 b(0.4)
tree/planet 0.42 c(0.5)
tree 0.89 a(0.6)
flower 0.21 b(0.4)
flower/planet 0.53 b
planet 0.07 a
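Using R (dplyr, apply-family functions, etc.), I want to count how many times each letter (a, b, c) in the "Letters" column is associated with each word (flower, tree, planet) in the "Words" column, separately for each frequency bin determined by the values in the "Frequency" column. There are 4 bins: [0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1].
I would like the output data frame df2 to look like this: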
Bin Word Letters count_letters
0-0.25 flower a 1
0-0.25 flower b 1
0-0.25 tree a 1
0-0.25 planet a 1
0.25-0.5 tree c 1
0.25-0.5 planet c 1
0.5-0.75 flower b 1
0.5-0.75 tree a 1
0.5-0.75 planet b 1
0.75-1 tree a 1
0.75-1 planet b 1
You can use cut to bin Frequency, substr to clean up Letters, and tidyr::separate_rows to unnest Words. Aggregate with dplyr::count, and you'll be set:
library(tidyverse)
df1 %>%
    separate_rows(Words) %>%    # one row per word: "flower/tree" becomes two rows
    count(Frequency = cut(Frequency, breaks = seq(0, 1, .25)),    # bin into 4 intervals
          Words,
          Letters = substr(Letters, 1, 1))    # use regex if more than one letter
## Source: local data frame [11 x 4]
## Groups: Frequency, Words [?]
##
## Frequency Words Letters n
## <fctr> <chr> <chr> <int>
## 1 (0,0.25] flower a 1
## 2 (0,0.25] flower b 1
## 3 (0,0.25] planet a 1
## 4 (0,0.25] tree a 1
## 5 (0.25,0.5] planet c 1
## 6 (0.25,0.5] tree c 1
## 7 (0.5,0.75] flower b 1
## 8 (0.5,0.75] planet b 1
## 9 (0.5,0.75] tree a 1
## 10 (0.75,1] planet b 1
## 11 (0.75,1] tree a 1
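For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the example data by hand and makes three choices that are assumptions of mine, not part of the answer above: sep = "/" is passed explicitly to separate_rows, sub() with a regex replaces substr() so that a letter code longer than one character would survive, and include.lowest = TRUE closes the first bin at 0. The column names Bin and count_letters are taken from the desired df2 in the question.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question (df1 is the name used there)
df1 <- data.frame(
  Words = c("flower/tree", "tree", "planet", "tree/planet",
            "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
              "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  # Assumption: words are always separated by "/"
  separate_rows(Words, sep = "/") %>%
  mutate(
    # Drop the parenthesised suffix, e.g. "a(0.1)" -> "a"; "b" stays "b"
    Letters = sub("\\(.*$", "", Letters),
    # Assumption: a Frequency of exactly 0 belongs to the first bin
    Bin = cut(Frequency, breaks = seq(0, 1, 0.25), include.lowest = TRUE)
  ) %>%
  # One row per (Bin, Words, Letters) combination with its number of occurrences
  count(Bin, Words, Letters, name = "count_letters")

print(df2)
```

Without include.lowest = TRUE, cut builds intervals that are open on the left, so a Frequency of exactly 0 would come back as NA. The sample data has no such value, so both variants give the same counts here.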