Count specific characters from column associated with dual categories of other column. Do it iteratively based on frequency bins

Question

I have a huge data frame df1; a heavily simplified version consists of 3 columns, "Words", "Frequency" and "Letters":

Words           Frequency   Letters
flower/tree     0.15        a(0.1)
tree            0.67        a(0.4)
planet          0.85        b(0.4)
tree/planet     0.42        c(0.5)
tree            0.89        a(0.6)
flower          0.21        b(0.4)
flower/planet   0.53        b
planet          0.07        a

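Using R (dplyr, apply-family functions, etc.) I want to count the number of times each letter of the "Letters" column (a, b, c) is associated with each word from the "Words" column (flower, tree, planet), doing this separately for each frequency bin, where the bin is determined by the value in the "Frequency" column. There are 4 bins: [0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1].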

I would like the output data frame df2 to look like this:

Bin       Word    Letters    count_letters
0-0.25    flower  a          1
0-0.25    flower  b          1
0-0.25    tree    a          1
0-0.25    planet  a          1
0.25-0.5  tree    c          1
0.25-0.5  planet  c          1
0.5-0.75  flower  b          1
0.5-0.75  tree    a          1
0.5-0.75  planet  b          1
0.75-1    tree    a          1
0.75-1    planet  b          1

Answer 1

You can use cut to bin Frequency, substr to clean Letters, and tidyr::separate_rows to unnest Words. Aggregate with dplyr::count, and you're set:

library(tidyverse)

df1 %>%
    separate_rows(Words) %>%    # one row per word: "flower/tree" becomes "flower" and "tree"
    count(Frequency = cut(Frequency, breaks = seq(0, 1, .25)),    # bins (0,0.25] ... (0.75,1]
          Words,
          Letters = substr(Letters, 1, 1))    # use regex if more than one letter

## Source: local data frame [11 x 4]
## Groups: Frequency, Words [?]
## 
##     Frequency  Words Letters     n
##        <fctr>  <chr>   <chr> <int>
## 1    (0,0.25] flower       a     1
## 2    (0,0.25] flower       b     1
## 3    (0,0.25] planet       a     1
## 4    (0,0.25]   tree       a     1
## 5  (0.25,0.5] planet       c     1
## 6  (0.25,0.5]   tree       c     1
## 7  (0.5,0.75] flower       b     1
## 8  (0.5,0.75] planet       b     1
## 9  (0.5,0.75]   tree       a     1
## 10   (0.75,1] planet       b     1
## 11   (0.75,1]   tree       a     1
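If you want the column names and bin labels from the question's df2, the same pipeline can be adapted as in the sketch below. It rebuilds the toy data from the question as df1. It also makes some assumptions that may not hold for the real data: that words are always separated by "/", and that everything from the first "(" onward in Letters can be dropped (done here with a regex instead of substr, so letters longer than one character would survive). The labels passed to cut are just the strings from the desired output. include.lowest = TRUE is added because cut uses right-closed intervals by default, so a Frequency of exactly 0 would otherwise become NA.

```r
library(dplyr)
library(tidyr)

# Toy data copied from the question
df1 <- data.frame(
  Words = c("flower/tree", "tree", "planet", "tree/planet",
            "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
              "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  # Assumption: "/" is the only separator between words
  separate_rows(Words, sep = "/") %>%
  mutate(
    # Assumption: drop everything from the first "(" onward, e.g. "a(0.1)" -> "a"
    Letters = sub("\\(.*$", "", Letters),
    # Right-closed bins; include.lowest = TRUE keeps a Frequency of exactly 0
    Bin = cut(Frequency,
              breaks = seq(0, 1, 0.25),
              labels = c("0-0.25", "0.25-0.5", "0.5-0.75", "0.75-1"),
              include.lowest = TRUE)
  ) %>%
  # One row per (Bin, Word, Letters) combination that actually occurs
  count(Bin, Word = Words, Letters, name = "count_letters")

print(df2)
```

On the toy data this gives the same 11 combinations as above, each with a count of 1. Note that a value sitting exactly on an edge such as 0.25 lands in the lower bin; pass right = FALSE to cut if you would rather have it fall into the upper one.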

