R data.table: group by and iterate by two columns
I'm new to R and trying to solve the following problem:
There is a table with two columns, books and readers, where books and readers are book IDs and reader IDs, respectively:
> books = c (1,2,3,1,1,2)
> readers = c(30, 10, 20, 20, 10, 30)
> bt = data.table(books, readers)
> bt
books readers
1: 1 30
2: 2 10
3: 3 20
4: 1 20
5: 1 10
6: 2 30
For each pair of books, I need to count the number of readers who have read both books, using this algorithm:
for each book
    for each reader of the book
        for each other_book in books of the reader
            increment common_reader_count ((book, other_book), cnt)
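To implement the algorithm above, I need to split this data into two lists: 1) a list of books containing the readers of each book, and 2) a list of readers containing the books read by each reader, for example: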
> bookList = list(
+ list(1, list(30, 20, 10)),
+ list(2, list(10, 30)),
+ list(3, list(20))
+ )
>
> readerList = list (
+ list(30, list(1,2)),
+ list(20, list(3,1)),
+ list(10, list(2,1))
+ )
>
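Questions:
1) What function should I use to build these lists from the books table?
2) From bookList and readerList, how do I generate the book pairs together with the number of readers who have read both books? For the bt table above, the result should be: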
((1, 2), 2)
((1,3), 1)
((2,3), 0)
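The order of the books within a pair does not matter, so, for example, (1,2) and (2,1) should be reduced to one of them.
Please advise on functions and data structures to solve this problem. Thanks!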
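Update:
Ideally, I need to get a matrix with book IDs as both rows and columns. Each cell holds the count of readers who have read that pair of books. So for the example above the matrix should be: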
books | 1 | 2 | 3 |
1 | 1 | 2 | 1 |
2 | 2 | 1 | 0 |
3 | 1 | 0 | 1 |
Which means:
book 1 and 2 are read together by 2 readers
book 1 and 3 are read together by 1 reader
book 2 and 3 are read together by 0 readers
How can I build such a matrix?
Try this:
## gives you a separate list of readers for each book
list_books <- split(bt$readers, bt$books)
## gives you a separate list of books for each reader
list_readers <- split(bt$books, bt$readers)
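To connect split() to the pair counts asked for in the question, here is a minimal base R sketch. The names books_by_reader, pairs and pair_counts are made up for this illustration, and it assumes each reader/book combination should count only once.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# books read by each reader, as a list named by reader ID
books_by_reader <- split(books, readers)

# for every reader, enumerate each unordered pair of books they have read
pairs <- do.call(rbind, lapply(books_by_reader, function(b) {
  b <- sort(unique(b))
  if (length(b) < 2) return(NULL)   # a single book forms no pair
  t(combn(b, 2))                    # one row per pair, smaller ID first
}))

# count how many readers contributed each pair
pairs <- as.data.frame(pairs)
names(pairs) <- c("book1", "book2")
pairs$n_readers <- 1
pair_counts <- aggregate(n_readers ~ book1 + book2, data = pairs, FUN = sum)
pair_counts
```

Pairs that no reader shares, such as (2,3) here, simply do not appear in pair_counts; they would have to be added back with a count of 0 if needed.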
An alternative output, as a data.table, giving the number of books read by each reader and the number of readers of each book:
bt[ , .("N Books" = length(unique(books))), by = readers]
bt[ , .("N Readers" = length(unique(readers))), by = books]
For the second part of your question, I would use the following:
bt2 <- bt[ , .N, by = .(readers, books)]
library(tidyr)
spread(bt2, key = books, value = "N", fill = 0)
The output is a table that gives 1 if the book was read by reader X and 0 otherwise:
readers 1 2 3
1: 10 1 1 0
2: 20 1 0 1
3: 30 1 1 0
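One way to get from a reader-by-book 0/1 table like this to the book-by-book matrix from the question's update is a matrix cross product. This is a sketch in base R rather than part of the answer above; incidence and co_readers are names chosen for the example.

```r
books   <- c(1, 2, 3, 1, 1, 2)
readers <- c(30, 10, 20, 20, 10, 30)

# reader x book incidence matrix: 1 if the reader has read the book, else 0
incidence <- 1 * (unclass(table(readers, books)) > 0)

# t(incidence) %*% incidence: cell [i, j] counts readers who read both i and j
co_readers <- crossprod(incidence)
co_readers
```

The off-diagonal cells match the counts in the question (2, 1 and 0). The diagonal holds the number of readers of each book (3, 2, 1) rather than the 1s shown in the question, so overwrite it with diag(co_readers) <- 1 if that exact layout is required.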
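Here is a base R solution that tests whether each pair was read by the same reader. If you absolutely need data.table, someone else can add a data.table version:
books = c(1,2,3,1,1,2)
readers = c(30, 10, 20, 20, 10, 30)
bks = data.frame(books, readers)
cmb <- t(combn(sort(unique(books)), 2))  # one row per unordered pair of books
combos <- as.data.frame(cmb)
bktbl <- t(table(bks))                   # rows: readers, columns: books
# for each pair, count the readers who have read both books
n_both <- integer(nrow(cmb))
for (i in seq_len(nrow(cmb))) {
  b1 <- as.character(cmb[i, 1])          # index columns by book ID, not by position
  b2 <- as.character(cmb[i, 2])
  n_both[i] <- sum(bktbl[, b1] > 0 & bktbl[, b2] > 0)
}
combos$PairRead <- ifelse(n_both > 0, "yes", "no")
combos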
V1 V2 PairRead
1 1 2 yes
2 1 3 yes
3 2 3 no
Here is another option:
combs <- combn(unique(bt$books), 2)  # Generate combos of books
setkey(bt, books)
both.read <- bt[  # Cartesian join all combos to our data
  data.table(books = c(combs), combo.id = c(col(combs))), allow.cartesian = TRUE
][,
  .(  # For each combo, figure out how many readers show up twice, meaning they've read both books
    read.both = sum(duplicated(readers)),
    book1 = min(books), book2 = max(books)
  ),
  by = combo.id
]
dcast(  # dcast to desired format
  both.read, book1 ~ book2, value.var = "read.both", fun.aggregate = sum
)
This produces:
book1 2 3
1: 1 2 1
2: 2 0 0
Note that by design this only produces non-equivalent combinations (i.e. we don't show books 1-2 and 2-1, only 1-2, since they are the same).
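A different data.table route to the same counts is a self-join on readers. This is only a sketch of that alternative, not the approach used above. It assumes each reader/book row appears at most once in bt (otherwise run unique(bt) first), and joined and pair_counts are illustrative names.

```r
library(data.table)

bt <- data.table(books   = c(1, 2, 3, 1, 1, 2),
                 readers = c(30, 10, 20, 20, 10, 30))

# pair every book with every book read by the same reader;
# the copy of `books` coming from the inner table is named `i.books`
joined <- bt[bt, on = "readers", allow.cartesian = TRUE]

# keep each unordered pair once (books < i.books) and count the readers
pair_counts <- joined[books < i.books,
                      .(read.both = .N),
                      by = .(book1 = books, book2 = i.books)]
setorder(pair_counts, book1, book2)
pair_counts
```

As with the other pair-list approaches, pairs with no common reader (here 2-3) are absent from the result rather than listed with a count of 0.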