How to collect unique values and sum another column with conditions
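I have a large set of financial transaction data, roughly one million rows, and I want to condense it into a new data frame containing a list of unique user IDs. I then want to add up the "Trades" for each account under certain conditions, i.e. if TransactionTypeId == 2 & AC_Type == 19. In Excel I would use SUMIFS, but the size of the file makes it almost impossible to run on my computer.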
df<- structure(list(UserId = c(1, 1, 1, 1, 2,
2, 2, 3, 3, 3, 4, 5, 6,
6, 6, 7, 7, 7, 8, 8, 8,
8, 8, 9, 9, 9, 10, 11, 12,
12, 13, 13, 13, 14, 14, 15, 15,
16, 16, 16), TransactionTypeId = c(14, 1, 1, 70,
15, 1, 1, 14, 1, 1, 70, 14, 14, 1, 1, 14, 1, 1, 14, 1, 1, 1,
1, 14, 1, 1, 14, 14, 1, 1, 14, 1, 1, 1, 1, 70, 70, 14, 1, 1),
AC_Type = c(21, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19,
19, 19, 19, 21, 21, 21, 19, 19, 19, 19, 19, 19, 19, 19, 20,
19, 19, 19, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Trades = c(30,
30, 0.00067116, 0.00067115, 249, 249, 0.00533033, 48.75,
48.75, 0.00101298, 0.00533, 24.37, 146.25, 146.25, 0.00309109,
100.01, 100.01, 0.00233551, 97.5, 90, 0.00189134, 5, 0.00245851,
234, 234, 0.00500802, 100.01, 48.75, 48.5, 0.0275474, 24,
24, 0.00051975, 100, 0.00223998, 0.00051975, 0.00205, 9.75,
8.75, 0.00017811)), row.names = c(NA, -40L), class = c("tbl_df",
"tbl", "data.frame"))
You can take the sum of Trades over the rows that satisfy the logical condition, within each UserId:
library(dplyr)
df %>%
group_by(UserId) %>%
summarise(count = sum(Trades[TransactionTypeId == 2 & AC_Type== 19]))
Not quite sure what you are after...
library(dplyr)
df %>%
group_by(UserId) %>%
filter(TransactionTypeId == 1 & AC_Type == 19) %>%
summarise(sum = sum(Trades))
# A tibble: 6 x 2
  UserId   sum
   <dbl> <dbl>
1      2 249.
2      3  48.8
3      6 146.
4      8  95.0
5      9 234.
6     12  48.5
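Here you first group_by UserId, then filter the rows that match your condition (note: I changed the 2 to a 1, since the sample data contains no TransactionTypeId of 2), and finally summarise by summing the values in Trades.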
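One difference between the two dplyr answers may matter if the goal is one row per unique user: filtering before summarising drops users that have no matching rows, whereas subsetting inside sum() keeps every user and gives them a total of 0. Below is a minimal sketch on a small made-up data frame (`toy`, `keep_all` and `matched_only` are hypothetical names, and the values are not the question's data):

```r
library(dplyr)

# Hypothetical toy data: user 1 has no rows matching the condition
toy <- tibble(
  UserId            = c(1, 1, 2, 2, 3),
  TransactionTypeId = c(14, 1, 1, 1, 1),
  AC_Type           = c(21, 21, 19, 19, 19),
  Trades            = c(30, 30, 249, 1, 48.75)
)

# Subsetting inside sum(): every UserId appears; non-matching users get 0
keep_all <- toy %>%
  group_by(UserId) %>%
  summarise(sum = sum(Trades[TransactionTypeId == 1 & AC_Type == 19]))

# Filtering first: only users with at least one matching row appear
matched_only <- toy %>%
  filter(TransactionTypeId == 1 & AC_Type == 19) %>%
  group_by(UserId) %>%
  summarise(sum = sum(Trades))
```

Here keep_all has three rows (user 1 with a sum of 0), while matched_only has only users 2 and 3. Which one fits depends on whether users without qualifying trades should appear in the result.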
Using data.table:
library(data.table)
setDT(df)[, .(count = sum(Trades[TransactionTypeId == 2 &
AC_Type== 19], na.rm = TRUE)), UserId]
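As a variation on the above, the condition can also go in the `i` argument, so that only matching rows are grouped. That drops users with no matching rows, so this sketch joins the totals back onto the full list of unique users. The table `toy` and the names `total`, `matched`, `all_users` and `res` are made up for illustration, and the condition uses 1 instead of 2 so that some rows match:

```r
library(data.table)

# Hypothetical toy data: user 1 has no rows matching the condition
toy <- data.table(
  UserId            = c(1, 1, 2, 2, 3),
  TransactionTypeId = c(14, 1, 1, 1, 1),
  AC_Type           = c(21, 21, 19, 19, 19),
  Trades            = c(30, 30, 249, 1, 48.75)
)

# Filter in i, then sum Trades per UserId
matched <- toy[TransactionTypeId == 1 & AC_Type == 19,
               .(total = sum(Trades)),
               by = UserId]

# Join back onto all unique users; users without matches get NA, set to 0
all_users <- unique(toy[, .(UserId)])
res <- matched[all_users, on = "UserId"]
res[is.na(total), total := 0]
```

After this, res has one row per unique UserId, with a total of 0 for users that had no qualifying trades.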