Making a while loop more efficient for use on a large data.table to delete rows based on certain conditions
I have quite a lot of data in a data.table, and I want to delete some rows if a cell contains a certain value.
Here is an excerpt of my data.table:
              V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
 1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
 2: AT0000A1EKT9          .Close      #NV
 3:                    Ask.Close      #NV
 4:                    BID.Close      #NV
 5:               Bid Ask Spread      #NV        0        0        0        0        0        0        0
 6:              TR.IssuerRating      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1
 7: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
 8:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
 9:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
10:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
11:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
Here is the output of dput(head(x)), so that the table can be reproduced easily:
test1 <- setDT(structure(list(V1 = c("", "AT0000A1EKT9", "", "", "", "", "AT0000A17HT4",
"", "", "", ""), V2 = c("01.01.16", ".Close", "Ask.Close", "BID.Close",
"Bid Ask Spread", "TR.IssuerRating", ".Close", "Ask.Close", "BID.Close",
"Bid Ask Spread", "TR.IssuerRating"), V3 = c("04.01.16", "#NV",
"#NV", "#NV", "#NV", "ba1", "3.436", "98.092", "97.537", "0.555",
"P-2"), V4 = c("05.01.16", "", "", "", "0", "ba1", "3.426", "98.149",
"97.594", "0.555", "P-2"), V5 = c("06.01.16", "", "", "", "0",
"ba1", "3.376", "98.43", "97.874", "0.556", "P-2"), V6 = c("07.01.16",
"", "", "", "0", "ba1", "3.347", "98.596", "98.039", "0.557",
"P-2"), V7 = c("08.01.16", "", "", "", "0", "ba1", "3.388", "98.366",
"97.81", "0.556", "P-2"), V8 = c("11.01.16", "", "", "", "0",
"ba1", "3.379", "98.415", "97.859", "0.556", "P-2"), V9 = c("12.01.16",
"", "", "", "0", "ba1", "3.349", "98.584", "98.027", "0.557",
"P-2"), V10 = c("13.01.16", "", "", "", "0", "ba1", "3.325",
"98.721", "98.164", "0.557", "P-2"), V11 = c("14.01.16", "",
"", "", "0", "ba1", "3.3", "98.863", "98.305", "0.558", "P-2"
), V12 = c("15.01.16", "", "", "", "0", "ba1", "3.26", "99.089",
"98.53", "0.559", "P-2"), V13 = c("18.01.16", "", "", "", "0",
"ba1", "3.271", "99.026", "98.468", "0.558", "P-2"), V14 = c("19.01.16",
"", "", "", "0", "ba1", "3.244", "99.177", "98.618", "0.559",
"P-2"), V15 = c("20.01.16", "", "", "", "0", "ba1", "3.238",
"99.211", "98.652", "0.559", "P-2"), V16 = c("21.01.16", "",
"", "", "0", "ba1", "3.268", "99.044", "98.487", "0.557", "P-2"
)), row.names = c(NA, -11L), class = c("data.table", "data.frame"
)))
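My data is grouped by ISIN number in column 1. For some of these ISINs I have no price data at all, which shows up as #NV next to .Close.
If there is a #NV next to .Close, I want to delete the entire ISIN entry.
After deleting the rows, my data.table should look like this: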
             V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
2: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
3:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
4:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
5:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
6:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
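I wrote a while loop that works on a small amount of test data. However, when I apply it to my full data.table, it is so inefficient and takes so long to run that it is unusable, since I have about 1 million rows.
The while loop looks like this: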
i <- 1
while (i < dim(test1)[1]) {
  if (test1$V2[i] == ".Close" & test1$V3[i] == "#NV") {
    a <- i + 4               # upper bound of the range of rows to delete
    test1 <- test1[-c(i:a)]  # delete the rows and overwrite the data.table
    i <- 1                   # restart from the top, since the data.table is now smaller
  } else {
    i <- i + 1
  }
}
Is there a way to make this loop more efficient?
If V3 is the column you want to filter on, a quick solution is to use dplyr::filter() to remove the rows that meet a given condition.
data.filtered <- dplyr::filter(data, V3 != "#NV")
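One caveat is worth checking against the sample data: a plain row filter drops only the rows whose own V3 is #NV, so the TR.IssuerRating row of the affected ISIN survives. The sketch below shows this with data.table syntax. The three-column `test1` is a cut-down copy of the question's data (an assumption made to keep the example short), and `row_filtered` and `leftover` are illustrative names.

```r
library(data.table)

# Cut-down copy of the question's data: only V1..V3 are needed here.
test1 <- data.table(
  V1 = c("", "AT0000A1EKT9", "", "", "", "", "AT0000A17HT4", "", "", "", ""),
  V2 = c("01.01.16", ".Close", "Ask.Close", "BID.Close", "Bid Ask Spread",
         "TR.IssuerRating", ".Close", "Ask.Close", "BID.Close",
         "Bid Ask Spread", "TR.IssuerRating"),
  V3 = c("04.01.16", "#NV", "#NV", "#NV", "#NV", "ba1", "3.436", "98.092",
         "97.537", "0.555", "P-2")
)

# Row-wise filter: removes only the rows where V3 itself is "#NV".
row_filtered <- test1[V3 != "#NV"]

# The rating row of the first ISIN is still present after the row filter.
leftover <- row_filtered[V2 == "TR.IssuerRating" & V3 == "ba1"]
```

So a row filter on its own is probably not enough here; the rows first need some notion of which ISIN block they belong to.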
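From a data-integrity perspective, a blank V1 is in a different group from a non-blank V1: your data has three groups, two with 1 row each (the non-blank ones) and one with 9 rows (all blank V1). To properly "group by V1", we need to fix that. I recognize that non-repeating values (blanks) are sometimes aesthetically preferred, so I'll keep the original column and fill a copy.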
dat[, V1b := V1
][!nzchar(V1b),V1b := NA
][,V1b := zoo::na.locf(V1b, na.rm = FALSE)]
dat
#               V1              V2       V3       V4       V5       V6       V7       V8       V9      V10      V11      V12
#           <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
#  1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16 14.01.16 15.01.16
#  2: AT0000A1EKT9          .Close      #NV
#  3:                    Ask.Close      #NV
#  4:                    BID.Close      #NV
#  5:               Bid Ask Spread      #NV        0        0        0        0        0        0        0        0        0
#  6:              TR.IssuerRating      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1      ba1
#  7: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325      3.3     3.26
#  8:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721   98.863   99.089
#  9:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164   98.305    98.53
# 10:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557    0.558    0.559
# 11:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
# 5 variables not shown: [V13 <char>, V14 <char>, V15 <char>, V16 <char>, V1b <char>]
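If adding zoo as a dependency is undesirable, the same fill-forward can be sketched in base R by indexing the non-blank values with a running count. This is an alternative swapped in for illustration, not what the answer above uses. `fill_forward` is a hypothetical helper name, and the sketch assumes blanks are empty strings rather than NA.

```r
# Hypothetical helper: carry the last non-blank value forward.
# Entries before the first non-blank value become NA.
fill_forward <- function(x) {
  filled <- nzchar(x)  # TRUE where a value is present
  # cumsum(filled) counts the non-blank values seen so far; the extra
  # leading NA covers positions where that count is still 0.
  c(NA_character_, x[filled])[cumsum(filled) + 1L]
}

v1 <- c("", "AT0000A1EKT9", "", "", "", "", "AT0000A17HT4", "", "", "", "")
v1b <- fill_forward(v1)
```

On a data.table this could then be applied as dat[, V1b := fill_forward(V1)].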
From here, I think the filtering becomes easier.
dat[, .SD[!any(grepl("Close$", V2) & V3 == "#NV"),], by = V1b]
#             V1b           V1              V2       V3       V4       V5       V6       V7       V8       V9      V10      V11
#          <char>       <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
# 1:         <NA>                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16 14.01.16
# 2: AT0000A17HT4 AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325      3.3
# 3: AT0000A17HT4                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721   98.863
# 4: AT0000A17HT4                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164   98.305
# 5: AT0000A17HT4               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557    0.558
# 6: AT0000A17HT4              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
# 5 variables not shown: [V12 <char>, V13 <char>, V14 <char>, V15 <char>, V16 <char>]
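With around a million rows and many ISINs, evaluating .SD once per group can itself become slow. A possible alternative, sketched on the assumption that V1b has already been filled forward, is to collect the offending group keys once and drop them with a single vectorised subset. The cut-down `dat` below and the names `bad` and `kept` are illustrative only.

```r
library(data.table)

# Cut-down sample with V1b already filled forward (assumption).
dat <- data.table(
  V1b = c(NA, rep("AT0000A1EKT9", 5), rep("AT0000A17HT4", 5)),
  V2 = c("01.01.16",
         rep(c(".Close", "Ask.Close", "BID.Close", "Bid Ask Spread",
               "TR.IssuerRating"), 2)),
  V3 = c("04.01.16", "#NV", "#NV", "#NV", "#NV", "ba1", "3.436", "98.092",
         "97.537", "0.555", "P-2")
)

# Group keys that have "#NV" next to any row whose label ends in "Close".
bad <- dat[grepl("Close$", V2) & V3 == "#NV", unique(V1b)]

# Keep every row whose group is not flagged; the NA header group is kept too.
kept <- dat[!(V1b %in% bad)]
```

This does one pass to find the flagged ISINs and one pass to subset, instead of building a sub-table per group.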
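A fine example of the challenges that beautifully formatted Excel spreadsheets pose for data analysis.
If I understand correctly, the OP wants to remove all rows of a section if any of the values in column V3 is equal to #NV. A new section starts with a non-empty entry in column V1.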
library(data.table)
setDT(test1)[, grp := cumsum(nzchar(V1))][
, if (!any(V3 == "#NV")) .SD, by = grp][
, grp := NULL][]
             V1              V2       V3       V4       V5       V6       V7       V8       V9      V10
         <char>          <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>   <char>
1:                     01.01.16 04.01.16 05.01.16 06.01.16 07.01.16 08.01.16 11.01.16 12.01.16 13.01.16
2: AT0000A17HT4          .Close    3.436    3.426    3.376    3.347    3.388    3.379    3.349    3.325
3:                    Ask.Close   98.092   98.149    98.43   98.596   98.366   98.415   98.584   98.721
4:                    BID.Close   97.537   97.594   97.874   98.039    97.81   97.859   98.027   98.164
5:               Bid Ask Spread    0.555    0.555    0.556    0.557    0.556    0.556    0.557    0.557
6:              TR.IssuerRating      P-2      P-2      P-2      P-2      P-2      P-2      P-2      P-2
6 variable(s) not shown: [V11 <char>, V12 <char>, V13 <char>, V14 <char>, V15 <char>, V16 <char>]
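For comparison, the original while loop can also be translated fairly literally into vectorised code, without any grouping. Like the loop, this sketch assumes that every ISIN block is exactly five rows long (the start row plus the next four); if block lengths vary, the grouped approach above is the safer choice. The cut-down `test1` and the names `start`, `drop` and `result` are illustrative only.

```r
library(data.table)

# Cut-down copy of the question's data: only V1..V3 are needed here.
test1 <- data.table(
  V1 = c("", "AT0000A1EKT9", "", "", "", "", "AT0000A17HT4", "", "", "", ""),
  V2 = c("01.01.16", ".Close", "Ask.Close", "BID.Close", "Bid Ask Spread",
         "TR.IssuerRating", ".Close", "Ask.Close", "BID.Close",
         "Bid Ask Spread", "TR.IssuerRating"),
  V3 = c("04.01.16", "#NV", "#NV", "#NV", "#NV", "ba1", "3.436", "98.092",
         "97.537", "0.555", "P-2")
)

# Rows where a block starts with ".Close" next to "#NV".
start <- which(test1$V2 == ".Close" & test1$V3 == "#NV")

# Each flagged block spans its start row plus the next four rows.
drop <- as.vector(outer(start, 0:4, `+`))
drop <- drop[drop <= nrow(test1)]  # guard against running past the last row

# Subset once with a logical mask, which also behaves sensibly
# when no block is flagged at all.
result <- test1[!(seq_len(nrow(test1)) %in% drop)]
```

The key difference from the loop is that the table is subset once, rather than copied and rescanned from row 1 after every deletion.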