Manipulate char vectors inside a data.table object in R
I'm still fairly new to data.table and to understanding all of its subtleties. I've looked at the documentation and at other examples on SO, but I can't find what I'm looking for, so please help!
I have a data.table that is basically a character vector (each entry is a sentence):
DT=c("I love you","she loves me")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
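(As an aside, and purely a style point rather than a speed one: the three conversion lines above can be collapsed into a single data.table() call, giving the same column name and key:)

```r
library(data.table)

# Equivalent one-step construction of the same keyed table
DT <- data.table(text = c("I love you", "she loves me"), key = "text")
DT
#            text
# 1:   I love you
# 2: she loves me
```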
# > DT
# text
# 1: I love you
# 2: she loves me
What I would like is to be able to do some basic string manipulation inside the DT object. For example, add a new column in which each entry is a character vector whose elements are the WORDS of the string in the "text" column.
So I would like a new column charvec where:
> DT[1]$charvec
[1] "I"    "love" "you"
Of course, I would like to do it the data.table way, blazing fast, because I will need to do this kind of thing on files larger than 1 GB, and with more complex, computation-heavy functions. So: no APPLY, LAPPLY, or MAPPLY.
My closest attempt so far is the following:
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
# text charvec
# 1: I love you I,love,you
# 2: she loves me she,loves,me
For example, to make a function that removes the first word of each sentence, I did this:
myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
# text charvec
# 1: I love you love,you
# 2: she loves me loves,me
The problem is that in the charvec column I get a list rather than a vector...
> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"
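(For reference, each cell of such a column is a list of length 1 that can be unwrapped with [[, and the list column itself can be created in a single := assignment, since strsplit() already returns one vector per input string; a sketch on the toy table above:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# strsplit() returns a list with one character vector per row,
# which is exactly what a data.table list column stores:
DT[, charvec := strsplit(text, " ")]

# Each cell is wrapped in a list of length 1; [[ unwraps it:
DT$charvec[[1]]
# [1] "I"    "love" "you"
```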
1) How can I do what I want? Other kinds of functions I would like to use are subsetting the char vector, applying some hash to it, etc.
2) By the way, can I get to DU2 or DV2 in one line instead of two?
3) I don't really understand data.table syntax. Why does the V1 column disappear when I use the list() command inside [...]?
4) On another thread, I read something about the function cSplit. Is it any good? Is it a function adapted to data.table objects?
Many thanks
UPDATE
Thanks @Ananda Mahto. Maybe I should make my ultimate objective clearer.
I have a huge file containing 10,000,000 sentences stored as strings. As a first step in the project, I want to hash the first 5 words of each sentence. 10,000,000 sentences won't even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, roughly 10 x 1 GB files. The following code takes just a few minutes on my laptop for a single file.
library(data.table); library(digest);
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT
hash2 <- function(word){ #using library(digest)
as.numeric(paste("0x",digest(word,algo="murmur32"),sep=""))
}
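(A quick sanity check of hash2, assuming the digest package is available: murmur32 produces an 8-hex-digit string, so the parsed value always fits in 32 bits.)

```r
library(digest)

hash2 <- function(word){ # using library(digest)
  as.numeric(paste("0x", digest(word, algo = "murmur32"), sep = ""))
}

h <- hash2("I love you")
# as.numeric() parses the "0x"-prefixed hex string into a number in [0, 2^32)
stopifnot(is.numeric(h), h >= 0, h < 2^32)
```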
Then,
print(system.time({
colnames(rawdata) <- "sentence"
rawdata <- lapply(rawdata,strsplit," ")
sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
hash_list <- sapply(sentences_begin,hash2)
# remove(rawdata)
})) ## end of print system.time for loading the data
I know I'm pushing R to its limits, but I'm trying to find a faster implementation, and I'm looking at data.table features... hence all my questions.
Here is an implementation that avoids lapply, but it's actually slower!
print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
rebuildsentence <- function(S){
paste(S,collapse=" ") }
myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}
DW1 <- DV2[,myfun3(charvec),by=text]
})) #end of system.time
In this data.table implementation there is no lapply, so I hoped the hashing would be faster. However, because each cell of the column holds a list instead of a char vector, this may be slowing the whole process down significantly(?).
Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I was hoping to speed that up with a more efficient data structure? People doing similar jobs in Python, Java, etc. are done in a few seconds.
Of course, another road would be to find a faster hash function, but I assume the one in the digest package is already well optimized.
I'm not entirely sure what you're after, but you can try cSplit_l from my "splitstackshape" package to get your list column:
library(splitstackshape)
DU <- cSplit_l(DT, "DT", " ")
Then, you can write a function like the following to remove values from the list column:
RemovePos <- function(inList, pos = 1) {
lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}
Sample usage:
DU[, list(RemovePos(DT_list, 1)), by = DT]
# DT V1
# 1: I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
# DT V1
# 1: I love you I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
# DT V1
# 1: I love you you
# 2: she loves me me
UPDATE
Based on your aversion to lapply, maybe you can try something like the following:
## make a copy of your "text" column
DT[, vals := text]
## Use `cSplit` to create a "long" dataset.
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
DTL
# text vals ind
# 1: I love you I 1
# 2: I love you love 2
# 3: I love you you 3
# 4: she loves me she 1
# 5: she loves me loves 2
# 6: she loves me me 3
## Now, you can extract values easily
DTL[ind == 1]
# text vals ind
# 1: I love you I 1
# 2: she loves me she 1
DTL[ind %in% c(1, 3)]
# text vals ind
# 1: I love you I 1
# 2: I love you you 3
# 3: she loves me she 1
# 4: she loves me me 3
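(Once the data is in this long form, "the first k words of each sentence" falls out of a simple filter on ind plus paste(); a sketch that builds the same long shape with base strsplit() so it runs without splitstackshape:)

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# Same shape as the cSplit(..., "long") result above,
# built with strsplit() only:
DTL <- DT[, .(vals = strsplit(text, " ")[[1]]), by = text]
DTL[, ind := seq_len(.N), by = text]

# First two words of each sentence, pasted back together:
DTL[ind <= 2, paste(vals, collapse = " "), by = text]
#            text        V1
# 1:   I love you    I love
# 2: she loves me she loves
```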
UPDATE 2
I don't know what kind of timings you're getting, but as I mentioned in a comment, you might try using a regular expression so that you don't have to split the string and then paste it back together.
Here's an example....
Set up some data to play with:
library(data.table)
DT <- data.table(
text = c("This is a sentence with a lot of words.",
"This is a sentence with some more words.",
"Words and words and even some more words.",
"But, I don't know how you want to deal with punctuation...",
"Just one more sentence, for easy multiplication.")
)
DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))
Test the gsub pattern for extracting the first 5 words from each sentence....
## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"
## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text, perl = TRUE)])
# user system elapsed
# 0.03 0.00 0.03
system.time(temp2 <- DT3[, gsub(patt, "\\1", text, perl = TRUE)])
# user system elapsed
# 3 0 3
head(temp)
# [1] "This is a sentence with" "This is a sentence with" "Words and words and even"
# [4] "But, I don't know how" "Just one more sentence, for" "This is a sentence with"
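(One caveat worth stating, my addition rather than part of the timing test above: a sentence with fewer than five words never matches patt, so gsub() returns it unchanged instead of failing; worth checking for in real data.)

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"

# No match -> the input comes back untouched
gsub(patt, "\\1", "too short", perl = TRUE)
# [1] "too short"

# Exactly five words -> the trailing .* (here: nothing) is dropped
gsub(patt, "\\1", "one two three four five", perl = TRUE)
# [1] "one two three four five"
```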
Here's my guess at what you want to do....
## I'm assuming you want something like this....
## Takes about a minute on my system.
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
, firstFive := gsub(patt, "\\1", text, perl = TRUE)][
, firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
# user system elapsed
# 62.14 0.05 62.20
head(out)
# text firstFive firstFiveHash
# 1: This is a sentence with a lot of words. This is a sentence with 4179639471
# 2: This is a sentence with some more words. This is a sentence with 4179639471
# 3: Words and words and even some more words. Words and words and even 2556713080
# 4: But, I don't know how you want to deal with punctuation... But, I don't know how 3765680401
# 5: Just one more sentence, for easy multiplication. Just one more sentence, for 298317689
# 6: This is a sentence with a lot of words. This is a sentence with 4179639471