Regular expression taking too much time to run in R
I read a file into ratingsFile using
ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")
The first few lines of the file look like
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
Using regular expressions I extracted
match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
match <- regmatches(ratingsFile,match)
next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
next_match <- regmatches(ratingsFile,next_match)
Sample output of match looks like
"0000000125" "1014575" "9.2" "The" "Godfather" "1972"
To clean the data and get it into the form I need, I did
movies_name <- character(0)
rating <- character(0)
for(i in 1:length(match)){
  # drop the first three fields, which are not needed
  match[[i]] <- match[[i]][-1:-3]
  len <- length(match[[i]])
  # drop the last field (the year), also not needed
  match[[i]] <- match[[i]][-len]
  # append the movie name
  movies_name <- append(movies_name, paste(match[[i]], collapse = " "))
  # append the rating
  rating <- append(rating, next_match[[i]])
}
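Now this last piece of code takes too long to execute. I left it running for hours and it still had not finished, because the file is 636497 lines long.
How can I reduce the running time in this case?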
Try this:
ratingsFile <- readLines(n = 6)
0000000125 1478759 9.2 The Shawshank Redemption (1994)
0000000125 1014575 9.2 The Godfather (1972)
0000000124 683611 9.0 The Godfather: Part II (1974)
0000000124 1451861 8.9 The Dark Knight (2008)
0000000124 1150611 8.9 Pulp Fiction (1994)
0000000133 750978 8.9 Schindler's List (1993)
setNames(
  as.data.frame(t(sapply(
    regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile)),
    "[", -1))),
  c("rating", "movie_name"))
# rating movie_name
# 1 9.2 The Shawshank Redemption
# 2 9.2 The Godfather
# 3 9.0 The Godfather: Part II
# 4 8.9 The Dark Knight
# 5 8.9 Pulp Fiction
# 6 8.9 Schindler's List
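The loop in the question is probably slow because it grows two vectors with append() once per line, which copies them again and again. As a rough sketch of a fully vectorized variant (not the code from this answer), sub() with backreferences can pull out both fields for all lines in one call each. The pattern and the name `ratings` are assumptions for illustration, and lines that do not match the pattern would come back unchanged, so a real file may need filtering first.

```r
# Sample lines in the layout shown in the question
ratingsFile <- c(
  "0000000125 1478759 9.2 The Shawshank Redemption (1994)",
  "0000000125 1014575 9.2 The Godfather (1972)",
  "0000000124 683611 9.0 The Godfather: Part II (1974)"
)

# Assumed layout: two numeric columns, the rating, the title, then "(year)"
pattern <- "^\\s*\\d+\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s+\\(\\d{4}\\).*$"

# sub() is vectorized over ratingsFile, so no explicit loop is needed
ratings <- data.frame(
  rating     = as.numeric(sub(pattern, "\\1", ratingsFile, perl = TRUE)),
  movie_name = sub(pattern, "\\2", ratingsFile, perl = TRUE),
  stringsAsFactors = FALSE
)

print(ratings)
```

Because nothing is grown inside a loop, the work should scale roughly linearly with the number of lines.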
If you want to find and use some of the fields in your data, I think you can use this regex:
/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm
with the substitutions
- $1 => first column
- $2 => second column
- $3 => third column (probably the rating)
- $4 => movie name
- $5 => movie year
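The regex above is written in /.../gm literal form, which R does not use. A minimal sketch of how it might be applied in R follows: the backslashes are doubled for an R string, and regexec() with regmatches() returns the capture groups per line. The column names `col1`, `col2`, `rating`, `title` and `year` are made up here for illustration, and `perl = TRUE` in regexec() assumes R 3.3.0 or later.

```r
ratingsFile <- c(
  "0000000125 1478759 9.2 The Shawshank Redemption (1994)",
  "0000000124 683611 9.0 The Godfather: Part II (1974)"
)

# Same pattern as above, with backslashes doubled for an R string literal
pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

# Each list element holds the full match followed by the five capture groups
parts <- regmatches(ratingsFile, regexec(pattern, ratingsFile, perl = TRUE))

# Stack into a matrix and drop column 1 (the full match)
fields <- do.call(rbind, parts)[, -1, drop = FALSE]
colnames(fields) <- c("col1", "col2", "rating", "title", "year")  # hypothetical names

movies <- data.frame(fields, stringsAsFactors = FALSE)
movies$title <- trimws(movies$title)  # group 4 captures the surrounding spaces too

print(movies)
```

Lines that do not match yield an empty result and are silently skipped by rbind(), so the row count can be smaller than the number of input lines.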
If I understand correctly what you want to do (get only the movie titles), here is another way to get what you want:
unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line wherever there are at least 2 whitespace characters
              function(x){ # for each resulting vector
                x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the last field (movie title) and strip the year
                x
              }))
# [1] "The Shawshank Redemption" "The Godfather" "The Godfather: Part II" "The Dark Knight" "Pulp Fiction"
# [6] "Schindler's List"
Note: you can put the resulting vector in a data.frame and/or keep the other information from the first columns.
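One way the note above could look in practice is sketched here. It assumes the columns in the real file are separated by runs of two or more spaces, as the split pattern implies, and that every usable line splits into exactly four fields. The names `col1` and `col2` are placeholders, since the thread does not say what the first two columns mean.

```r
# Sample lines; the 2+ space separators are an assumption about the real file
ratingsFile <- c(
  "0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# trimws() avoids an empty first field when a line starts with spaces
fields <- strsplit(trimws(ratingsFile), "\\s{2,}")

# Keep only lines that split into exactly four fields
ok  <- lengths(fields) == 4
tab <- do.call(rbind, fields[ok])

movies <- data.frame(
  col1       = tab[, 1],                                # hypothetical name
  col2       = as.integer(tab[, 2]),                    # hypothetical name
  rating     = as.numeric(tab[, 3]),
  movie_name = sub(" \\(\\d{4}\\)$", "", tab[, 4]),     # strip the trailing year
  stringsAsFactors = FALSE
)

print(movies)
```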