Regular expression taking too much time to compile in R

Question

I read a file into ratingsFile using

ratingsFile <- readLines("~/ratings.list",encoding = "UTF-8")

The first few lines of the file look like this:

  0000000125  1478759   9.2  The Shawshank Redemption (1994)
  0000000125  1014575   9.2  The Godfather (1972)
  0000000124  683611   9.0  The Godfather: Part II (1974)
  0000000124  1451861   8.9  The Dark Knight (2008)
  0000000124  1150611   8.9  Pulp Fiction (1994)
  0000000133  750978   8.9  Schindler's List (1993)

Using regular expressions, I extracted:

  match <- gregexpr("[0-9A-Za-z;'$%&?@./]+",ratingsFile)
  match <- regmatches(ratingsFile,match)


  next_match <- gregexpr("[0-9][.][0-9]+",ratingsFile)
  next_match <- regmatches(ratingsFile,next_match)

A sample element of match looks like this:

  "0000000125" "1014575"    "9.2"        "The"        "Godfather"  "1972"

To clean the data and get it into the form I need, I did:

  movies_name <- character(0)
  rating <- character(0)
  for(i in 1:length(match)){

      match[[i]] <- match[[i]][-1:-3] # drop the first three fields, which are not needed
      len <- length(match[[i]])
      match[[i]] <- match[[i]][-len] # drop the last field (the year), also not needed
      movies_name <- append(movies_name, paste(match[[i]], collapse = " "))
      # append the movie name
      rating <- append(rating, next_match[[i]])
      # append the rating
  }

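This last piece of code takes far too long to execute. I left it running for hours and it still had not finished, since the file is 636497 lines long.

How can I reduce the running time in this case?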

Answer 1

Try this:

ratingsFile <- readLines(n = 6)
0000000125  1478759   9.2  The Shawshank Redemption (1994)
0000000125  1014575   9.2  The Godfather (1972)
0000000124  683611   9.0  The Godfather: Part II (1974)
0000000124  1451861   8.9  The Dark Knight (2008)
0000000124  1150611   8.9  Pulp Fiction (1994)
0000000133  750978   8.9  Schindler's List (1993)
setNames(as.data.frame(t(sapply(regmatches(ratingsFile, regexec("\\d{10}\\s+\\d+\\s+([0-9.]+)\\s+(.*?)\\s\\(\\d{4}\\)", ratingsFile)), "[", -1))), c("rating", "movie_name"))
#   rating               movie_name
# 1    9.2 The Shawshank Redemption
# 2    9.2            The Godfather
# 3    9.0   The Godfather: Part II
# 4    8.9          The Dark Knight
# 5    8.9             Pulp Fiction
# 6    8.9         Schindler's List
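
For context, the loop in the question is probably slow because append() copies the whole vector on every iteration, so the work grows roughly quadratically with the number of lines. The following is only a sketch of the same cleaning steps done with vapply(), which allocates each result once. It assumes every line has at least four tokens, it keeps only the first rating-like match per line, and the two sample lines are copied from the question.

```r
# Two sample lines copied from the question
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Same two extraction steps as in the question
tokens  <- regmatches(ratingsFile, gregexpr("[0-9A-Za-z;'$%&?@./]+", ratingsFile))
ratings <- regmatches(ratingsFile, gregexpr("[0-9][.][0-9]+", ratingsFile))

# Build each title in one pass: drop the first three tokens and the last one
# (the year), then glue the rest together. No vector is grown inside a loop.
movies_name <- vapply(tokens, function(tok) {
  paste(tok[-c(1:3, length(tok))], collapse = " ")
}, character(1))

# Assumption: the first match on each line is the rating
rating <- vapply(ratings, `[`, character(1), 1)
```

As in the question's own code, characters outside the token pattern (such as the colon in "The Godfather: Part II") are lost from the title.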

Answer 2

If you want to find and use some of the data in your file, I think you can use this regex:

/^ *(\d*) *(\d*) *(\d+\.\d+)(.*)\((\d+)\)$/gm

with these substitutions:

$1 => first column
$2 => second column
$3 => third column (probably the rating)
$4 => movie name
$5 => movie year

[Regex Demo]
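
The pattern above is written as a regex literal, so here is a rough sketch of how it might be applied in R. It assumes every line matches the pattern (a non-matching line would be silently dropped by rbind), the column names are illustrative, and the sample lines are copied from the question.

```r
# Two sample lines copied from the question
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Same pattern as above, with backslashes doubled for an R string literal
pattern <- "^ *(\\d*) *(\\d*) *(\\d+\\.\\d+)(.*)\\((\\d+)\\)$"

# regexec() reports the full match followed by each capture group, per line
parts <- regmatches(ratingsFile, regexec(pattern, ratingsFile, perl = TRUE))

# Drop the full match (element 1) and stack the five groups into a matrix
groups <- do.call(rbind, lapply(parts, `[`, -1))

movies <- data.frame(
  rating     = as.numeric(groups[, 3]),
  movie_name = trimws(groups[, 4]),   # group 4 keeps the surrounding spaces
  year       = as.integer(groups[, 5]),
  stringsAsFactors = FALSE
)
```

The whole file is processed in a single vectorized call, so no explicit loop over lines is needed.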

Answer 3

If I understand correctly what you want to do (get only the movie titles), here is another way to get what you want:

unlist(lapply(strsplit(ratingsFile, "\\s{2,}"), # split each line whenever there are at least 2 spaces
                                 function(x){ # for each resulting vector
                                    x <- gsub(" \\(\\d{4}\\)$", "", tail(x, 1)) # keep only the needed part (movie title)
                                    x
                                 }))

# [1] "The Shawshank Redemption" "The Godfather"            "The Godfather: Part II"   "The Dark Knight"          "Pulp Fiction"            
# [6] "Schindler's List"

Note: you can put the resulting vector in a data.frame and/or keep the other information from the split lines.
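
As a sketch of that last point, the split fields could be collected into a data.frame like this. It assumes every line splits into exactly four fields (a title containing two consecutive spaces would break this), and the column names first_col and second_col are placeholders, since the meaning of those columns is not stated here.

```r
# Two sample lines copied from the question
ratingsFile <- c(
  "  0000000125  1478759   9.2  The Shawshank Redemption (1994)",
  "  0000000124  683611   9.0  The Godfather: Part II (1974)"
)

# Trim leading blanks first so the split does not yield an empty first field,
# then stack the four fields of each line into a character matrix
fields <- do.call(rbind, strsplit(trimws(ratingsFile), "\\s{2,}"))

result <- data.frame(
  first_col  = fields[, 1],                 # placeholder name
  second_col = as.integer(fields[, 2]),     # placeholder name
  rating     = as.numeric(fields[, 3]),
  movie_name = sub(" \\(\\d{4}\\)$", "", fields[, 4]),  # strip the trailing year
  stringsAsFactors = FALSE
)
```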
