Is there any way to read a .dat file from MovieLens into RStudio?
I am trying to read ratings.dat from MovieLens using Import Dataset in RStudio.
Basically, it has this format:
1::1::5::978824268
1::1022::5::978300055
1::1028::5::978301777
1::1029::5::978302205
1::1035::5::978301753
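So I need to replace :: with a single :, a ', whitespace, or something similar. I use Notepad++, which loads the file very quickly (compared with Notepad) and makes it easy to view very large files. However, when I do the replacement, it shows some strange characters: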
"LF"
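When I did some research here, I found that it is \n (line feed, i.e. a newline). But I don't understand why these characters are not shown when the file is loaded and only appear after I do the replacement. When I load the file into RStudio, it is still detected as "LF" rather than a line break, which causes errors when reading the data.
What is the solution? Thank you!
PS: I know there is Python code that can convert it, but I don't want to use it. Is there another way?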
Try this:
url <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
## this part is agonizingly slow
tf <- tempfile()
download.file(url,tf, mode="wb") # download archived movielens data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
ratings <- readLines(files[grepl("ratings\\.dat$", files)]) # read the ratings.dat file
ratings <- gsub("::", "\t", ratings)
# this part is much faster
library(data.table)
ratings <- fread(paste(ratings, collapse="\n"), sep="\t")
# Read 10000054 rows and 4 (of 4) columns from 0.219 GB file in 00:00:07
head(ratings)
# V1 V2 V3 V4
# 1: 1 122 5 838985046
# 2: 1 185 5 838983525
# 3: 1 231 5 838983392
# 4: 1 292 5 838983421
# 5: 1 316 5 838983392
# 6: 1 329 5 838983392
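The substitution step can be tried on the sample lines from the question without downloading anything. This is only a minimal sketch in base R; the column names are my own choice, assuming the four fields are user, movie, rating and timestamp:

```r
# Sample lines copied from the question; a real run would use
# readLines("ratings.dat") on the unzipped file instead.
lines <- c("1::1::5::978824268",
           "1::1022::5::978300055",
           "1::1028::5::978301777")

# Replace the two-character delimiter with a single tab.
# fixed = TRUE treats "::" as a literal string, not a regex.
tabbed <- gsub("::", "\t", lines, fixed = TRUE)

# read.table() accepts the modified lines directly via `text =`.
ratings <- read.table(text = tabbed, sep = "\t",
                      col.names = c("user_id", "movie_id", "rating", "timestamp"))
str(ratings)
```

This is fine for checking the format, but read.table() will be slow on the full 10M-row file, which is why the answer switches to fread.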
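Alternatively (this uses jlhoward's download code; he also updated his answer to drop the built-in functions and switch to data.table while I was writing this, but mine is still faster/more efficient :-)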
library(data.table)
# i try not to use variable names that stomp on function names in base
URL <- "http://files.grouplens.org/datasets/movielens/ml-10m.zip"
# this will be "ml-10m.zip"
fil <- basename(URL)
# this will download to getwd() since you prbly want easy access to
# the files after the machinations. the nice thing about this is
# that it won't re-download the file and waste bandwidth
if (!file.exists(fil)) download.file(URL, fil, mode = "wb") # binary mode keeps the zip intact on Windows
# this will create the "ml-10M100K" dir in getwd(). if using
# R 3.2+ you can do a dir.exists() test to avoid re-doing the unzip
# (which is useful for large archives or archives compressed with a
# more CPU-intensive algorithm)
unzip(fil)
# fast read and slicing of the input
# fread will only split on a single delimiter so the initial fread
# will create a few blank columns. the [] expression filters those
# out. the "with=FALSE" is part of the data.table inanity
mov <- fread("ml-10M100K/ratings.dat", sep=":")[, c(1,3,5,7), with=FALSE]
# saner column names, set efficiently via data.table::setnames
setnames(mov, c("user_id", "movie_id", "rating", "timestamp"))
mov
## user_id movie_id rating timestamp
## 1: 1 122 5 838985046
## 2: 1 185 5 838983525
## 3: 1 231 5 838983392
## 4: 1 292 5 838983421
## 5: 1 316 5 838983392
## ---
## 10000050: 71567 2107 1 912580553
## 10000051: 71567 2126 2 912649143
## 10000052: 71567 2294 5 912577968
## 10000053: 71567 2338 2 912578016
## 10000054: 71567 2384 2 912578173
It is much faster than the built-in functions.
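The dir.exists() check mentioned in the code comments could look roughly like this. `needs_unzip` is a hypothetical helper name of mine, not part of the answer; the paths are the ones used above, and dir.exists() needs R 3.2 or later:

```r
# Hypothetical helper (not from the original answer): TRUE only when the
# archive is present and its target directory has not been created yet.
needs_unzip <- function(zipfile, outdir) {
  file.exists(zipfile) && !dir.exists(outdir)
}

zipfile <- "ml-10m.zip"   # archive name produced by basename(URL)
outdir  <- "ml-10M100K"   # directory that unzip() creates in getwd()

# Skips the unzip step when the directory already exists
# (and does nothing if the archive has not been downloaded).
if (needs_unzip(zipfile, outdir)) unzip(zipfile)
```

Checking for the directory rather than for an individual file keeps the test cheap, at the cost of not noticing a partially extracted archive.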
A small improvement on @hrbrmstr's answer:
mov <- fread("ml-10M100K/ratings.dat", sep=":", select=c(1,3,5,7))
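To see what `select` does here without the full download, the same call can be run on a few lines from the question. This is a rough sketch that assumes the data.table package is installed; the `text =` argument, `header = FALSE` and the column names are my additions for illustration:

```r
library(data.table)

# Sample lines copied from the question
sample_lines <- c("1::1::5::978824268",
                  "1::1022::5::978300055",
                  "1::1028::5::978301777")

# With sep = ":" each "::" yields an empty field, so every line parses
# into 7 columns; select keeps only the 4 that hold data and skips the
# blank ones at read time instead of subsetting afterwards.
mov <- fread(text = sample_lines, sep = ":", header = FALSE,
             select = c(1, 3, 5, 7))

# Rename by reference, as in the answer above
setnames(mov, c("user_id", "movie_id", "rating", "timestamp"))
mov
```

On the real file, replace `text = sample_lines` with the path "ml-10M100K/ratings.dat".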