How do I read a .txt file into R with different separators and run-on lines?
I have a large .txt file in the following format, showing the date, user, and product review for a large number of users:
YYYY:MM:D1 @Username1: this is a product review
YYYY:MM:D1 @Username2: this is also a product review
YYYY:MM:D1 @Username3: this is also a product review that
runs to the next line
YYYY:MM:D1 @Username4: this here is also a product review
I want to extract this into a data frame with 3 columns, like this:
date/time username comment
yyyy/mm/dd @Username1 this is a product review
yyyy/mm/dd @Username2 this is also a product review
yyyy/mm/dd @Username3 this is also a product review contained in the same row
yyyy/mm/dd @Username4 this here is also a product review
Using the standard base R command
read.table("filename.txt", fill=TRUE)
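gives me a data frame that treats every word of the product review as a separate column. It also pushes the 'run-on lines' of sufficiently long reviews onto new rows, i.e.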
V1 V2 V3 V4 V5
yy/mm/dd Username1 this is a
product review
...
Any help is appreciated!
You can solve this in a few different ways. One approach is to import the data into a single column and then use tidyr::separate or data.table::tstrsplit to split the column at the appropriate places. Here is an example using tidyr:
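# Use a separator symbol that is unlikely to appear in the file,
# to read the data into a single column. Quote and comment handling
# are switched off so that ', " and # inside reviews stay plain text:
data <- read.table("filename.txt", sep = "^", quote = "", comment.char = "",
                   stringsAsFactors = FALSE)
# First split the column at the @-sign, and then at the first ": ".
# extra = "merge" keeps any later ": " inside the review text:
library(tidyr)
data <- data %>%
  separate(V1, into = c("Date", "User"), sep = " @", extra = "merge") %>%
  separate(User, into = c("User", "Review"), sep = ": ", extra = "merge")
# If you want to add back the @-sign to the usernames:
data$User <- paste0("@", data$User)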
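Splitting a single column does not by itself rejoin reviews that continue on the next line; those continuation lines would still arrive as rows of their own. A minimal base R sketch of one way to handle them is below. It assumes that every real record starts with a date written as YYYY:MM:DD followed by " @", and that any line not matching that pattern belongs to the previous review. The sample dates, the pattern, and the names raw, is_start, record_id, records and reviews are made up for illustration.

```r
# Sample lines standing in for the file contents (dates are invented).
# With a real file you would use: raw <- readLines("filename.txt")
raw <- c(
  "2020:01:01 @Username1: this is a product review",
  "2020:01:01 @Username2: this is also a product review",
  "2020:01:02 @Username3: this is also a product review that",
  "runs to the next line",
  "2020:01:03 @Username4: this here is also a product review"
)

# Assumption: a new record starts with a YYYY:MM:DD date followed by " @".
is_start <- grepl("^[0-9]{4}:[0-9]{2}:[0-9]{2} @", raw)

# Give each line the running count of record starts seen so far, so that
# continuation lines share the id of the record they belong to.
record_id <- cumsum(is_start)

# Glue the lines of each record back together with a single space.
records <- unname(
  vapply(split(raw, record_id), paste, character(1), collapse = " ")
)

# Pull out the three fields with one pattern: date, @username, comment.
pattern <- "^([0-9]{4}:[0-9]{2}:[0-9]{2}) (@[^:]+): (.*)$"
reviews <- data.frame(
  date     = as.Date(sub(pattern, "\\1", records), format = "%Y:%m:%d"),
  username = sub(pattern, "\\2", records),
  comment  = sub(pattern, "\\3", records),
  stringsAsFactors = FALSE
)
```

Because the comment is captured with (.*) after the first ": " that follows the username, colons inside the review text are left alone. If your dates are formatted differently, both the start-of-record pattern and the format string passed to as.Date would need adjusting.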