Convert unstructured csv file to a data frame
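I am learning R for text mining. I have a TV program schedule in CSV format. Programs usually start at 06:00 AM and run until 05:00 AM the next day, which is called a broadcast day. For example, the programs for 15 November 2015 start at 06:00 AM and end at 05:00 AM the next day.
Here is sample code showing what the schedule looks like: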
read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
Its output looks like this:
V1|V2
Sunday |
01-Nov-15 |
6 | Tom
some information about the program |
23.3 | Jerry
some information about the program |
5 | Avatar
some information about the program |
5.3 | Panda
some information about the program |
Monday |
02-Nov-15 |
6 | Jerry
some information about the program |
6.25 | Panda
some information about the program |
23.3 | Avatar
some information about the program |
7.25 | Tom
some information about the program |
I would like to convert the data above into a data.frame of the following form:
Date |Program|Synopsis
2015-11-1 06:00 |Tom | some information about the program
2015-11-1 23:30 |Jerry | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda | some information about the program
2015-11-2 06:00 |Jerry | some information about the program
2015-11-2 06:25 |Panda | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom | some information about the program
I would appreciate any suggestions or tips on functions or packages I should look at.
It is a bit messy, but it seems to work:
df <- read.table(textConnection(txt <- "Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)
cat(txt)
Sys.setlocale("LC_TIME", "English") # if needed
weekdays <- format(seq.Date(Sys.Date(), Sys.Date()+6, 1), "%A")
days <- split(df, cumsum(df$V1 %in% weekdays))
lapply(days, function(dayDF) {
tmp <- cbind.data.frame(V1=dayDF[2, 1], do.call(rbind, split(unlist(dayDF[-c(1:2), ]), cumsum(!dayDF[-(1:2), 2]==""))), stringsAsFactors = F)
tmp[, 1] <- as.Date(tmp[, 1], "%d-%B-%y")
tmp[, 2] <- as.numeric(tmp[, 2])
tmp[, 5] <- NULL
idx <- c(FALSE, diff(tmp[, 2])<0)
tmp[idx, 1] <- tmp[idx, 1] + 1
return(tmp)
}) -> days
days <- transform(do.call(rbind.data.frame, days), V1=as.POSIXct(paste(V1, sprintf("%.2f", V11)), format="%Y-%m-%d %H.%M"), V11=NULL)
names(days) <- c("Date", "Synopsis", "Program")
rownames(days) <- NULL
days[, c(1, 3, 2)]
# Date Program Synopsis
# 1 2015-11-01 06:00:00 Tom some information about the program
# 2 2015-11-01 23:30:00 Jerry some information about the program
# 3 2015-11-02 05:00:00 Avatar some information about the program
# 4 2015-11-02 06:00:00 Tom some information about the program
# 5 2015-11-02 23:30:00 Jerry some information about the program
# 6 2015-11-03 05:00:00 Avatar some information about the program
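The key idiom in the code above is split(df, cumsum(df$V1 %in% weekdays)), which cuts the rows into one block per broadcast day. As a minimal sketch of how that works, here it is on a toy vector. The names v1, day_names, block_id and blocks are made up for illustration, and the weekday names are hard-coded in English so the example does not depend on the locale.

```r
# Toy version of column V1: each weekday header is followed by that day's rows.
v1 <- c("Sunday", "01-Nov-15", "6", "23.3", "Monday", "02-Nov-15", "6")

# Hard-coded English weekday names (assumption: the file uses English names).
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# TRUE at every weekday header; cumsum() turns that into a running block id.
block_id <- cumsum(v1 %in% day_names)
print(block_id)
# 1 1 1 1 2 2 2

# split() then returns one element per broadcast day.
blocks <- split(v1, block_id)
print(blocks[["2"]])
# "Monday" "02-Nov-15" "6"
```

Because the block id only increases when a new weekday header is seen, the approach does not need to know in advance how many programs each day has.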
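1) This sets up some functions and then consists of four transform(...) %>% subset(...) snippets chained together with magrittr pipes. We assume DF is the output of the read.table call in the question.
First, load the zoo package to get access to na.locf. Define a Lead function that shifts each element by one position, and a datetime function that converts a date plus an h.m number to a date-time.
Now convert V1 to "Date" class; rows that are not dates become NA. Use Lead to shift that NA indicator by one position and subset on it, which effectively removes the weekday rows. Then use na.locf to fill in the dates and keep only the rows with duplicated dates, which effectively removes the rows that contain only a date. Next set Program to V2 and Synopsis to V1, except that V1 must be shifted with Lead because the synopsis is on the second row of each pair. Keep only the odd-positioned rows. Finally, generate the datetime and select the desired columns.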
library(magrittr)
library(zoo) # needed for na.locf
Lead <- function(x, fill = NA) c(x[-1], fill) # shift each element up one position and fill the end
datetime <- function(date, time) {
  time <- as.numeric(time)
  # as.integer() truncates to the hour; the fractional part holds the minutes
  as.POSIXct(sprintf("%s %d:%02.0f", date, as.integer(time), 100 * (time %% 1))) +
    24 * 60 * 60 * (time < 6) # add day if time < 6
}
DF %>%
transform(date = as.Date(V1, "%d-%b-%y")) %>%
subset(Lead(is.na(date), TRUE)) %>% # rm weekday rows
transform(date = na.locf(date)) %>% # fill in dates
subset(duplicated(date)) %>% # rm date rows
transform(Program = V2, Synopsis = Lead(V1)) %>%
subset(c(TRUE, FALSE)) %>% # keep odd positioned rows only
transform(Date = datetime(date, V1)) %>%
subset(select = c("Date", "Program", "Synopsis"))
giving:
Date Program Synopsis
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
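The least obvious step in (1) is how subset(Lead(is.na(date), TRUE)) drops the weekday rows. The following sketch replays it on a toy vector. Lead is defined as in the answer; v1, is_date and keep are illustrative names, and the date test uses a regular expression as a locale-independent stand-in for as.Date(V1, "%d-%b-%y").

```r
Lead <- function(x, fill = NA) c(x[-1], fill)  # same definition as in (1)

# Toy version of column V1.
v1 <- c("Sunday", "01-Nov-15", "6", "some info",
        "Monday", "02-Nov-15", "6", "some info")

# Stand-in for !is.na(as.Date(v1, "%d-%b-%y")): TRUE on date rows only.
is_date <- grepl("^[0-9]{2}-[A-Za-z]{3}-[0-9]{2}$", v1)

# A row is kept unless the row right after it is a date row.
# Weekday rows are exactly the rows that are followed by a date row.
keep <- Lead(!is_date, TRUE)
print(v1[keep])
# "01-Nov-15" "6" "some info" "02-Nov-15" "6" "some info"
```

The fill value TRUE keeps the last row, which has no successor to inspect.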
2) dplyr This uses dplyr together with the datetime function defined above. We could simply replace transform and subset in (1) with dplyr's mutate and filter, and Lead with lead, but for variety we do it a different way:
library(dplyr)
library(zoo) # na.locf
DF %>%
mutate(date = as.Date(V1, "%d-%b-%y")) %>%
filter(lead(is.na(date), default = TRUE)) %>% # rm weekday rows
mutate(date = na.locf(date)) %>% # fill in dates
group_by(date) %>%
mutate(Program = V2, Synopsis = lead(V1)) %>%
slice(seq(2, n(), by = 2)) %>%
ungroup() %>%
mutate(Date = datetime(date, V1)) %>%
select(Date, Program, Synopsis)
giving:
Source: local data frame [6 x 3]
Date Program Synopsis
(time) (chr) (chr)
1 2015-11-01 06:00:00 Tom some information about the program
2 2015-11-01 23:30:00 Jerry some information about the program
3 2015-11-02 05:00:00 Avatar some information about the program
4 2015-11-02 06:00:00 Tom some information about the program
5 2015-11-02 23:30:00 Jerry some information about the program
6 2015-11-03 05:00:00 Avatar some information about the program
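Both (1) and (2) rely on na.locf from zoo to carry each date forward over the rows that follow it. For readers who want to see what that step does without installing zoo, here is a base R sketch of the same idea. It is an approximation, not a replacement for na.locf: it assumes the first element is not NA, and the names dates, last_seen and filled are illustrative only.

```r
# Dates as they look after as.Date(): a value on date rows, NA elsewhere.
dates <- as.Date(c("2015-11-01", NA, NA, "2015-11-02", NA))

# For each position, count how many non-NA values have been seen so far.
last_seen <- cumsum(!is.na(dates))

# Index the non-NA values by that count to carry each date forward.
filled <- dates[!is.na(dates)][last_seen]
print(format(filled))
# "2015-11-01" "2015-11-01" "2015-11-01" "2015-11-02" "2015-11-02"

# duplicated() is FALSE on the first row of each date, i.e. on the date row itself.
print(duplicated(filled))
# FALSE TRUE TRUE FALSE TRUE
```

This also shows why subset(duplicated(date)) in (1) removes the date-only rows: after filling, each date row is the first occurrence of its date.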
3) data.table This also uses na.locf from the zoo package and the datetime function defined in (1):
library(data.table)
library(zoo)
dt <- data.table(DF)
dt <- dt[, date := as.Date(V1, "%d-%b-%y")][
shift(is.na(date), type = "lead", fill = TRUE)][, # rm weekday rows
date := na.locf(date)][duplicated(date)][, # fill in dates & rm date rows
Synopsis := shift(V1, type = "lead")][seq(1, .N, 2)][, # align Synopsis
c("Date", "Program") := list(datetime(date, V1), V2)][,
list(Date, Program, Synopsis)]
giving:
> dt
Date Program Synopsis
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
Update: Simplified (1) and added (2) and (3).
An alternative solution with data.table:
library(data.table)
library(zoo)
library(splitstackshape)
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
wd <- weekdays(as.Date("2015-11-01") + 0:6) # full weekday names in the current locale
DT <- DT[, temp := tv %chin% wd
         ][, day := ifelse(temp, tv, NA_character_)
         ][, day := na.locf(day)
         ][, temp := NULL
         ][, idx := rleid(day)
         ][, date := tv[2], by = idx
         ][, .SD[-c(1,2)], by = idx]
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
         ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & as.numeric(as.character(Time)) < 6) * 24 * 60 * 60)
         ][, .(datetime, Program, Info)]
Result:
> DT
datetime Program Info
1: 2015-11-01 06:00:00 Tom some information about the program
2: 2015-11-01 23:30:00 Jerry some information about the program
3: 2015-11-02 05:00:00 Avatar some information about the program
4: 2015-11-02 06:00:00 Tom some information about the program
5: 2015-11-02 23:30:00 Jerry some information about the program
6: 2015-11-03 05:00:00 Avatar some information about the program
Explanation:
1: Read the data, convert it to a data.table and remove the trailing |:
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
2: Extract the weekdays into a new column
wd <- weekdays(as.Date("2015-11-01") + 0:6) # a vector with the full weekday names
DT[, temp := tv %chin% wd
   ][, day := ifelse(temp, tv, NA_character_)
   ][, day := na.locf(day)
   ][, temp := NULL]
3: Create an index per day and a column containing the date
DT[, idx := rleid(day)][, date := tv[2], by = idx]
4: Remove the unnecessary rows
DT <- DT[, .SD[-c(1,2)], by = idx]
5: Split the time and the program name into separate rows and create a label column
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
6: Reshape to wide format using the rowid function (at the time of writing, only in the development version of data.table)
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
7: Create a datetime column and move the late-night times to the next day
DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
   ][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & as.numeric(as.character(Time)) < 6) * 24 * 60 * 60)]
8: Keep the needed columns
DT <- DT[, .(datetime, Program, Info)]
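Step 7 is where the broadcast-day rule is applied, so it may help to see it in isolation. Below is a base R sketch, under the assumption that times are written as h.m numbers (23.3 meaning 23:30) and that anything before 6 belongs to the following calendar day. hm_to_datetime and its day_start argument are hypothetical names introduced here, not part of any answer, and the time zone is fixed to UTC only to make the output reproducible.

```r
# Hypothetical helper: combine a date with an h.m number, rolling early-morning
# times over to the next calendar day.
hm_to_datetime <- function(date, hm, day_start = 6) {
  hm <- as.numeric(hm)
  hours <- floor(hm)
  minutes <- round(100 * (hm - hours))  # 23.3 -> 30, 6.25 -> 25
  stamp <- as.POSIXct(
    sprintf("%s %02d:%02d", format(date), as.integer(hours), as.integer(minutes)),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
  # Times before day_start belong to the next calendar day.
  stamp + 24 * 60 * 60 * (hm < day_start)
}

res <- hm_to_datetime(as.Date("2015-11-01"), c(6, 23.3, 5))
print(format(res, "%Y-%m-%d %H:%M"))
# "2015-11-01 06:00" "2015-11-01 23:30" "2015-11-02 05:00"
```

Testing the clock time against a fixed threshold is simpler than comparing each row with the previous one, but it only works if no program of the broadcast day starts before 06:00 on the same calendar day.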