Parse colon-separated list into data.frame
This question is a follow-up to this one.
The following metadata.txt was generated with:
pdftk sample.pdf dump_data > metadata.txt
metadata.txt:
InfoBegin
InfoKey: ModDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: CreationDate
InfoValue: D:20170817080316Z00'00'
InfoBegin
InfoKey: Creator
InfoValue: Adobe Acrobat 7.0
InfoBegin
InfoKey: Producer
InfoValue: Mac OS X 10.9.5 Quartz PDFContext
PdfID0: 76cf9fd41f0778314abfec8b34d8388d
PdfID1: 76cf9fd41f0778314abfec8b34d8388d
NumberOfPages: 612
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 1
BookmarkPageNumber: 11
BookmarkBegin
BookmarkTitle: Preface
BookmarkLevel: 1
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Explanatory Note and Abbreviations Used
BookmarkLevel: 1
BookmarkPageNumber: 7
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 405 616
PageMediaDimensions: 405 616
I would like R to read the Table-of-Contents (TOC) information from metadata.txt into a data.frame, starting at the first BookmarkBegin and ending at the BookmarkPageNumber that comes immediately before PageMediaBegin.
The region of interest can be selected with the following code:
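library(stringi)
# Read the pdftk dump line by line (readLines opens and closes the file itself)
metadata <- readLines('metadata.txt')
# Indices of the first BookmarkBegin and the last BookmarkPageNumber
existing_toc <- c(min(grep('BookmarkBegin', metadata)), max(grep('BookmarkPageNumber', metadata)))
metadata_toc <- metadata[existing_toc[1]:existing_toc[2]]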
Removing the BookmarkBegin lines and splitting each remaining line at the first occurrence of ":"
toc_data=metadata_toc[-grep('BookmarkBegin', metadata_toc)]
toc_data_split=stri_split_fixed(toc_data, ": ", n=2)
gives me the following list:
[[1]]
[1] "BookmarkTitle" "Contents"
[[2]]
[1] "BookmarkLevel" "1"
[[3]]
[1] "BookmarkPageNumber" "11"
[[4]]
[1] "BookmarkTitle" "Preface "
[[5]]
[1] "BookmarkLevel" "1"
[[6]]
[1] "BookmarkPageNumber" "5"
[[7]]
[1] "BookmarkTitle"
[2] "Explanatory Note and Abbreviations Used "
[[8]]
[1] "BookmarkLevel" "1"
[[9]]
[1] "BookmarkPageNumber" "7"
How should I proceed from here to get a data.frame like this:
structure(list(BookmarkTitle = structure(c(1L, 3L, 2L), .Label = c("Contents",
"Explanatory Note and Abbreviations Used", "Preface"), class = "factor"),
BookmarkLevel = c(1, 1, 1), BookMarkPageNumber = c(11, 5,
7)), .Names = c("BookmarkTitle", "BookmarkLevel", "BookMarkPageNumber"
), row.names = c(NA, -3L), class = "data.frame")
BookmarkTitle BookmarkLevel
1 Contents 1
2 Preface 1
3 Explanatory Note and Abbreviations Used 1
BookMarkPageNumber
1 11
2 5
3 7
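One possible way to continue from the split list, sketched in base R. It assumes every bookmark record begins with a BookmarkTitle entry; the names keys, vals, record, fields and toc are illustrative and not part of the original code. The list is rebuilt by hand here so the snippet runs on its own.

```r
# Split list as produced by stri_split_fixed(toc_data, ": ", n = 2) in the question
toc_data_split <- list(
  c("BookmarkTitle", "Contents"), c("BookmarkLevel", "1"), c("BookmarkPageNumber", "11"),
  c("BookmarkTitle", "Preface "), c("BookmarkLevel", "1"), c("BookmarkPageNumber", "5"),
  c("BookmarkTitle", "Explanatory Note and Abbreviations Used "),
  c("BookmarkLevel", "1"), c("BookmarkPageNumber", "7")
)

# First element of each pair is the key, second is the value
keys <- vapply(toc_data_split, `[`, character(1), 1)
vals <- trimws(vapply(toc_data_split, `[`, character(1), 2))

# Assumption: every bookmark record starts with a BookmarkTitle entry
record <- cumsum(keys == "BookmarkTitle")

# Place each value at (record, key) in a character matrix
fields <- unique(keys)
m <- matrix(NA_character_, nrow = max(record), ncol = length(fields),
            dimnames = list(NULL, fields))
m[cbind(record, match(keys, fields))] <- vals

# Convert to a data.frame and let R guess the column types
toc <- as.data.frame(m, stringsAsFactors = FALSE)
toc[] <- lapply(toc, type.convert, as.is = TRUE)
toc
```

Because the record number comes from the keys rather than from a fixed stride, a record with a missing field would simply get NA in that column.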
This code should convert metadata_toc into the desired data frame format.
(Edit: updated the code to handle the case where the BookmarkTitle value itself contains a ":".)
library(tidyverse)
library(stringi)
df <- data.frame(txt = metadata_toc, stringsAsFactors = FALSE) %>%
  filter(txt != 'BookmarkBegin') %>% # drop the 'BookmarkBegin' marker lines
  # split 'txt' on the first occurrence of ': ' into two new columns
  rowwise() %>%
  mutate(txt_1 = stri_split_fixed(txt, ': ', n = 2)[[1]][1],
         txt_2 = stri_split_fixed(txt, ': ', n = 2)[[1]][2]) %>%
  select(-txt) %>%
  ungroup() %>%
  # 'row_num' tells 'spread' that every 3 consecutive rows belong to one output row;
  # with the 9 rows of the sample data, rep(1:3, each = 3) gives 1 1 1 2 2 2 3 3 3
  mutate(row_num = rep(1:(n()/3), each = 3)) %>%
  spread(txt_1, txt_2) %>% # convert data to wide format
  select(c("BookmarkTitle", "BookmarkLevel", "BookmarkPageNumber"))
df
The output is:
BookmarkTitle BookmarkLevel BookmarkPageNumber
1 Contents 1 11
2 "Preface " 1 5
3 "Explanatory Note: Abbreviations Used " 1 7
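The row_num step above relies on every bookmark having exactly three lines. A hedged variant derives the record number from the BookmarkBegin markers instead. It assumes dplyr (1.0 or later) and tidyr (1.0 or later) are installed; the column names record, key and value are made up for this sketch.

```r
library(dplyr)
library(tidyr)

metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1",
                  "BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ",
                  "BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin",
                  "BookmarkTitle: Explanatory Note: Abbreviations Used ",
                  "BookmarkLevel: 1", "BookmarkPageNumber: 7")

toc <- tibble(txt = metadata_toc) %>%
  # a new record starts at every 'BookmarkBegin' line
  mutate(record = cumsum(txt == "BookmarkBegin")) %>%
  filter(txt != "BookmarkBegin") %>%
  # split on the first ': ' only; extra = "merge" keeps later colons in the value
  separate(txt, into = c("key", "value"), sep = ": ", extra = "merge") %>%
  mutate(value = trimws(value)) %>%
  # one row per record, one column per key
  pivot_wider(names_from = key, values_from = value) %>%
  select(-record) %>%
  mutate(across(c(BookmarkLevel, BookmarkPageNumber), as.integer))

toc
```

Grouping by the marker lines rather than by position means the reshaping should still line up if some bookmark carries an extra or missing field.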
Sample data:
metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1",
"BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ",
"BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin",
"BookmarkTitle: Explanatory Note: Abbreviations Used ", "BookmarkLevel: 1",
"BookmarkPageNumber: 7")
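This base R solution converts metadata_toc into a data frame. First, replace every line that has no colon with an empty line. The text is then in Debian control file (DCF) format, so it can be read with read.dcf. Convert the resulting matrix m to a data frame DF, then convert the column types to character and numeric.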
metadata_toc[grep(":", metadata_toc, invert = TRUE)] <- ""
m <- read.dcf(textConnection(metadata_toc))
DF <- as.data.frame(m, stringsAsFactors = FALSE)
DF[] <- lapply(DF, type.convert, as.is = TRUE)
giving:
> DF
BookmarkTitle BookmarkLevel BookmarkPageNumber
1 Contents 1 11
2 Preface 1 5
3 Explanatory Note and Abbreviations Used 1 7
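As a sketch of how the same read.dcf approach might be applied directly to the lines of the pdftk dump, without computing start and end indices first: keep only the lines that start with Bookmark, then blank out the BookmarkBegin markers. The shortened dump and the names dump, toc_lines, con and toc are illustrative assumptions.

```r
# Shortened stand-in for readLines('metadata.txt')
dump <- c(
  "InfoBegin", "InfoKey: Creator", "InfoValue: Adobe Acrobat 7.0",
  "NumberOfPages: 612",
  "BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1", "BookmarkPageNumber: 11",
  "BookmarkBegin", "BookmarkTitle: Preface ", "BookmarkLevel: 1", "BookmarkPageNumber: 5",
  "PageMediaBegin", "PageMediaNumber: 1", "PageMediaRotation: 0"
)

# Keep only the bookmark lines; this assumes no other key starts with 'Bookmark'
toc_lines <- grep("^Bookmark", dump, value = TRUE)

# Turn each record marker into a blank line so the text becomes DCF
toc_lines[toc_lines == "BookmarkBegin"] <- ""

# Read the records and close the text connection explicitly
con <- textConnection(toc_lines)
toc <- as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
close(con)

# Convert the numeric-looking columns; titles stay character
toc[] <- lapply(toc, type.convert, as.is = TRUE)
toc
```

Filtering on the key prefix avoids depending on the bookmark block being contiguous, at the cost of assuming the prefix is unique to bookmark fields.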
Note: the metadata_toc input used in this answer is:
metadata_toc <- c("BookmarkBegin", "BookmarkTitle: Contents", "BookmarkLevel: 1",
"BookmarkPageNumber: 11", "BookmarkBegin", "BookmarkTitle: Preface ",
"BookmarkLevel: 1", "BookmarkPageNumber: 5", "BookmarkBegin",
"BookmarkTitle: Explanatory Note and Abbreviations Used ", "BookmarkLevel: 1",
"BookmarkPageNumber: 7")