Feature extraction with R
My question is about feature extraction.
I would like to build a data frame from my text.
My data is:
text <- c("#*TeX: The Program",
"#@Donald E. Knuth",
"#t1986",
"#c",
"#index68",
"",
"#*Foundations of Databases.",
"#@Serge Abiteboul,Richard Hull,Victor Vianu",
"#t1995",
"#c",
"#index69",
"#%1118192",
"#%189",
"#%1088975",
"#%971271",
"#%832272",
"#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.")
My expected output is:
expected <- data.frame(title=c("#*TeX: The Program", "#*Foundations of Databases."), authors=c("#@Donald E. Knuth", "#@Serge Abiteboul,Richard Hull,Victor Vianu"), year=c("#t1986", "#t1995"), revue=c("#c", "#c"), id_paper=c("#index68", "#index69"),
id_ref=c(NA,"#%1118192, #%189, #%1088975, #%971271, #%832272"), abstract=c(NA, "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+."))
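Thanks in advance for your answer or any other suggestions.
I think I recognize the pattern, though you may need to confirm one or two of my assumptions. (Your method for reading in the data might remove the need for these assumptions, but I'm not sure.)
First, I'll make a "map" of regex patterns to column names: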
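# Each name becomes a column; each value is the regex that selects its lines.
# The backslash must be doubled inside an R string literal.
patterns <- c(title = "^#\\*", author = "^#@",
              year = "^#t", revue = "^#c",
              id_paper = "^#index", abstract = "^#%",
              mismatch = "^([^#]|#[^*@%tci])")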
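To make the map more concrete, here is a small sketch that reports which pattern names match a few lines. The sample_lines vector is made up for illustration (shortened from the question's data), and the lapply/vapply wrapper is just one way to probe the map, not part of the approach itself.

```r
# The pattern map from above; note the doubled backslash that R strings need.
patterns <- c(title = "^#\\*", author = "^#@",
              year = "^#t", revue = "^#c",
              id_paper = "^#index", abstract = "^#%",
              mismatch = "^([^#]|#[^*@%tci])")

# Hypothetical sample lines, one per interesting case.
sample_lines <- c("#*TeX: The Program", "#t1986", "#%189", "#!From the Book", "")

# For each line, collect the names of all patterns that match it.
matched <- lapply(sample_lines, function(line) {
  names(patterns)[vapply(patterns, grepl, logical(1), x = line)]
})
str(matched)
```

The "#!" line only matches mismatch because the map has no pattern for it, and the empty string matches nothing at all, so blank separator lines are silently dropped.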
Now, assuming the title is always the first line in each sequence of fields, I'll split the vector into a list of per-title vectors:
titles <- split(text, cumsum(grepl("^#\\*", text)))
str(titles)
# List of 2
# $ 1: chr [1:6] "#*TeX: The Program" "#@Donald E. Knuth" "#t1986" "#c" ...
# $ 2: chr [1:11] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c" ...
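In case the cumsum(grepl(...)) idiom is unfamiliar, here is a minimal sketch on a toy vector (x and its contents are made up for illustration). Each title line bumps a running counter, and every following line inherits that counter as its group id.

```r
# Toy input (hypothetical): two records, each starting with a "#*" title line.
x <- c("#*A", "#@a1", "#t2000", "", "#*B", "#@b1")

is_title <- grepl("^#\\*", x)  # TRUE exactly where a new record starts
group_id <- cumsum(is_title)   # running count of titles seen so far
print(group_id)
# [1] 1 1 1 1 2 2

# split() then gathers the lines that share a group id.
records <- split(x, group_id)
str(records)
```

Any lines that appear before the first title would get group id 0 and end up in a record of their own.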
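Now a quick helper function:
standardize_title <- function(x) {
  # For each pattern, keep the matching lines and join them into one string.
  o <- lapply(patterns, function(ptn) paste(x[grepl(ptn, x)], collapse = ", "))
  # A pattern with no matching lines yields "", which we turn into NA.
  o[nchar(o) == 0] <- NA_character_
  o
}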
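The NA step in the helper is needed because of how paste() treats an empty selection. A minimal sketch, using a made-up two-line record:

```r
# Hypothetical record with no "#%" reference lines.
x <- c("#*TeX: The Program", "#@Donald E. Knuth")

hits <- x[grepl("^#%", x)]              # nothing matches: character(0)
joined <- paste(hits, collapse = ", ")  # collapsing nothing gives ""
is_empty <- nchar(joined) == 0          # TRUE, so the helper stores NA instead

print(list(hits = hits, joined = joined, is_empty = is_empty))
```

Without that replacement, missing fields would show up as empty strings rather than NA in the final data frame.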
Now apply that function to each element of titles:
do.call(rbind.data.frame, c(stringsAsFactors=FALSE, lapply(titles, standardize_title)))
# title author year revue id_paper abstract mismatch
# 1 #*TeX: The Program #@Donald E. Knuth #t1986 #c #index68 <NA> <NA>
# 2 #*Foundations of Databases. #@Serge Abiteboul,Richard Hull,Victor Vianu #t1995 #c #index69 #%1118192, #%189, #%1088975, #%971271, #%832272 #!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.
(You can also use dplyr::bind_rows or data.table::rbindlist in place of do.call(rbind.data.frame, ...).)
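The biggest assumption is that the title is always the first field in each record. If that isn't true, you'll get incorrect results, with no warning or error. Note too that, with the map as written, the "#%" reference lines land in the column named abstract and the "#!" line lands in mismatch, so the names need adjusting if you want the id_ref and abstract columns from your expected output.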
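Since a violated assumption fails silently, a cheap sanity check before building the data frame may be worthwhile. The following is only a sketch: check_records is a hypothetical helper, and it assumes every record should contain exactly one "#index" line, which holds for the sample data but may not hold for yours.

```r
# Hypothetical helper: look for signs that titles are not the first field.
check_records <- function(text) {
  grp <- cumsum(grepl("^#\\*", text))                 # same grouping as split()
  n_index <- tapply(grepl("^#index", text), grp, sum) # "#index" lines per group
  list(
    orphans = sum(grp == 0),                  # lines before the first title
    bad_groups = names(n_index)[n_index != 1] # groups without exactly one id
  )
}

# Well-formed toy input: title first in each record.
good <- c("#*T1", "#@A", "#index1", "#*T2", "#@B", "#index2")
# Malformed toy input: fields come before their title.
bad <- c("#@A", "#index1", "#*T1", "#@B", "#index2", "#*T2")

res_good <- check_records(good)
res_bad <- check_records(bad)
str(res_good)
str(res_bad)
```

On the malformed input, the two leading lines are reported as orphans and the last group is flagged for having no "#index" line, which is the kind of symptom the silent failure would otherwise hide.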