Extracting strings of irregular length: unknown number of authors from citations
I downloaded 500 article citations from Web of Science as a text file. Only the author column (AU) has been read into R. The variable holds Author1 through AuthorN, separated by semicolons:
Anselin, L; Fujita, M; Thisse, JF
I would like to extract Author1, Author2, Author3 ... AuthorN into separate columns. My file has at most 10 authors per record; this sample has at most 7:
#Sample of Data
data <- c("Anselin, L; Varga, A; Acs, Z",
"Acs, ZJ; Anselin, L; Varga, A",
"Anselin, L",
"Fujita, M; Thisse, JF",
"Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A; van der Straaten, J; Maltby, E; van Ierland, EC",
"Talen, E; Anselin, L",
"Irwin, EG; Bockstael, NE",
"Leggett, CG; Bockstael, NE",
"Guimaraes, P; Figueiredo, O; Woodward, D",
"Halpern, Benjamin S.; McLeod, Karen L.; Rosenberg, Andrew A.; Crowder, Larry B.")
I have tried several approaches:
#Method3 - Read table : Not same amount of elements
Meth3 <- read.table(textConnection(data), sep=";", stringsAsFactors=FALSE)
#Method2 - Separate in different column : repeats the Names
Meth2 <- do.call(rbind,
strsplit(gsub(";",
"\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", data),
"NONSENSESPLIT"))
#Method5 - Split row entries, make an identifier and recombine them later : Struggle to recombine
Meth5 <- strsplit(data, ";")
i <- 0
id <- unlist( sapply( Meth5, function(r) rep(i<<-i+1, length(r) ) ) )
x <- unlist(Meth5, recursive = FALSE )
x <- list(do.call(rbind,
strsplit(gsub(";",
"\\1NONSENSESPLIT\\2NONSENSESPLIT\\3", x),
"NONSENSESPLIT")))
require(data.table)
data.table( ID=id, do.call(rbind,x))
#Method6: Identifies first Author :
Meth6 <- gsub("[^a-zA-Z0-9 ]","",strsplit(data,"\\; ")[[1]][[1]])
Any suggestions for organizing and identifying Author1 ... AuthorN are very welcome.
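read.csv supports this directly. Unlike read.table, it defaults to fill=TRUE, so rows with fewer fields are padded with empty values instead of raising an error: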
read.csv(text=data,header=FALSE,sep=";")
   V1                    V2                   V3                    V4                 V5                   V6         V7
1  Anselin, L            Varga, A             Acs, Z
2  Acs, ZJ               Anselin, L           Varga, A
3  Anselin, L
4  Fujita, M             Thisse, JF
5  Turner, RK            van den Bergh, JCJM  Soderqvist, T         Barendregt, A      van der Straaten, J  Maltby, E  van Ierland, EC
6  Talen, E              Anselin, L
7  Irwin, EG             Bockstael, NE
8  Leggett, CG           Bockstael, NE
9  Guimaraes, P          Figueiredo, O        Woodward, D
10 Halpern, Benjamin S.  McLeod, Karen L.     Rosenberg, Andrew A.  Crowder, Larry B.
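One caveat for the full 500-record file: read.table (and therefore read.csv) decides the number of columns from the first five lines of input unless col.names is longer. The sample works because its 7-author record is on line 5. A minimal sketch of a more defensive call follows, assuming the records are in a character vector; the names authors, n_max and author_table are illustrative and not from the original post.

# Shortened sample: the longest record is on line 6, outside the
# five lines that read.table() inspects to decide the column count.
authors <- c("Anselin, L; Varga, A",
             "Acs, ZJ",
             "Anselin, L",
             "Fujita, M; Thisse, JF",
             "Talen, E; Anselin, L",
             "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A")

# Count the fields in every record and size the table from the maximum.
con <- textConnection(authors)
n_fields <- count.fields(con, sep = ";", quote = "\"", comment.char = "")
close(con)
n_max <- max(n_fields)

author_table <- read.csv(text = authors, header = FALSE, sep = ";",
                         col.names = paste0("Author", seq_len(n_max)),
                         strip.white = TRUE,   # drop the space after each ";"
                         na.strings = "",      # empty slots become NA
                         stringsAsFactors = FALSE)

Passing col.names explicitly fixes the width up front, and strip.white=TRUE removes the leading space that otherwise stays attached to every author after the first.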
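The split-and-recombine idea from Method 5 in the question can also be finished in base R by padding each split record to a common length before binding the rows. This is only a sketch of that alternative, not part of the original answer; authors, parts, padded and author_matrix are illustrative names.

# Three records with 3, 1 and 2 authors.
authors <- c("Anselin, L; Varga, A; Acs, Z",
             "Anselin, L",
             "Fujita, M; Thisse, JF")

# Split each record on ";" together with any spaces around it.
parts <- strsplit(authors, " *; *")

# Pad every record with NA up to the longest one, then stack the rows.
n_max <- max(sapply(parts, length))
padded <- lapply(parts, function(p) c(p, rep(NA_character_, n_max - length(p))))
author_matrix <- do.call(rbind, padded)
colnames(author_matrix) <- paste0("Author", seq_len(n_max))

author_matrix

Without the padding step, do.call(rbind, ...) recycles the shorter vectors, which is why Method 2 repeats names.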