Extracting strings of irregular length: unknown number of authors from citations

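I have downloaded 500 article citations from Web of Science into a text file. Only the author column (AU) has been read into R. The variable contains Author1 through AuthorN, separated by semicolons: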

Anselin, L; Fujita, M; Thisse, JF

I would like to extract Author1, Author2, Author3 ... AuthorN into separate columns. In my file there are at most 10 authors per record; in this sample there are at most 7:

    # Sample of data
    data <- c("Anselin, L; Varga, A; Acs, Z",
              "Acs, ZJ; Anselin, L; Varga, A",
              "Anselin, L",
              "Fujita, M; Thisse, JF",
              "Turner, RK; van den Bergh, JCJM; Soderqvist, T; Barendregt, A; van der Straaten, J; Maltby, E; van Ierland, EC",
              "Talen, E; Anselin, L",
              "Irwin, EG; Bockstael, NE",
              "Leggett, CG; Bockstael, NE",
              "Guimaraes, P; Figueiredo, O; Woodward, D",
              "Halpern, Benjamin S.; McLeod, Karen L.; Rosenberg, Andrew A.; Crowder, Larry B.")

I have tried several approaches:

    # Method 3 - Read table: not the same number of elements in each row
    Meth3 <- read.table(textConnection(data), sep=";", stringsAsFactors=FALSE)

    # Method 2 - Separate into different columns: repeats the names
    Meth2 <- do.call(rbind,
                     strsplit(gsub(";",
                                   "\1NONSENSESPLIT\2NONSENSESPLIT\3", data),
                              "NONSENSESPLIT"))

    # Method 5 - Split row entries, make an identifier and recombine them later: struggle to recombine
    Meth5 <- strsplit(data, ";")
    i <- 0
    id <- unlist( sapply( Meth5, function(r) rep(i<<-i+1, length(r) ) ) )
    x <- unlist(Meth5, recursive = FALSE )

    x <- list(do.call(rbind,
                      strsplit(gsub(";",
                                    "\1NONSENSESPLIT\2NONSENSESPLIT\3", x),
                               "NONSENSESPLIT")))
    require(data.table)
    data.table( ID=id, do.call(rbind,x))

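    # Method 6 - Identifies the first author (of the first record only)
    # Note: "\;" is not a valid escape in an R string, so the separator is written as "; "
    Meth6 <- gsub("[^a-zA-Z0-9 ]", "", strsplit(data, "; ")[[1]][[1]])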

Any suggestions on how to organize and identify Author1 ... AuthorN would be very welcome.

read.csv supports this directly. Its default fill = TRUE pads rows that have fewer fields with blank fields:

    read.csv(text=data,header=FALSE,sep=";")
                         V1                   V2                    V3                 V4                   V5         V6               V7
    1            Anselin, L             Varga, A                Acs, Z
    2               Acs, ZJ           Anselin, L              Varga, A
    3            Anselin, L
    4             Fujita, M           Thisse, JF
    5            Turner, RK  van den Bergh, JCJM         Soderqvist, T      Barendregt, A  van der Straaten, J  Maltby, E  van Ierland, EC
    6              Talen, E           Anselin, L
    7             Irwin, EG        Bockstael, NE
    8           Leggett, CG        Bockstael, NE
    9          Guimaraes, P        Figueiredo, O           Woodward, D
    10 Halpern, Benjamin S.     McLeod, Karen L.  Rosenberg, Andrew A.  Crowder, Larry B.
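
One caveat that may matter for the full 500-record file: unless col.names is supplied, read.csv works out the number of columns from the first five lines of input. If the record with the most authors appears later, its extra fields may be wrapped onto additional rows instead of getting their own columns. The sketch below is one possible way around that, not the only one. It assumes a small illustrative vector called `authors` (not the original data) whose longest record is deliberately placed sixth, counts the separators on every line, and passes explicit column names. The names `authors`, `n_max` and `authors_wide` are made up for this example.

```r
# Hypothetical sample: the longest record is the 6th line, i.e. beyond the
# first five lines that read.csv inspects when sizing the table.
authors <- c("Anselin, L; Varga, A; Acs, Z",
             "Acs, ZJ; Anselin, L; Varga, A",
             "Anselin, L",
             "Fujita, M; Thisse, JF",
             "Talen, E; Anselin, L",
             "Halpern, Benjamin S.; McLeod, Karen L.; Rosenberg, Andrew A.; Crowder, Larry B.")

# Largest number of ";"-separated authors found on any line
n_max <- max(lengths(strsplit(authors, ";", fixed = TRUE)))

# Supplying col.names forces read.csv to create n_max columns;
# strip.white = TRUE removes the space that follows each ";".
authors_wide <- read.csv(text = authors, header = FALSE, sep = ";",
                         col.names = paste0("Author", seq_len(n_max)),
                         strip.white = TRUE,
                         stringsAsFactors = FALSE)

print(authors_wide)
```

With the real data, the same two steps would be applied to the AU vector in place of `authors`. Records with fewer authors are still padded by fill = TRUE, which read.csv enables by default.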