From a messy character list to a matrix in R

Thanks in advance for your help. I have a large vector containing 2000 strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.

Data structure and desired result.

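Each row of this vector has a different "length" but follows the same pattern. The names inside "[]" determine the number of rows, and the comma-separated fields outside determine the number of columns. I will use these three rows as an example: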

[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium

The first row has 2 groups in "[]", each with 5 columns; the second row has 3 groups with 3, 4, and 5 columns; the third row has 3 groups with 4, 4, and 5 columns.

The result would be a matrix like this:

ID  Author                 Info01           Info02                     Info03                          Info04                 Info05
1   Sorce, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Greco, A.              Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Magistri, L.           Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DIME   I-16145 Genoa          Italy
1   Costamagna, P.         Univ Genoa       Polytech Sch               Thermochem Power Grp TPG DICCA  I-16145 Genoa          Italy
2   Allema, Bas            Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Hemerik, Lia           Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Rossing, Walter A. H.  Wageningen Univ  NL-6700 AP Wageningen      Netherlands                     N/A                    N/A
2   Allema, Bas            Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van Lenteren, Joop C.  Wageningen Univ  Entomol Lab                NL-6700 AP Wageningen           Netherlands            N/A
2   van der Werf, Wopke    Wageningen Univ  Ctr Crop Syst Anal         Crop & Weed Ecol Grp            NL-6700 AP Wageningen  Netherlands
3   Abdissa, Ketema        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Tadesse, Mulualem      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bezabih, Mesele        Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Bekele, Alemayehu      Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Abebe, Gemeda          Jimma Univ       Dept Med Lab Sci & Pathol  Jimma                           Ethiopia               N/A
3   Apers, Ludwig          Inst Trop Med    Dept Clin Sci              B-2000 Antwerp                  Belgium                N/A
3   Rigouts, Leen          Inst Trop Med    Dept Microbiol             Mycobacteriol Unit              B-2000 Antwerp         Belgium

My approach

I use this command to split the strings and convert the vector into a list:

library(stringr)
CL1 <- str_split(CL, "\\[|\\]", n = Inf)

This produces a list of character vectors like this:

[[1999]]
[1] ""                                                                                               
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"                                       
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"

[[2000]]
[1] ""                                                                                                        
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"                                                       
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "                     
[4] "Yuan, Kai-Tao"                                                                                           
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "         
[6] "Yu, Li"                                                                                                  
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "   
[8] "Yang, Ding-Hua"                                                                                          
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"

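As you can see, the first element of every vector in the list is empty. Each even-numbered element contains a "group" (the authors), and each following odd-numbered element contains the columns for that group.

The next step is to separate the groups so that I can assemble a matrix. For that I am using these two commands: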

CL2 <- lapply(CL1,function(x)x[2])

AF1 <- lapply(CL1,function(x)x[3])

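Because in some cases I have more than 50 groups in the same row, I basically have to repeat this process in a loop, but I don't know how, so for now I am doing it manually. Another problem is that I don't know how to create the ID or how to combine the lists into a matrix.

Any ideas or suggestions are welcome.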

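You can do this with a few regular-expression operations, using functions from the plyr and foreach packages to handle the rest. Here is an example for the first row: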

library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'

##split the string into different parts
##(the backslash must be doubled inside an R string literal)
s1 = strsplit(str1,'; \\[')
s1. = llply(s1,strsplit,split = ']')[[1]]

##get list of authors
auths = llply(s1.,function(x) gsub('^ ','',strsplit(gsub('\\[','',x[1]),';')[[1]]))
##get all other attributes
other.stuff = llply(s1.,function(x) gsub('^ ','',strsplit(x[2],',')[[1]]))

results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
 expand.grid(auth,other[1],other[2],other[3],other[4],other[5])

The column names of the output need to be changed, and you will need to repeat this for every row, but that should be easy.
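As a minimal sketch of "repeating this for every row", the same split logic can be wrapped in a function and mapped over the whole vector using base R only. The helper name `parse_row`, its `n_info` parameter, and the two-element `CL` vector are assumptions made for illustration; they are not part of the answer above. Groups with fewer than `n_info` fields are padded with NA, and groups with more fields would be truncated, so `n_info` should be set to the largest field count in the data.

```r
# Hypothetical helper: parse one address string into a data frame
# with one row per author.
parse_row <- function(id, s, n_info = 5) {
  groups <- strsplit(s, "; \\[")[[1]]             # one piece per "[authors] info" group
  pieces <- strsplit(groups, "]", fixed = TRUE)   # authors vs. affiliation text
  rows <- lapply(pieces, function(p) {
    authors <- trimws(strsplit(sub("^\\[", "", p[1]), ";")[[1]])
    info    <- trimws(strsplit(p[2], ",")[[1]])
    length(info) <- n_info                        # pad with NA (or truncate) to n_info
    info_mat <- matrix(info, nrow = length(authors), ncol = n_info, byrow = TRUE,
                       dimnames = list(NULL, sprintf("Info%02d", seq_len(n_info))))
    data.frame(ID = id, Author = authors, info_mat, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Small stand-in for the real 2000-element vector
CL <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium"
)

# The position in the vector serves as the ID
result <- do.call(rbind, Map(parse_row, seq_along(CL), CL))
```

Using the position in the vector as the ID avoids having to build the ID column separately, and `do.call(rbind, ...)` stacks the per-row data frames into a single table.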

The following should do what you want to achieve:

A <- read.csv("AU.csv", stringsAsFactors = FALSE)

## One vector with all of the data in square brackets
## (backslashes must be doubled inside R string literals)
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)

A1 <- gsub("\\[|\\]", "", unlist(A1))

## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1

## Drop the empty leading piece, then trim whitespace and trailing semicolons
A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))

## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
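To make the `x[-1]` and `lengths(A2) - 1` steps less mysterious, here is a small sketch on a single made-up string (the string `s` and the variable names `inside`, `outside`, and `cleaned` are assumptions for illustration). With `invert = TRUE`, `regmatches` returns the text between the matches, including the empty piece before the first "[", which is why one element is dropped.

```r
s <- "[Sorce, A.; Greco, A.] Univ Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Italy"
m <- gregexpr("\\[.*?\\]", s)

# The bracketed author groups
inside <- regmatches(s, m)[[1]]

# Everything between the matches; the first piece is the empty
# string that precedes the opening "["
outside <- regmatches(s, m, invert = TRUE)[[1]]

# Drop the empty first piece, then strip whitespace and trailing "; "
cleaned <- gsub("^\\s+|\\s+$|;\\s+$", "", outside[-1])
```

After the cleanup, `inside` and `cleaned` have the same length, which is the property the `all.equal(LA1, LA2)` check above verifies across the whole data set.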

Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:

library(data.table)       ## for data.table(); may not be attached by splitstackshape
library(splitstackshape)
library(magrittr)

## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)

## Here's the splitting....
final <- DT %>% 
  cSplit("A1", ";", "long") %>%  ## The first column is split and made long
  cSplit("A2", ",")              ## The second column is split and made wide

The result looks like this:

final
#          ID                      A1                                  A2_01                            A2_02
#     1:    1         Aalten, Pauline                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     2:    1 Ramakers, Inez H. G. B.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     3:    1         Rozendaal, Nico                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     4:    1     Verhey, Frans R. J.                Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
#     5:    1     Biessels, Geert Jan                   Univ Med Ctr Utrecht                      Dept Neurol
#    ---                                                                                                     
# 13949: 2000         Meng, Qing-Hong                       Guiyang Med Coll                     Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung                       Guiyang Med Coll                     Dept Immunol
# 13951: 2000           Yuan, Kai-Tao                       Sun Yat Sen Univ                Affiliated Hosp 1
# 13952: 2000                  Yu, Li Guangzhou First Municipal Peoples Hosp                     Dept Paediat
# 13953: 2000          Yang, Ding-Hua                      Southern Med Univ                    Nan Fang Hosp
#                          A2_03                 A2_04           A2_05           A2_06 A2_07 A2_08 A2_09 A2_10
#     1:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     2:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     3:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     4:   Alzheimer Ctr Limburg NL-6200 MD Maastricht     Netherlands              NA    NA    NA    NA    NA
#     5:                 Utrecht           Netherlands              NA              NA    NA    NA    NA    NA
#    ---                                                                                                      
# 13949:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13950:          Guiyang 550004       Guizhou Provinc Peoples R China              NA    NA    NA    NA    NA
# 13951:               Dept Surg      Guangzhou 510080       Guangdong Peoples R China    NA    NA    NA    NA
# 13952:        Guangzhou 510180             Guangdong Peoples R China              NA    NA    NA    NA    NA
# 13953: Dept Hepatobiliary Surg      Guangzhou 510515       Guangdong Peoples R China    NA    NA    NA    NA
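
To see what the two cSplit calls do in isolation, the sketch below applies them to a two-row toy table. The table `toy` and its contents are made up for illustration, and the example assumes the data.table and splitstackshape packages are installed. The "long" split turns each author into its own row, and the default wide split spreads the comma-separated fields across new columns, filling shorter rows with NA.

```r
library(data.table)
library(splitstackshape)

# Hypothetical toy input: one row per "[authors] info" group
toy <- data.table(
  ID = c(1, 1),
  A1 = c("Sorce, A.; Greco, A.", "Costamagna, P."),
  A2 = c("Univ Genoa, Italy", "Univ Genoa, Polytech Sch, Italy")
)

# Split the author column on ";" into additional rows
long <- cSplit(toy, "A1", ";", "long")

# Split the affiliation column on "," into additional columns
wide <- cSplit(long, "A2", ",")
```

Here `long` should have one row per author (three in total), and `wide` replaces the A2 column with one column per comma-separated field.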