From a messy character list to a matrix in R
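Thank you very much for your help. I have a large vector containing 2000 strings of different lengths, which I retrieved from Web of Science. My dataset can be downloaded here.
Data structure and desired result
Each row of this vector has a different "length" but follows the same pattern. The characters inside "[]" determine the number of rows, and the characters outside determine the number of columns. I will use these three rows as an example: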
[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy
[Allema, Bas; Hemerik, Lia; Rossing, Walter A. H.] Wageningen Univ, NL-6700 AP Wageningen, Netherlands; [Allema, Bas; van Lenteren, Joop C.] Wageningen Univ, Entomol Lab, NL-6700 AP Wageningen, Netherlands; [van der Werf, Wopke] Wageningen Univ, Ctr Crop Syst Anal, Crop & Weed Ecol Grp, NL-6700 AP Wageningen, Netherlands
[Abdissa, Ketema; Tadesse, Mulualem; Bezabih, Mesele; Bekele, Alemayehu; Abebe, Gemeda] Jimma Univ, Dept Med Lab Sci & Pathol, Jimma, Ethiopia; [Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium; [Rigouts, Leen] Inst Trop Med, Dept Microbiol, Mycobacteriol Unit, B-2000 Antwerp, Belgium
The first row has 2 groups in "[]", each with 5 columns; the second row has 3 groups, with 3, 4 and 5 columns; the third row has 3 groups, with 4, 4 and 5 columns.
The result would be a matrix like this:
ID | Author | Info01 | Info02 | Info03 | Info04 | Info05
1 | Sorce, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Greco, A. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Magistri, L. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DIME | I-16145 Genoa | Italy
1 | Costamagna, P. | Univ Genoa | Polytech Sch | Thermochem Power Grp TPG DICCA | I-16145 Genoa | Italy
2 | Allema, Bas | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Hemerik, Lia | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Rossing, Walter A. H. | Wageningen Univ | NL-6700 AP Wageningen | Netherlands | N/A | N/A
2 | Allema, Bas | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van Lenteren, Joop C. | Wageningen Univ | Entomol Lab | NL-6700 AP Wageningen | Netherlands | N/A
2 | van der Werf, Wopke | Wageningen Univ | Ctr Crop Syst Anal | Crop & Weed Ecol Grp | NL-6700 AP Wageningen | Netherlands
3 | Abdissa, Ketema | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Tadesse, Mulualem | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bezabih, Mesele | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Bekele, Alemayehu | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Abebe, Gemeda | Jimma Univ | Dept Med Lab Sci & Pathol | Jimma | Ethiopia | N/A
3 | Apers, Ludwig | Inst Trop Med | Dept Clin Sci | B-2000 Antwerp | Belgium | N/A
3 | Rigouts, Leen | Inst Trop Med | Dept Microbiol | Mycobacteriol Unit | B-2000 Antwerp | Belgium
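My approach
I use this command to split the strings and convert the vector into a list:
library(stringr)
CL1 <- str_split(CL, "\\[|\\]", n = Inf)
This produces a list of character vectors like this: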
[[1999]]
[1] ""
[2] "Zhuo, Hongying; Li, Qingzhong; Li, Wenzuo; Cheng, Jianbo"
[3] " Yantai Univ, Sch Chem & Chem Engn, Lab Theoret & Computat Chem, Yantai 264005, Peoples R China"
[[2000]]
[1] ""
[2] "Zuo, Li; Meng, Qing-Hong; Chung, Peter Chee-Keung"
[3] " Guiyang Med Coll, Dept Immunol, Guiyang 550004, Guizhou Provinc, Peoples R China; "
[4] "Yuan, Kai-Tao"
[5] " Sun Yat Sen Univ, Affiliated Hosp 1, Dept Surg, Guangzhou 510080, Guangdong, Peoples R China; "
[6] "Yu, Li"
[7] " Guangzhou First Municipal Peoples Hosp, Dept Paediat, Guangzhou 510180, Guangdong, Peoples R China; "
[8] "Yang, Ding-Hua"
[9] " Southern Med Univ, Nan Fang Hosp, Dept Hepatobiliary Surg, Guangzhou 510515, Guangdong, Peoples R China"
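As you can see, the first element of every vector in the list is empty. Each "even" element of a vector contains a "group", and each "odd" element contains the columns for that group.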
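The even/odd pattern described above can be picked out by position in one step rather than one index at a time. This is only a minimal sketch: it assumes base R's strsplit in place of str_split, and the string x is a shortened, illustrative version of the sample data.

```r
# Illustrative input in the "[authors] affiliation; [authors] affiliation" shape
x <- "[Sorce, A.; Greco, A.] Univ Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Italy"

# Split on "[" or "]"; the backslashes must be doubled inside an R string
parts <- strsplit(x, "\\[|\\]")[[1]]

# Element 1 is the empty string before the first "[";
# even positions hold the author groups, odd positions (3, 5, ...) the affiliations
groups <- parts[seq(2, length(parts), by = 2)]
affils <- parts[seq(3, length(parts), by = 2)]

# Strip leading whitespace and the "; " separator left at the end
affils <- gsub("^[[:space:]]+|[;[:space:]]+$", "", affils)

groups
affils
```

Because the positions are generated with seq(), the same lines work whether a row has one group or fifty.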
The next step is to separate the groups so that I can assemble a matrix; for that I am using these two commands.
CL2 <- lapply(CL1,function(x)x[2])
AF1 <- lapply(CL1,function(x)x[3])
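Because in some cases I have more than 50 groups in a single row, I basically need to repeat this process in a loop, but I don't know how, so for now I am doing it by hand. Another problem is that I don't know how to create the ID or how to merge the lists into a matrix.
Any ideas or suggestions are welcome.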
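You can do this with a few regular-expression operations, using functions from plyr and foreach to handle everything. Here is an example for the first row: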
library(foreach)
library(plyr)
str1 = '[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy'
## split the string into one piece per "[authors] affiliation" group
s1 = strsplit(str1, '; \\[')
s1. = llply(s1, strsplit, split = ']')[[1]]
## get list of authors
auths = llply(s1., function(x) gsub('^ ', '', strsplit(gsub('\\[', '', x[1]), ';')[[1]]))
## get all other attributes
other.stuff = llply(s1., function(x) gsub('^ ', '', strsplit(x[2], ',')[[1]]))
results = foreach(auth = auths, other = other.stuff, .combine = 'rbind') %do%
expand.grid(auth,other[1],other[2],other[3],other[4],other[5])
The column names of the output need to be changed, and you will need to repeat this for every row, but that should be easy.
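Repeating the idea for every row and attaching an ID could look roughly like the sketch below. It swaps plyr and foreach for base R, and the names parse_row and n_info are hypothetical helpers introduced here for illustration, not part of the answer above; n_info = 5 is an assumption taken from the desired output in the question.

```r
# Parse one "[authors] affiliation; ..." string into a character matrix
# with one row per author. n_info (hypothetical) is the number of Info
# columns; shorter affiliations are padded with NA.
parse_row <- function(x, id, n_info = 5) {
  parts  <- strsplit(x, "\\[|\\]")[[1]]
  groups <- parts[seq(2, length(parts), by = 2)]  # text inside "[]"
  affils <- parts[seq(3, length(parts), by = 2)]  # text after each "]"
  rows <- Map(function(g, a) {
    authors <- trimws(strsplit(g, ";")[[1]])
    info <- trimws(strsplit(sub(";\\s*$", "", a), ",")[[1]])
    length(info) <- n_info                        # pad with NA (or truncate)
    info_mat <- matrix(info, nrow = length(authors), ncol = n_info, byrow = TRUE,
                       dimnames = list(NULL, sprintf("Info%02d", seq_len(n_info))))
    cbind(ID = id, Author = authors, info_mat)
  }, groups, affils)
  do.call(rbind, unname(rows))
}

# Two illustrative rows taken from the sample data in the question
CL <- c(
  "[Sorce, A.; Greco, A.; Magistri, L.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DIME, I-16145 Genoa, Italy; [Costamagna, P.] Univ Genoa, Polytech Sch, Thermochem Power Grp TPG DICCA, I-16145 Genoa, Italy",
  "[Apers, Ludwig] Inst Trop Med, Dept Clin Sci, B-2000 Antwerp, Belgium"
)

# Apply to every row, using the row position as the ID, and stack the pieces
result <- do.call(rbind, unname(Map(parse_row, CL, seq_along(CL))))
result
```

Since every piece is padded to the same number of columns, rbind can stack them directly; for real data n_info would need to be at least the largest number of comma-separated fields.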
The following should do what you are trying to achieve:
A <- read.csv("AU.csv", stringsAsFactors = FALSE)
## One vector with all of the data in square brackets
A1 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]))
LA1 <- lengths(A1)
A1 <- gsub("\\[|\\]", "", unlist(A1))
## One vector with all of the other data
A2 <- regmatches(A[[2]], gregexpr("\\[.*?\\]", A[[2]]), invert = TRUE)
LA2 <- lengths(A2) - 1
A2 <- unlist(lapply(A2, function(x) gsub("^\\s+|\\s+$|;\\s+$", "", x[-1])))
## Checking for mistakes....
all.equal(LA1, LA2)
# [1] TRUE
all.equal(sum(LA1), length(A1))
# [1] TRUE
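To see why the two length vectors should agree, here is a small sketch of what regmatches returns with and without invert = TRUE. The vector x is made-up illustrative data, not taken from AU.csv.

```r
# Made-up strings in the same "[authors] affiliation" shape
x <- c("[A, a.; B, b.] Univ X, City, Country; [C, c.] Inst Y, Country",
       "[D, d.] Univ Z, Country")

# Non-greedy match of each "[...]" block
m <- gregexpr("\\[.*?\\]", x)

# The matches themselves: one element per bracketed group
inside <- regmatches(x, m)

# Everything between the matches; the first piece is the (empty)
# text before the first "[", hence the "- 1" and "x[-1]" in the answer
outside <- regmatches(x, m, invert = TRUE)

lengths(inside)
lengths(outside) - 1
```

Each string yields one more "outside" piece than "inside" matches, because the text before the first bracket counts as a piece even when it is empty.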
Now that we have the vectors, we can use cSplit from my "splitstackshape" package to get the output you want:
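library(splitstackshape)
library(data.table) ## for data.table(); attached explicitly in case splitstackshape does not attach it
library(magrittr)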
## Make a data.table of the two vectors and the ID column
DT <- data.table(ID = rep(A[[1]], LA1), A1, A2)
## Here's the splitting....
final <- DT %>%
cSplit("A1", ";", "long") %>% ## The first column is split and made long
cSplit("A2", ",") ## The second column is split and made wide
The result looks like this:
final
# ID A1 A2_01 A2_02
# 1: 1 Aalten, Pauline Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 2: 1 Ramakers, Inez H. G. B. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 3: 1 Rozendaal, Nico Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 4: 1 Verhey, Frans R. J. Maastricht Univ Med Ctr Sch Mental Hlth & Neurosci MHeNS
# 5: 1 Biessels, Geert Jan Univ Med Ctr Utrecht Dept Neurol
# ---
# 13949: 2000 Meng, Qing-Hong Guiyang Med Coll Dept Immunol
# 13950: 2000 Chung, Peter Chee-Keung Guiyang Med Coll Dept Immunol
# 13951: 2000 Yuan, Kai-Tao Sun Yat Sen Univ Affiliated Hosp 1
# 13952: 2000 Yu, Li Guangzhou First Municipal Peoples Hosp Dept Paediat
# 13953: 2000 Yang, Ding-Hua Southern Med Univ Nan Fang Hosp
# A2_03 A2_04 A2_05 A2_06 A2_07 A2_08 A2_09 A2_10
# 1: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 2: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 3: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 4: Alzheimer Ctr Limburg NL-6200 MD Maastricht Netherlands NA NA NA NA NA
# 5: Utrecht Netherlands NA NA NA NA NA NA
# ---
# 13949: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13950: Guiyang 550004 Guizhou Provinc Peoples R China NA NA NA NA NA
# 13951: Dept Surg Guangzhou 510080 Guangdong Peoples R China NA NA NA NA
# 13952: Guangzhou 510180 Guangdong Peoples R China NA NA NA NA NA
# 13953: Dept Hepatobiliary Surg Guangzhou 510515 Guangdong Peoples R China NA NA NA NA