R: how to import a list with a different number of columns into a data frame

Question

I am trying to run some scientometric analyses on a Scopus CSV file. The first lines of the imported CSV look like this:

Authors,Title,Year,Source title,Volume,Issue,Art. No.,Page start,Page end,Page count,Cited by,DOI,Link,Document Type,Source,EID
"Kuck, L.S., Noreña, C.P.Z.","Microencapsulation of grape (Vitis labrusca var. Bordo) skin phenolic extract using gum Arabic, polydextrose, and partially hydrolyzed guar gum as encapsulating agents",2016,"Food Chemistry","194",,,"569","576",,,10.1016/j.foodchem.2015.08.066,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940212199&partnerID=40&md5=e4c36e03156570a7fe31c2937b3a170d",Article,Scopus,2-s2.0-84940212199
"Grasel, F.D.S., Ferrão, M.F., Wolf, C.R.","Development of methodology for identification the nature of the polyphenolic extracts by FTIR associated with multivariate analysis",2016,"Spectrochimica Acta - Part A: Molecular and Biomolecular Spectroscopy","153",,,"94","101",,,10.1016/j.saa.2015.08.020,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939865445&partnerID=40&md5=8239487f4eea9479d698792e6aa348de",Article,Scopus,2-s2.0-84939865445
"De Souza, D., Sbardelotto, A.F., Ziegler, D.R., Marczak, L.D.F., Tessaro, I.C.","Characterization of rice starch and protein obtained by a fast alkaline extraction method",2016,"Food Chemistry","191",, 17279,"36","44",,,10.1016/j.foodchem.2015.03.032,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84938952690&partnerID=40&md5=989cbfcc72286a87f726925732db4b49",Article,Scopus,2-s2.0-84938952690
"Filho, P.R.M., Vercelino, R., Cioato, S.G., Medeiros, L.F., de Oliveira, C., Scarabelot, V.L., Souza, A., Rozisky, J.R., Quevedo, A.S., Adachi, L.N.S., Sanches, P.R.S., Fregni, F., Caumo, W., Torres, I.L.S.","Transcranial direct current stimulation (tDCS) reverts behavioral alterations and brainstem BDNF level increase induced by neuropathic pain model: Long-lasting effect",2016,"Progress in Neuro-Psychopharmacology and Biological Psychiatry","64",,,"44","51",,,10.1016/j.pnpbp.2015.06.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84937468588&partnerID=40&md5=b03f0ccfbf66a49a438c9983cc2e8f9d",Article,Scopus,2-s2.0-84937468588
"Duarte, Á.T., Borges, A.R., Zmozinski, A.V., Dessuy, M.B., Welz, B., De Andrade, J.B., Vale, M.G.R.","Determination of lead in biomass and products of the pyrolysis process by direct solid or liquid sample analysis using HR-CS GF AAS",2016,"Talanta","146",,,"166","174",,,10.1016/j.talanta.2015.08.041,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940416990&partnerID=40&md5=55d7ddad27e955b9b6e269469e29c8c3",Article,Scopus,2-s2.0-84940416990
"Francischini, H., Paes Neto, V.D., Martinelli, A.G., Pereira, V.P., Marinho, T.S., Teixeira, V.P.A., Ferraz, M.L.F., Soares, M.B., Schultz, C.L.","Invertebrate traces in pseudo-coprolites from the upper Cretaceous Marília Formation (Bauru Group), Minas Gerais State, Brazil",2016,"Cretaceous Research","57",,,"29","39",,,10.1016/j.cretres.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939175950&partnerID=40&md5=b049de15a08ba477cc189d7e8fe7f0a3",Article,Scopus,2-s2.0-84939175950
"Bonfatti, B.R., Hartemink, A.E., Giasson, E., Tornquist, C.G., Adhikari, K.","Digital mapping of soil carbon in a viticultural region of Southern Brazil",2016,"Geoderma","261",,,"204","221",,,10.1016/j.geoderma.2015.07.016,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939499978&partnerID=40&md5=b470166e01648dcbe8f0d43be86c84e0",Article,Scopus,2-s2.0-84939499978
"Scaramuzza dos Santos, T.C., Holanda, E.C., de Souza, V., Guerra-Sommer, M., Manfroi, J., Uhl, D., Jasper, A.","Evidence of palaeo-wildfire from the upper Lower Cretaceous (Serra do Tucano Formation, Aptian-Albian) of Roraima (North Brazil)",2016,"Cretaceous Research","57",,,"46","49",,,10.1016/j.cretres.2015.08.003,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84939615367&partnerID=40&md5=e59f5130c6a2e1863f9aa77c960e6462",Article,Scopus,2-s2.0-84939615367
"da Silva, S.W., Bortolozzi, J.P., Banús, E.D., Bernardes, A.M., Ulla, M.A.","TiO<inf>2</inf> thick films supported on stainless steel foams and their photoactivity in the nonylphenol ethoxylate mineralization",2016,"Chemical Engineering Journal","283",, 14049,"1264","1272",,,10.1016/j.cej.2015.08.057,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940747062&partnerID=40&md5=aebc7357f9dedaadeebabfeda4aa3dd9",Article,Scopus,2-s2.0-84940747062
"Dalmora, A.C., Ramos, C.G., Oliveira, M.L.S., Teixeira, E.C., Kautzmann, R.M., Taffarel, S.R., de Brum, I.A.S., Silva, L.F.O.","Chemical characterization, nano-particle mineralogy and particle size distribution of basalt dust wastes",2016,"Science of the Total Environment","539",, 18331,"560","565",,,10.1016/j.scitotenv.2015.08.141,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84941754626&partnerID=40&md5=1c4ca1a3486ff55f92f238083af3eb50",Article,Scopus,2-s2.0-84941754626
"Fink, J.R., Inda, A.V., Bavaresco, J., Barrón, V., Torrent, J., Bayer, C.","Adsorption and desorption of phosphorus in subtropical soils as affected by management system and mineralogy",2016,"Soil and Tillage Research","155",,,"62","68",,,10.1016/j.still.2015.07.017,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940195225&partnerID=40&md5=2e43a874f1e36f11aa5efa057ce660b9",Article,Scopus,2-s2.0-84940195225
"Martins, A.B., Santana, R.M.C.","Effect of carboxylic acids as compatibilizer agent on mechanical properties of thermoplastic starch and polypropylene blends",2016,"Carbohydrate Polymers","135",,,"79","85",,,10.1016/j.carbpol.2015.08.074,"http://www.scopus.com/inward/record.url?eid=2-s2.0-84940781718&partnerID=40&md5=426e62c6c0de33a91bdb2f75442fbd6f",Article,Scopus,2-s2.0-84940781718

The number of authors varies from row to row (up to more than 20). So far I have been doing something like this:

test <- read.csv("test.csv")
test$Authors <- as.character(test$Authors)
test2 <- strsplit(as.character(test$Authors), '.,', fixed=TRUE)

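This gives me a list in which each author is separated correctly. I tried several of the suggested alternatives for moving the list into a data frame, but the closest I got was: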

library(stringr)
test3 <- str_split_fixed(test$Authors, '.,', n = 20)

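This gives me two problems:

1) I have to specify the number of columns, which I do not know before analyzing the data; 2) the authors are not separated correctly: surnames and initials end up in different columns. The command also removes some characters from the names.

Some strategies suggested elsewhere let me separate the authors into columns correctly, but the empty columns were then filled by repeating the first names. Sorry if the question is too naive; I have only just started using R.

Any suggestions and/or insights would be greatly appreciated!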

Answer 1

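Here is how I would do it.

First, using read.csv causes the authors' surnames and initials to be split apart, so I use readLines instead.

Second, keeping data in "wide" form like this is usually not a good idea. It makes the data harder to work with in later analysis. So I convert it to "long" form instead.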

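n1 <- readLines(con = "test.csv")
# Split each line on the literal string "., " (no regex)
n1 <- strsplit(n1, "., ", fixed = TRUE)
# One row per author: name, publication index, position in the author list
n1 <- do.call(rbind, lapply(seq_along(n1), function(x) {
  data.frame(aut = n1[[x]], pub = x, order = seq_along(n1[[x]]))
}))
n1$aut <- gsub("\\.$", "", n1$aut)  # drop the trailing period of the last author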

Here is the output:

               aut pub order
1        Kuck, L.S   1     1
2    Noreña, C.P.Z   1     2
3    Grasel, F.D.S   2     1
4      Ferrão, M.F   2     2
5        Wolf, C.R   2     3
6       Abreu, M.S   3     1
7 Giacomini, A.C.V   3     2
8         Gusso, D   3     3
9      Rosa, J.G.S   3     4
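
The fixed = TRUE argument in the strsplit call above matters. Without it, the pattern is treated as a regular expression, where "." matches any character, so every comma is a split point and the character before it is consumed. The sketch below uses a made-up two-author string (not read from the Scopus file) to show the difference in base R. stringr functions such as str_split_fixed also interpret the pattern as a regular expression by default, which would be consistent with the split names and missing characters described in the question; wrapping the pattern in stringr::fixed() is the usual way to request a literal match there.

```r
# Toy author string in the Scopus style (illustrative data, not read from a file)
authors <- "Kuck, L.S., Wolf, C.R."

# Default: the pattern is a regular expression, so "." matches ANY character.
# Every "<char>," is a split point and the matched character is discarded.
regex_split <- strsplit(authors, ".,")[[1]]

# fixed = TRUE: the pattern "., " is matched literally (period, comma, space)
fixed_split <- strsplit(authors, "., ", fixed = TRUE)[[1]]

print(regex_split)
print(fixed_split)
```

In the regex version the surname and the initials land in separate pieces and the last letter of each surname is lost ("Kuc" instead of "Kuck"); the literal version returns one piece per author.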

Note that if you really want the data in "wide" format, we can easily reshape it:

library(tidyr)
spread(n1, order, aut)

  pub             1                2         3           4
1   1     Kuck, L.S    Noreña, C.P.Z      <NA>        <NA>
2   2 Grasel, F.D.S      Ferrão, M.F Wolf, C.R        <NA>
3   3    Abreu, M.S Giacomini, A.C.V  Gusso, D Rosa, J.G.S
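
If installing tidyr is not an option, the same long-to-wide step can be sketched with base R's reshape(). This is an alternative to the spread() call above, not the method used in this answer, and the small data frame below is toy data that only mimics the aut / pub / order columns. As with spread(), the number of author columns does not have to be known in advance: it follows from the largest value of order.

```r
# Toy long-format data mirroring the aut / pub / order columns above
long <- data.frame(
  aut   = c("Kuck, L.S", "Norena, C.P.Z", "Grasel, F.D.S", "Ferrao, M.F", "Wolf, C.R"),
  pub   = c(1, 1, 2, 2, 2),
  order = c(1, 2, 1, 2, 3),
  stringsAsFactors = FALSE
)

# One row per publication; one aut.<order> column per author position.
# Publications with fewer authors get NA in the unused columns.
wide <- reshape(long, idvar = "pub", timevar = "order", direction = "wide")
print(wide)
```

The resulting columns are named aut.1, aut.2, and so on, rather than the bare numbers that spread() produces.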

Edit: for the full file, you need to use read.csv:

# Keep an untouched copy of the full table in `input` so it can be joined back later
input <- n1 <- read.csv("test.csv")
n1$Authors <- as.character(n1$Authors)
n1$Authors <- strsplit(n1$Authors, "., ", fixed = TRUE)
n1 <- do.call(rbind, lapply(seq_along(n1$Authors), function(x) {
  data.frame(aut = n1$Authors[[x]], pub = x, order = seq_along(n1$Authors[[x]]))
}))
n1$aut <- gsub("\\.$", "", n1$aut)

If you want to go back to wide format and keep all your other columns:

library(dplyr)
library(tidyr)
input <- mutate(input, row = row_number())
n1 %>% spread(order, aut) %>%
       left_join(input, by = c("pub" = "row")) %>%
       select(-Authors)
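
To illustrate why the long format tends to be easier for later analysis, here is a minimal base R sketch of a few typical scientometric summaries. The data frame is toy data with the same aut / pub / order columns as n1; the variable names (pubs_per_author, authors_per_pub, first_authors) are made up for this example.

```r
# Toy long-format data: two publications, "Kuck, L.S" appears on both
long <- data.frame(
  aut   = c("Kuck, L.S", "Wolf, C.R", "Kuck, L.S", "Gusso, D"),
  pub   = c(1, 1, 2, 2),
  order = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Number of publications per author, most prolific first
pubs_per_author <- sort(table(long$aut), decreasing = TRUE)

# Number of authors on each publication (largest author position per pub)
authors_per_pub <- tapply(long$order, long$pub, max)

# First authors only
first_authors <- long$aut[long$order == 1]

print(pubs_per_author)
print(authors_per_pub)
print(first_authors)
```

Each of these would need a loop over an unknown number of author columns if the data were kept wide.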
