How to strsplit and keep other variables in the new df

Question

Sorry for asking such a basic question. I have a df that looks like this:

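I want to split tx with data.frame(do.call("rbind", strsplit(as.character(df$tx), "\\s{2,}")), stringsAsFactors=FALSE). How can I keep Form in the new df? Also, when a row splits into fewer pieces than the others, how can I avoid the automatic filling of the missing columns?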

The sample df can be built with:

df<- structure(list(tx = c(" [1]          Timepoint                                       EGTMPT      Categorical select one (nominal) 51         Screening", 
" [2]          N/A : O ff-Study                                EGTNA       Categorical yes/no (dichotomous) 3", 
" [3]          Check if Not Done                               EGTMPTND    Categorical yes/no (dichotomous) 3", 
" [4]          Date Performed                                  ECGDT       Date                             11", 
" [5]          Time (24-hour format)                           ECGTM       Time                             5", 
" [6]          O verall ECG Interpretation                     ECGRES      Categorical select one (nominal) 37         Normal"
), Form = c("12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)", 
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)", 
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)", 
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)", 
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)", 
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)"
)), row.names = c(NA, 6L), class = "data.frame")

The output should look like this:

Update: how can I separate tx in a better way? My old code seems to produce errors. The sample data is:

df<-structure(list(tx = c("[6]          O verall ECG Interpretation                     ECGRES      Categorical select one (nominal) 37         Normal", 
"[7]          If A bnormal - Clinically Significant, describe ECGA BN     Text or A ny V alue              200", 
"[8]          PR Interval (ms)                                ECGPRIN     Number (continuous)              15", 
"[1]          Not Done                            PE2ND       Categorical yes/no (dichotomous) 3", 
"[2]          If Not Done, specify reason:        PE2NDR      Text or A ny V alue              200", 
"[4]          Start Date:                                  A ESTDTC    Date                             11", 
"[5]          End Date                                     A EENDTC    Date                             11", 
"[6]          O ngoing:                                    A EO NGO    Categorical yes/no (dichotomous) 3", 
"[7]          Seriousness Criteria: (check all that apply) A ESA E     Categorical select multiple      50", 
"[8]          Severity:                                    A ECTCA E   Categorical select one (nominal) 26         Grade 1 - Mild", 
"[2]          If Not Done, specify reason:                    CHMNO       Text or A ny V alue              200", 
"[6]          Laboratory ID (NO TE: If Lab ID is not present, CHMID       Categorical select one (nominal) 71         Christus Mother Frances Hospital Laboratory", 
"[1]          Has subject had any prior surgery related to the PSYN        Categorical yes/no (dichotomous) 3", 
"[2]          Cycle 1 O nly: If less than the expected number of EXBC1       Categorical select one (nominal) 3", 
"[3]          Cycle                                              EXBCYC      Categorical select one (nominal) 8          Cycle 1", 
"[4]          Dose (mg)                                          EXBDO S     Number (continuous)              15", 
"[5]          Frequency                                          EXBFRQ      Categorical select one (nominal) 3          BID", 
"[6]          Start Date                                         EXBSTDT     Date                             11", 
"[7]          Stop Date                                          EXBENDT     Date                             11", 
"[8]          Reason for End Date/Stopping                       EXBREA      Categorical select one (nominal) 36         Cycle Completed", 
"[9]          O ther Reason (specify)                            EXBREA S    Text or A ny V alue              200"
)), row.names = c(NA, -21L), class = c("tbl_df", "tbl", "data.frame"
))

The output I get is:

The part highlighted in yellow should be in x3. What should I do?

Answer 1

You can use splitstackshape::cSplit:

splitstackshape::cSplit(df, 'tx', sep = '\\s{2,}', fixed = FALSE)

#                                                                              Form tx_1
#1: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [1]
#2: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [2]
#3: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [3]
#4: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [4]
#5: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [5]
#6: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)  [6]

#                          tx_2     tx_3                                   tx_4      tx_5
#1:                   Timepoint   EGTMPT Categoricals{2,}elect one (nominal) 51 Screening
#2:            N/A : O ff-Study    EGTNA     Categorical yes/no (dichotomous) 3      <NA>
#3:           Check if Not Done EGTMPTND     Categorical yes/no (dichotomous) 3      <NA>
#4:              Date Performed    ECGDT                                   Date        11
#5:       Time (24-hour format)    ECGTM                                   Time         5
#6: O verall ECG Interpretation   ECGRES Categoricals{2,}elect one (nominal) 37    Normal

Or with tidyr::separate:

tidyr::separate(df, 'tx', paste0('col', 1:5), sep = '\\s{2,}', fill = 'right')
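
For comparison, here is a minimal base R sketch of the same idea, staying close to the strsplit approach from the question. It assumes the goal is to pad short rows with NA (instead of letting rbind recycle values) and to carry Form along unchanged. The two-row df and the names parts, padded, split_df and result are illustrative stand-ins, not part of the original data.

```r
# Shortened stand-in for the question's df (illustrative values only).
df <- data.frame(
  tx = c(" [1]   Timepoint   EGTMPT   Categorical select one (nominal) 51   Screening",
         " [2]   N/A : O ff-Study   EGTNA   Categorical yes/no (dichotomous) 3"),
  Form = c("12-Lead Electrocardiogram (EG)", "12-Lead Electrocardiogram (EG)"),
  stringsAsFactors = FALSE
)

# Split on runs of two or more whitespace characters.
parts <- strsplit(trimws(df$tx), "\\s{2,}")

# Pad every piece to the same length; extending a vector's length fills with NA,
# so rbind has nothing to recycle.
n_col  <- max(lengths(parts))
padded <- lapply(parts, function(p) { length(p) <- n_col; p })

split_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(split_df) <- paste0("col", seq_len(n_col))

# Keep the other variable(s) by binding them back column-wise.
result <- cbind(split_df, df["Form"])
result
```

Because the rows of split_df stay in the same order as df, binding df["Form"] back on is enough to keep the variable; the second row ends up with NA in col5 rather than a recycled value.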

Answer 2

You can use read.csv() after cleaning the strings with gsub.

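The inner gsub adds an extra space in front of the numeric column on the right, so that it becomes a field of its own; the outer gsub then turns each run of two or more whitespace characters into a comma, which is the default sep=',' of read.csv().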

dat_clean <- cbind(read.csv(text=gsub('\\s{2,}', ',', gsub('\\s+(\\d+)', '  \\1', trimws(df$tx))),
                            header=FALSE, na.strings=''), Form=df$Form)
dat_clean
# V1                          V2       V3                               V4 V5        V6
# 1 [1]                   Timepoint   EGTMPT Categorical select one (nominal) 51 Screening
# 2 [2]            N/A : O ff-Study    EGTNA Categorical yes/no (dichotomous)  3      <NA>
# 3 [3]           Check if Not Done EGTMPTND Categorical yes/no (dichotomous)  3      <NA>
# 4 [4]              Date Performed    ECGDT                             Date 11      <NA>
# 5 [5]       Time (24-hour format)    ECGTM                             Time  5      <NA>
# 6 [6] O verall ECG Interpretation   ECGRES Categorical select one (nominal) 37    Normal
# Form
# 1 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 2 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 3 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 4 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 5 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 6 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
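
To make the two gsub steps easier to follow, here is a small trace on a single line. The string x is a shortened, illustrative version of the first row, and step1, step2 and parsed are names introduced only for this sketch.

```r
# Shortened version of the first row (illustrative).
x <- "[1]   Timepoint   EGTMPT   Categorical select one (nominal) 51   Screening"

# Inner gsub: widen the whitespace before each number to two spaces,
# so "(nominal) 51" becomes "(nominal)  51".
step1 <- gsub("\\s+(\\d+)", "  \\1", x)

# Outer gsub: every run of two or more whitespace characters becomes a comma.
step2 <- gsub("\\s{2,}", ",", step1)
step2
# "[1],Timepoint,EGTMPT,Categorical select one (nominal),51,Screening"

# The cleaned line now parses as six comma-separated fields.
parsed <- read.csv(text = step2, header = FALSE, na.strings = "")
ncol(parsed)
# 6
```

One caveat with this approach: a field that itself contains a comma, as some rows in the question's update do, would be split into extra columns by read.csv(), so such data would need a different separator or quoting.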

It would be nicer if we could use read.fwf(), but it seems to be able to read only from a file.

Note that you may need to adjust the code slightly if columns on the left are missing in your data.
