如何strsplit并在新的df中保留其他变量
how to strsplit and keep other variables in the new df
抱歉问了这么愚蠢的问题。我有一个看起来像这样的 df:
我想用 data.frame(do.call("rbind", strsplit(as.character(df$tx), "\s{2,}" )), stringsAsFactors=FALSE)
拆分 tx,如何在新的 df 中保留 Form
?另外,如果拆分结果为空,如何避免自动填充?
示例 df 可以使用以下方法构建:
df<- structure(list(tx = c(" [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening",
" [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3",
" [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3",
" [4] Date Performed ECGDT Date 11",
" [5] Time (24-hour format) ECGTM Time 5",
" [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal"
), Form = c("12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)"
)), row.names = c(NA, 6L), class = "data.frame")
输出将如下所示:
更新:
我应该如何以更好的方式分离 tx。我的旧代码似乎会产生错误。示例数据为:
df<-structure(list(tx = c("[6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal",
"[7] If A bnormal - Clinically Significant, describe ECGA BN Text or A ny V alue 200",
"[8] PR Interval (ms) ECGPRIN Number (continuous) 15",
"[1] Not Done PE2ND Categorical yes/no (dichotomous) 3",
"[2] If Not Done, specify reason: PE2NDR Text or A ny V alue 200",
"[4] Start Date: A ESTDTC Date 11",
"[5] End Date A EENDTC Date 11",
"[6] O ngoing: A EO NGO Categorical yes/no (dichotomous) 3",
"[7] Seriousness Criteria: (check all that apply) A ESA E Categorical select multiple 50",
"[8] Severity: A ECTCA E Categorical select one (nominal) 26 Grade 1 - Mild",
"[2] If Not Done, specify reason: CHMNO Text or A ny V alue 200",
"[6] Laboratory ID (NO TE: If Lab ID is not present, CHMID Categorical select one (nominal) 71 Christus Mother Frances Hospital Laboratory",
"[1] Has subject had any prior surgery related to the PSYN Categorical yes/no (dichotomous) 3",
"[2] Cycle 1 O nly: If less than the expected number of EXBC1 Categorical select one (nominal) 3",
"[3] Cycle EXBCYC Categorical select one (nominal) 8 Cycle 1",
"[4] Dose (mg) EXBDO S Number (continuous) 15",
"[5] Frequency EXBFRQ Categorical select one (nominal) 3 BID",
"[6] Start Date EXBSTDT Date 11",
"[7] Stop Date EXBENDT Date 11",
"[8] Reason for End Date/Stopping EXBREA Categorical select one (nominal) 36 Cycle Completed",
"[9] O ther Reason (specify) EXBREA S Text or A ny V alue 200"
)), row.names = c(NA, -21L), class = c("tbl_df", "tbl", "data.frame"
))
我得到的输出是:
黄色部分应该在 x3 中。我该怎么办?
您可以使用 splitstackshape::cSplit
:
splitstackshape::cSplit(df, 'tx', sep = '\s{2,}', fixed = FALSE)
# Form tx_1
#1: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [1]
#2: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [2]
#3: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [3]
#4: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [4]
#5: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [5]
#6: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [6]
# tx_2 tx_3 tx_4 tx_5
#1: Timepoint EGTMPT Categoricals{2,}elect one (nominal) 51 Screening
#2: N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3 <NA>
#3: Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3 <NA>
#4: Date Performed ECGDT Date 11
#5: Time (24-hour format) ECGTM Time 5
#6: O verall ECG Interpretation ECGRES Categoricals{2,}elect one (nominal) 37 Normal
与tidyr::separate
:
tidyr::separate(df, 'tx', paste0('col', 1:5), sep = '\s{2,}', fill = 'right')
用gsub
清洁后可以使用read.csv()
。
里面的gsub
给右边的数字列多了space,外面的把白色的space变成了逗号sep=','
中默认的[=] 11=].
dat_clean <- cbind(read.csv(text=gsub('\s{2,}', ',', gsub('\s+(\d+)', ' \1', trimws(df$tx))),
header=F, na.strings=''), Form=df$Form)
dat_clean
# V1 V2 V3 V4 V5 V6
# 1 [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening
# 2 [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3 <NA>
# 3 [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3 <NA>
# 4 [4] Date Performed ECGDT Date 11 <NA>
# 5 [5] Time (24-hour format) ECGTM Time 5 <NA>
# 6 [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal
# Form
# 1 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 2 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 3 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 4 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 5 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 6 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
如果我们可以使用 read.fwf()
会更好,但它似乎只能从文件中读取。
请注意,如果您的数据中左侧的列缺失,您可能需要稍微调整一下代码。
抱歉问了这么愚蠢的问题。我有一个看起来像这样的 df:
我想用 data.frame(do.call("rbind", strsplit(as.character(df$tx), "\s{2,}" )), stringsAsFactors=FALSE)
拆分 tx,如何在新的 df 中保留 Form
?另外,如果拆分结果为空,如何避免自动填充?
示例 df 可以使用以下方法构建:
df<- structure(list(tx = c(" [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening",
" [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3",
" [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3",
" [4] Date Performed ECGDT Date 11",
" [5] Time (24-hour format) ECGTM Time 5",
" [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal"
), Form = c("12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)",
"12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)"
)), row.names = c(NA, 6L), class = "data.frame")
输出将如下所示:
更新: 我应该如何以更好的方式分离 tx。我的旧代码似乎会产生错误。示例数据为:
df<-structure(list(tx = c("[6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal",
"[7] If A bnormal - Clinically Significant, describe ECGA BN Text or A ny V alue 200",
"[8] PR Interval (ms) ECGPRIN Number (continuous) 15",
"[1] Not Done PE2ND Categorical yes/no (dichotomous) 3",
"[2] If Not Done, specify reason: PE2NDR Text or A ny V alue 200",
"[4] Start Date: A ESTDTC Date 11",
"[5] End Date A EENDTC Date 11",
"[6] O ngoing: A EO NGO Categorical yes/no (dichotomous) 3",
"[7] Seriousness Criteria: (check all that apply) A ESA E Categorical select multiple 50",
"[8] Severity: A ECTCA E Categorical select one (nominal) 26 Grade 1 - Mild",
"[2] If Not Done, specify reason: CHMNO Text or A ny V alue 200",
"[6] Laboratory ID (NO TE: If Lab ID is not present, CHMID Categorical select one (nominal) 71 Christus Mother Frances Hospital Laboratory",
"[1] Has subject had any prior surgery related to the PSYN Categorical yes/no (dichotomous) 3",
"[2] Cycle 1 O nly: If less than the expected number of EXBC1 Categorical select one (nominal) 3",
"[3] Cycle EXBCYC Categorical select one (nominal) 8 Cycle 1",
"[4] Dose (mg) EXBDO S Number (continuous) 15",
"[5] Frequency EXBFRQ Categorical select one (nominal) 3 BID",
"[6] Start Date EXBSTDT Date 11",
"[7] Stop Date EXBENDT Date 11",
"[8] Reason for End Date/Stopping EXBREA Categorical select one (nominal) 36 Cycle Completed",
"[9] O ther Reason (specify) EXBREA S Text or A ny V alue 200"
)), row.names = c(NA, -21L), class = c("tbl_df", "tbl", "data.frame"
))
我得到的输出是:
黄色部分应该在 x3 中。我该怎么办?
您可以使用 splitstackshape::cSplit
:
splitstackshape::cSplit(df, 'tx', sep = '\s{2,}', fixed = FALSE)
# Form tx_1
#1: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [1]
#2: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [2]
#3: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [3]
#4: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [4]
#5: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [5]
#6: 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK) [6]
# tx_2 tx_3 tx_4 tx_5
#1: Timepoint EGTMPT Categoricals{2,}elect one (nominal) 51 Screening
#2: N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3 <NA>
#3: Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3 <NA>
#4: Date Performed ECGDT Date 11
#5: Time (24-hour format) ECGTM Time 5
#6: O verall ECG Interpretation ECGRES Categoricals{2,}elect one (nominal) 37 Normal
与tidyr::separate
:
tidyr::separate(df, 'tx', paste0('col', 1:5), sep = '\s{2,}', fill = 'right')
用gsub
清洁后可以使用read.csv()
。
里面的gsub
给右边的数字列多了space,外面的把白色的space变成了逗号sep=','
中默认的[=] 11=].
dat_clean <- cbind(read.csv(text=gsub('\s{2,}', ',', gsub('\s+(\d+)', ' \1', trimws(df$tx))),
header=F, na.strings=''), Form=df$Form)
dat_clean
# V1 V2 V3 V4 V5 V6
# 1 [1] Timepoint EGTMPT Categorical select one (nominal) 51 Screening
# 2 [2] N/A : O ff-Study EGTNA Categorical yes/no (dichotomous) 3 <NA>
# 3 [3] Check if Not Done EGTMPTND Categorical yes/no (dichotomous) 3 <NA>
# 4 [4] Date Performed ECGDT Date 11 <NA>
# 5 [5] Time (24-hour format) ECGTM Time 5 <NA>
# 6 [6] O verall ECG Interpretation ECGRES Categorical select one (nominal) 37 Normal
# Form
# 1 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 2 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 3 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 4 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 5 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
# 6 12-Lead Electrocardiogram (EG) at Log Pages (Dosing, ECG, PBMC, Biomarkers, PK)
如果我们可以使用 read.fwf()
会更好,但它似乎只能从文件中读取。
请注意,如果您的数据中左侧的列缺失,您可能需要稍微调整一下代码。