tidyr: split a column with character and numerical values into two separate columns in R
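I have a dataset with an offense column that contains each offense description together with its associated offense code. The offense code is sometimes entirely numeric and sometimes a combination of numeric and character values.
How can I use tidyr in R to split this column into two columns, one for the offense code and one for the offense description?
Sample data column: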
Crime
123 Crime Description A
345 Crime Description B
678 Crime Description C
91011 Crime Description D
678(a)(1) Crime Description E
345(a)(32)(i) Crime Description F
143(a)(16) Crime Description G
678.08(a) Crime Description H
976.D1 Crime Description I
You can use sub here:
Crime$offense_code <- sub("^(\\d+(?:\\.\\w+)?(?:\\(.*?\\))*) .*$", "\\1", Crime$data, perl = TRUE)
Crime$offense_desc <- sub("^\\d+(?:\\.\\w+)?(?:\\(.*?\\))* (.*)$", "\\1", Crime$data, perl = TRUE)
Crime
                               data  offense_code        offense_desc
1           123 Crime Description A           123 Crime Description A
2           345 Crime Description B           345 Crime Description B
3           678 Crime Description C           678 Crime Description C
4         91011 Crime Description D         91011 Crime Description D
5     678(a)(1) Crime Description E     678(a)(1) Crime Description E
6 345(a)(32)(i) Crime Description F 345(a)(32)(i) Crime Description F
7    143(a)(16) Crime Description G    143(a)(16) Crime Description G
8     678.08(a) Crime Description H     678.08(a) Crime Description H
9        976.D1 Crime Description I        976.D1 Crime Description I
The regex pattern used here says to match:
^               from the start of the data field
\d+             an integer
(?:\.\w+)?      followed by an optional dot and word component
(?:\(.*?\))*    followed by zero or more (...) terms
[ ]             a single space
.*              then match the entire description
$               until the end of the data field
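To check the pattern against the sample values, here is a minimal sketch in base R. The vector name crime is a stand-in for the Crime$data column, the helper name code_rx is made up for this example, and perl = TRUE is an assumption added so that the non-capturing groups and the lazy quantifier are handled by the PCRE engine.

```r
# Stand-in for the Crime$data column (values copied from the sample data)
crime <- c("123 Crime Description A",
           "678(a)(1) Crime Description E",
           "345(a)(32)(i) Crime Description F",
           "678.08(a) Crime Description H",
           "976.D1 Crime Description I")

# Code pattern: digits, optional ".word" part, zero or more "(...)" groups
# (code_rx is a hypothetical helper name, not part of the original answer)
code_rx <- "\\d+(?:\\.\\w+)?(?:\\(.*?\\))*"

# First call keeps only the captured code, second keeps only the description
offense_code <- sub(paste0("^(", code_rx, ") .*$"), "\\1", crime, perl = TRUE)
offense_desc <- sub(paste0("^", code_rx, " (.*)$"), "\\1", crime, perl = TRUE)

print(data.frame(offense_code, offense_desc))
```

Note that the backslashes are doubled inside the R string literals; a single backslash before d or w is not a valid escape in an R string.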
You can split at the first whitespace character with tidyr::separate:
tidyr::separate(df1, 'Crime', c('offense_code', 'offense_description'),
                sep = '\\s', extra = 'merge')
#   offense_code offense_description
#1           123 Crime Description A
#2           345 Crime Description B
#3           678 Crime Description C
#4         91011 Crime Description D
#5     678(a)(1) Crime Description E
#6 345(a)(32)(i) Crime Description F
#7    143(a)(16) Crime Description G
#8     678.08(a) Crime Description H
#9        976.D1 Crime Description I
To keep the original column in the output, add remove = FALSE.
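As a rough illustration of the remove = FALSE variant, the sketch below runs tidyr::separate on a three-row subset of the sample data. It assumes the tidyr package is installed; the variable name out is chosen only for this example.

```r
library(tidyr)

# Small subset of the sample data, rebuilt here so the block runs on its own
df1 <- data.frame(Crime = c("123 Crime Description A",
                            "678(a)(1) Crime Description E",
                            "976.D1 Crime Description I"),
                  stringsAsFactors = FALSE)

# Split at the first whitespace; extra = "merge" keeps the rest of the text
# together in the second column, and remove = FALSE keeps the Crime column
out <- separate(df1, Crime,
                into = c("offense_code", "offense_description"),
                sep = "\\s", extra = "merge", remove = FALSE)
print(out)
```

Because extra = "merge" limits the split to as many pieces as there are new columns, only the first whitespace acts as a separator and the spaces inside the description are left alone.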
Using read.csv from base R:
read.csv(text = sub("\\s+", ",", df1$Crime), header = FALSE, col.names = c('offense_code', 'offense_description'))
   offense_code offense_description
1           123 Crime Description A
2           345 Crime Description B
3           678 Crime Description C
4         91011 Crime Description D
5     678(a)(1) Crime Description E
6 345(a)(32)(i) Crime Description F
7    143(a)(16) Crime Description G
8     678.08(a) Crime Description H
9        976.D1 Crime Description I
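The read.csv approach turns the first run of whitespace into a comma, so a description that itself contains a comma could be read as an extra field. If that is a concern, one possible alternative (not from the answers above) is to split at the first whitespace with two plain sub calls. This is a sketch only; the second row below is a hypothetical value invented to show the comma case.

```r
# First value is from the sample data; the second is a hypothetical
# description containing a comma, added only for illustration
crime <- c("123 Crime Description A",
           "678(a)(1) Crime Description E, aggravated")

# Drop everything from the first whitespace onward to get the code
offense_code <- sub("\\s.*$", "", crime)

# Drop the leading non-whitespace run plus the whitespace after it
offense_description <- sub("^\\S+\\s+", "", crime)

print(data.frame(offense_code, offense_description))
```

This variant relies on the same assumption as the other first-space splits: the offense code never contains whitespace.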
Data
df1 <- structure(list(Crime = c("123 Crime Description A", "345 Crime Description B",
"678 Crime Description C", "91011 Crime Description D", "678(a)(1) Crime Description E",
"345(a)(32)(i) Crime Description F", "143(a)(16) Crime Description G",
"678.08(a) Crime Description H", "976.D1 Crime Description I"
)), class = "data.frame", row.names = c(NA, -9L))