How to reshape flat data from a SOAP result with R or Bash?
I am using SOAP to extract data from the BRENDA enzyme database. After extraction I get flat data like this:
ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6
I would like to reshape the data into the following form:
| ecNumber | piValue |
|---|---|
| 3.2.1.23 | 6.9 |
| 3.2.1.23 | 7.1 |
| 4.4.1.14 | 6 |
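Can I do this with awk? Or some Bash command? Or R?
In the future, please post the solution you have tried first. It is better to post a question that shows what you attempted, rather than just asking 'how do I do this?'
That said, this is easy to do in R.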
library(tidyverse)
# full string
main = "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
# split the string by delimiters
split_vec <- str_split(main, pattern = "#|!")
# arrange into tibble
df <- tibble(split_vec) %>%
  unnest(c(split_vec)) %>%
  mutate(col_name = str_extract(string = split_vec, pattern = "ecNumber|piValue"),
         split_vec = gsub(x = split_vec, pattern = "ecNumber|piValue", "")) %>%
  # trick to make sure that rows 1,2 and 3,4 etc. get labeled together -> this is our needed 'grouper' variable
  mutate(rn = ceiling(row_number()/2)); df
#> # A tibble: 6 × 3
#>   split_vec col_name    rn
#>   <chr>     <chr>    <dbl>
#> 1 3.2.1.23  ecNumber     1
#> 2 6.9       piValue      1
#> 3 3.2.1.23  ecNumber     2
#> 4 7.1       piValue      2
#> 5 4.4.1.14  ecNumber     3
#> 6 6         piValue      3
# final answer
df2 <- df %>%
  # spread the columns wider to get the dataframe into your specifications
  pivot_wider(id_cols = rn,
              names_from = col_name,
              values_from = split_vec) %>%
  dplyr::select(-rn)
df2
#> # A tibble: 3 × 2
#>   ecNumber piValue
#>   <chr>    <chr>
#> 1 3.2.1.23 6.9
#> 2 3.2.1.23 7.1
#> 3 4.4.1.14 6
Created on 2022-04-15 by the reprex package (v2.0.1)
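Since the question also asks about Bash, the split-then-pair idea used above can be sketched in the shell too. This is only a minimal sketch: `pair_up` is a made-up helper name, and it assumes every record has exactly one `ecNumber` followed by one `piValue`, as in the sample string.

```shell
#!/bin/sh
# Sample string copied from the question.
main='ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6'

# pair_up is a hypothetical helper name, not part of any tool.
# Step 1 (tr):    turn both delimiters into newlines, one token per line.
# Step 2 (sed):   drop the leading field name from each token.
# Step 3 (paste): join lines 1+2, 3+4, ... with a tab, like the 'rn' grouper in R.
pair_up() {
  printf '%s\n' "$1" |
    tr '#!' '\n\n' |
    sed -E 's/^(ecNumber|piValue)//' |
    paste - -
}

pair_up "$main"
```

The output is tab-separated with no header line; it relies purely on the order of the tokens, so a record with a missing field would shift every pair after it.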
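In base R, we can use read.dcf after inserting \n: each field is rewritten as "name: value" on its own line, and records are separated by a blank line, which is the layout read.dcf expects.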
str2 <- gsub("#", "\n", gsub("!", "\n\n", gsub("([a-z])([0-9])", "\\1: \\2", str1)))
read.dcf(textConnection(str2), all = TRUE)
  ecNumber piValue
1 3.2.1.23     6.9
2 3.2.1.23     7.1
3 4.4.1.14       6
data
str1 <- "ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6"
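For the awk part of the question, the same record/field view of the string (records split on `!`, fields split on `#`) might look roughly like this. It is a sketch under assumptions, not a tested general solution: `reshape` is a made-up function name, and the field order `ecNumber#piValue` is assumed to be fixed.

```shell
#!/bin/sh
# Sample string copied from the question.
str='ecNumber3.2.1.23#piValue6.9!ecNumber3.2.1.23#piValue7.1!ecNumber4.4.1.14#piValue6'

# reshape is a hypothetical helper name.
# tr puts one record per line; awk then splits each line on "#",
# strips the field names, and prints a tab-separated table with a header.
reshape() {
  printf '%s\n' "$1" | tr '!' '\n' | awk -F'#' '
    BEGIN { OFS = "\t"; print "ecNumber", "piValue" }
    NF == 2 {
      sub(/^ecNumber/, "", $1)   # keep only the EC number
      sub(/^piValue/, "", $2)    # keep only the pI value
      print $1, $2
    }'
}

reshape "$str"
```

Lines that do not have exactly two `#`-separated fields are skipped by the `NF == 2` guard rather than reported, which is a simplification you may want to change for real BRENDA output.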