Split rownames from data frame

Question

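For a text mining project, I need to examine how a word list develops over time. To do that, I need to split the row names so that I have one column with the company name and one column with the year. Here is an excerpt of my data frame: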

                abs access allow analysis application approach base  big business challenge company
Adidas_2010.txt  13     25    26       11          41      132    1  266       13       115       1
Adidas_2011.txt   1      3     1        0           0        8    0   11        2        10       0
Adidas_2012.txt  29     35    37       22         110      181    7  384       31       136       3
Adidas_2013.txt  28     47    38       32         180      184    4  451       30       129       3
Adidas_2014.txt  12     42    38       27         159      207    6  921       32       128       6
Adidas_2016.txt  30     47    50       47         162      251    9 1061       32       171      13
Nike_2009.txt    16     15    17       12          33      177    9  346       93       196       1
Nike_2011.txt    10     30     0        3           0        0    0   81        7        31       0
Nike_2012.txt    21     22    12       57         199      300    7  214       11       107       3
Nike_2013.txt    20     32    30       11         123      321    4  331       90       239       3
Nike_2014.txt    33     43    30       33         119      137    6  441       67       318       6
Nike_2015.txt    51     42    41       27         102      151    9 1061       32       221      13

Here is my code:

dtm <- DocumentTermMatrix(corpus, control=list(dictionary = word_list))
df1 <- data.frame(as.matrix(dtm), row.names = filenames_annualreports)

I tried this:

 names_plus_year <- rownames(df1)
 names_plus_year_split <- strsplit(names_plus_year, "_")
 rownames(df1) <- sapply(names_plus_year_split, "[", 1)

But I get the following error:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed

Is there another way to split the row names? Thanks a lot! :)

Answer 1

You can split the row names, bind the pieces together by row, and then bind the result to your data frame by column, i.e.

 cbind.data.frame(df, do.call(rbind, strsplit(sub('\\..*', '', rownames(df)), '_')))

which gives,

                abs access allow analysis application approach base  big business challenge company      1    2
Adidas_2010.txt  13     25    26       11          41      132    1  266       13       115       1 Adidas 2010
Adidas_2011.txt   1      3     1        0           0        8    0   11        2        10       0 Adidas 2011
Adidas_2012.txt  29     35    37       22         110      181    7  384       31       136       3 Adidas 2012
Adidas_2013.txt  28     47    38       32         180      184    4  451       30       129       3 Adidas 2013
Adidas_2014.txt  12     42    38       27         159      207    6  921       32       128       6 Adidas 2014
Adidas_2016.txt  30     47    50       47         162      251    9 1061       32       171      13 Adidas 2016
Nike_2009.txt    16     15    17       12          33      177    9  346       93       196       1   Nike 2009
Nike_2011.txt    10     30     0        3           0        0    0   81        7        31       0   Nike 2011
Nike_2012.txt    21     22    12       57         199      300    7  214       11       107       3   Nike 2012
Nike_2013.txt    20     32    30       11         123      321    4  331       90       239       3   Nike 2013
Nike_2014.txt    33     43    30       33         119      137    6  441       67       318       6   Nike 2014
Nike_2015.txt    51     42    41       27         102      151    9 1061       32       221      13   Nike 2015

You can rename the two new columns as usual.
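The attempt in the question fails because data frame row names must be unique, and the company name alone repeats across years. A minimal base R sketch of the same idea with named columns follows. The small `df` here is only a stand-in for the real data, and the column names `company` and `year` are illustrative choices, not part of the original answer.

```r
# Small stand-in for the data frame in the question (two word columns only).
df <- data.frame(
  abs    = c(13L, 1L, 16L, 10L),
  access = c(25L, 3L, 15L, 30L),
  row.names = c("Adidas_2010.txt", "Adidas_2011.txt",
                "Nike_2009.txt", "Nike_2011.txt")
)

# Drop the ".txt" extension, then split "Company_Year" at the underscore.
stems <- tools::file_path_sans_ext(rownames(df))
parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# Store the pieces in named columns (year as integer) in front of the counts.
result <- cbind(
  data.frame(company = parts[, 1],
             year    = as.integer(parts[, 2]),
             stringsAsFactors = FALSE),
  df
)

# Row names must stay unique, so reset them to 1, 2, ... instead of
# overwriting them with the (repeated) company name.
rownames(result) <- NULL
result
```

This assumes every row name has exactly one underscore; a company name that itself contains `_` would need a different split rule.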

Data

dput(df)
structure(list(abs = c(13L, 1L, 29L, 28L, 12L, 30L, 16L, 10L, 
21L, 20L, 33L, 51L), access = c(25L, 3L, 35L, 47L, 42L, 47L, 
15L, 30L, 22L, 32L, 43L, 42L), allow = c(26L, 1L, 37L, 38L, 38L, 
50L, 17L, 0L, 12L, 30L, 30L, 41L), analysis = c(11L, 0L, 22L, 
32L, 27L, 47L, 12L, 3L, 57L, 11L, 33L, 27L), application = c(41L, 
0L, 110L, 180L, 159L, 162L, 33L, 0L, 199L, 123L, 119L, 102L), 
    approach = c(132L, 8L, 181L, 184L, 207L, 251L, 177L, 0L, 
    300L, 321L, 137L, 151L), base = c(1L, 0L, 7L, 4L, 6L, 9L, 
    9L, 0L, 7L, 4L, 6L, 9L), big = c(266L, 11L, 384L, 451L, 921L, 
    1061L, 346L, 81L, 214L, 331L, 441L, 1061L), business = c(13L, 
    2L, 31L, 30L, 32L, 32L, 93L, 7L, 11L, 90L, 67L, 32L), challenge = c(115L, 
    10L, 136L, 129L, 128L, 171L, 196L, 31L, 107L, 239L, 318L, 
    221L), company = c(1L, 0L, 3L, 3L, 6L, 13L, 1L, 0L, 3L, 3L, 
    6L, 13L)), row.names = c("Adidas_2010.txt", "Adidas_2011.txt", 
"Adidas_2012.txt", "Adidas_2013.txt", "Adidas_2014.txt", "Adidas_2016.txt", 
"Nike_2009.txt", "Nike_2011.txt", "Nike_2012.txt", "Nike_2013.txt", 
"Nike_2014.txt", "Nike_2015.txt"), class = "data.frame")

Answer 2

Hi Mary, using @Sotos's data:

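library(tidyverse)
new_df <- df %>% 
  # move the row names into a regular column so they can be split
  rownames_to_column(var = "row_name") %>% 
  separate(row_name, sep = "_", into = c("name", "year")) %>% 
  # escape the dot and anchor at the end so only a literal ".txt" suffix is removed
  mutate(year = str_remove(year, "\\.txt$"))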


new_df %>% as_tibble()
# A tibble: 12 x 13
   name   year    abs access allow analysis application approach  base   big business challenge company
   <chr>  <chr> <int>  <int> <int>    <int>       <int>    <int> <int> <int>    <int>     <int>   <int>
 1 Adidas 2010     13     25    26       11          41      132     1   266       13       115       1
 2 Adidas 2011      1      3     1        0           0        8     0    11        2        10       0
 3 Adidas 2012     29     35    37       22         110      181     7   384       31       136       3
 4 Adidas 2013     28     47    38       32         180      184     4   451       30       129       3
 5 Adidas 2014     12     42    38       27         159      207     6   921       32       128       6
 6 Adidas 2016     30     47    50       47         162      251     9  1061       32       171      13
 7 Nike   2009     16     15    17       12          33      177     9   346       93       196       1
 8 Nike   2011     10     30     0        3           0        0     0    81        7        31       0
 9 Nike   2012     21     22    12       57         199      300     7   214       11       107       3
10 Nike   2013     20     32    30       11         123      321     4   331       90       239       3
11 Nike   2014     33     43    30       33         119      137     6   441       67       318       6
12 Nike   2015     51     42    41       27         102      151     9  1061       32       221      13
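Since the goal is to follow the counts over time, it may be convenient to have `year` as an integer rather than a character column. The following is a sketch of one way to do the split and the conversion in a single `separate()` call. It assumes the tibble and tidyr packages are installed and that row names always look like `Company_Year.txt`; the small `df` is a stand-in for the real data.

```r
library(tibble)
library(tidyr)

# Small stand-in for the data frame in the question (two word columns only).
df <- data.frame(
  abs    = c(13L, 1L, 16L, 10L),
  access = c(25L, 3L, 15L, 30L),
  row.names = c("Adidas_2010.txt", "Adidas_2011.txt",
                "Nike_2009.txt", "Nike_2011.txt")
)

with_names <- rownames_to_column(df, var = "row_name")

# Split on "_" or "."; the NA in `into` drops the third piece ("txt"),
# and convert = TRUE turns the year strings into integers.
new_df <- separate(with_names, row_name,
                   into = c("name", "year", NA),
                   sep = "[_.]", convert = TRUE)
str(new_df)
```

The character class `[_.]` would also split on a dot or underscore inside a company name, so this shortcut only fits file names as regular as the ones shown.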

Answer 3

You can also do this with tidyverse functions:

library(tibble)
library(magrittr)
suppressPackageStartupMessages( library(tidyr) )

# example from base dataset
head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# convert row names to column
test <- tibble::rownames_to_column(.data = mtcars, var = "ID") %>% 
        tibble::as_tibble()

# check the values that are going to be split
test %>% .$ID %>% unique()
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
#>  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
#> [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
#> [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#> [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
#> [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
#> [31] "Maserati Bora"       "Volvo 142E"

# split the ID column into Brand and Model
test %>% 
        tidyr::separate(data = ., col = "ID", 
                        into = c("Brand", "Model"),
                        sep = " ",
                        extra = "merge"
        )
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
#> # A tibble: 32 x 13
#>    Brand  Model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda  RX4    21       6  160    110  3.9   2.62  16.5     0     1     4
#>  2 Mazda  RX4 ~  21       6  160    110  3.9   2.88  17.0     0     1     4
#>  3 Datsun 710    22.8     4  108     93  3.85  2.32  18.6     1     1     4
#>  4 Hornet 4 Dr~  21.4     6  258    110  3.08  3.22  19.4     1     0     3
#>  5 Hornet Spor~  18.7     8  360    175  3.15  3.44  17.0     0     0     3
#>  6 Valia~ <NA>   18.1     6  225    105  2.76  3.46  20.2     1     0     3
#>  7 Duster 360    14.3     8  360    245  3.21  3.57  15.8     0     0     3
#>  8 Merc   240D   24.4     4  147.    62  3.69  3.19  20       1     0     4
#>  9 Merc   230    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
#> 10 Merc   280    19.2     6  168.   123  3.92  3.44  18.3     1     0     4
#> # ... with 22 more rows, and 1 more variable: carb <dbl>
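The warning in the output above comes from "Valiant", which has no second piece. If that `NA` is acceptable, `separate()` can be told to pad short rows without a warning. This is a sketch under the assumption that tibble and tidyr are installed; the name `cars_id` is only illustrative.

```r
library(tibble)
library(tidyr)

cars_id <- rownames_to_column(mtcars, var = "ID")

# extra = "merge" keeps everything after the first space in Model;
# fill = "right" pads rows with too few pieces with NA on the right
# and does not raise the "Expected 2 pieces" warning.
split_cars <- separate(cars_id, ID,
                       into = c("Brand", "Model"),
                       sep = " ", extra = "merge", fill = "right")

head(split_cars[, c("Brand", "Model", "mpg")])
```

For the `Company_Year.txt` row names in the question, every name has both pieces, so neither `extra` nor `fill` should be needed there.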
