Split rownames from data frame
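For a text mining project, I have to examine how a list of words develops over time. To do that, I need to split the row names so that I get one column with the company name and one column with the year. Here is an excerpt of my data frame: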
abs access allow analysis application approach base big business challenge company
Adidas_2010.txt 13 25 26 11 41 132 1 266 13 115 1
Adidas_2011.txt 1 3 1 0 0 8 0 11 2 10 0
Adidas_2012.txt 29 35 37 22 110 181 7 384 31 136 3
Adidas_2013.txt 28 47 38 32 180 184 4 451 30 129 3
Adidas_2014.txt 12 42 38 27 159 207 6 921 32 128 6
Adidas_2016.txt 30 47 50 47 162 251 9 1061 32 171 13
Nike_2009.txt 16 15 17 12 33 177 9 346 93 196 1
Nike_2011.txt 10 30 0 3 0 0 0 81 7 31 0
Nike_2012.txt 21 22 12 57 199 300 7 214 11 107 3
Nike_2013.txt 20 32 30 11 123 321 4 331 90 239 3
Nike_2014.txt 33 43 30 33 119 137 6 441 67 318 6
Nike_2015.txt 51 42 41 27 102 151 9 1061 32 221 13
This is my code:
dtm <- DocumentTermMatrix(corpus, control=list(dictionary = word_list))
df1 <- data.frame(as.matrix(dtm), row.names = filenames_annualreports)
I tried this:
names_plus_year <- rownames(df1)
names_plus_year_split <- strsplit(names_plus_year, "_")
rownames(df1) <- sapply(names_plus_year_split, "[", 1)
But I get the following error:
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
Is there another way to split the row names? Thanks a lot! :)
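For context, the error most likely comes from the fact that a data frame's row names must be unique, while keeping only the part before `_` produces repeated company names. Below is a minimal sketch of that, assuming a made-up three-row data frame `toy` (not the real data) with the same row-name pattern:

```r
# Toy data frame with the same row-name pattern as in the question
# (the values are made up for illustration).
toy <- data.frame(
  abs = c(13L, 1L, 16L),
  row.names = c("Adidas_2010.txt", "Adidas_2011.txt", "Nike_2009.txt")
)

# Keeping only the part before "_" yields repeated company names.
company <- sapply(strsplit(rownames(toy), "_"), "[", 1)
has_duplicates <- anyDuplicated(company) > 0

# Row names must be unique, so the assignment fails.
# Capture the error message instead of stopping the script.
msg <- tryCatch({
  rownames(toy) <- company
  "no error"
}, error = function(e) conditionMessage(e))

print(company)
print(has_duplicates)
print(msg)
```

So the split values need to go into ordinary columns rather than back into the row names.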
You can split the row names, bind the pieces by row, and then bind them to your data frame by column, i.e.
cbind.data.frame(df, do.call(rbind, strsplit(sub('\\..*', '', rownames(df)), '_')))
which gives,
abs access allow analysis application approach base big business challenge company 1 2
Adidas_2010.txt 13 25 26 11 41 132 1 266 13 115 1 Adidas 2010
Adidas_2011.txt 1 3 1 0 0 8 0 11 2 10 0 Adidas 2011
Adidas_2012.txt 29 35 37 22 110 181 7 384 31 136 3 Adidas 2012
Adidas_2013.txt 28 47 38 32 180 184 4 451 30 129 3 Adidas 2013
Adidas_2014.txt 12 42 38 27 159 207 6 921 32 128 6 Adidas 2014
Adidas_2016.txt 30 47 50 47 162 251 9 1061 32 171 13 Adidas 2016
Nike_2009.txt 16 15 17 12 33 177 9 346 93 196 1 Nike 2009
Nike_2011.txt 10 30 0 3 0 0 0 81 7 31 0 Nike 2011
Nike_2012.txt 21 22 12 57 199 300 7 214 11 107 3 Nike 2012
Nike_2013.txt 20 32 30 11 123 321 4 331 90 239 3 Nike 2013
Nike_2014.txt 33 43 30 33 119 137 6 441 67 318 6 Nike 2014
Nike_2015.txt 51 42 41 27 102 151 9 1061 32 221 13 Nike 2015
You can rename the new columns (`1` and `2`) as usual.
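One way to do the renaming is to name the new columns while binding them. This is a minimal sketch, assuming a made-up stand-in `df_small` for `df`; the column names `company` and `year` are my own choice, and the year is converted to integer so it can be sorted or plotted over time:

```r
# Small stand-in for `df` (values are made up; only the row-name pattern matters).
df_small <- data.frame(
  abs = c(13L, 1L, 16L),
  row.names = c("Adidas_2010.txt", "Adidas_2011.txt", "Nike_2009.txt")
)

# Drop the ".txt" extension, then split "Company_Year" into a
# two-column character matrix (one row per original row name).
parts <- do.call(rbind, strsplit(sub("\\.txt$", "", rownames(df_small)), "_"))

# Bind the pieces as named columns; the year becomes an integer.
result <- cbind.data.frame(
  df_small,
  company = parts[, 1],
  year    = as.integer(parts[, 2])
)

print(result)
```

The original row names are kept here; `rownames(result) <- NULL` would reset them to 1, 2, 3 if they are no longer needed.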
DATA
dput(df)
structure(list(abs = c(13L, 1L, 29L, 28L, 12L, 30L, 16L, 10L,
21L, 20L, 33L, 51L), access = c(25L, 3L, 35L, 47L, 42L, 47L,
15L, 30L, 22L, 32L, 43L, 42L), allow = c(26L, 1L, 37L, 38L, 38L,
50L, 17L, 0L, 12L, 30L, 30L, 41L), analysis = c(11L, 0L, 22L,
32L, 27L, 47L, 12L, 3L, 57L, 11L, 33L, 27L), application = c(41L,
0L, 110L, 180L, 159L, 162L, 33L, 0L, 199L, 123L, 119L, 102L),
approach = c(132L, 8L, 181L, 184L, 207L, 251L, 177L, 0L,
300L, 321L, 137L, 151L), base = c(1L, 0L, 7L, 4L, 6L, 9L,
9L, 0L, 7L, 4L, 6L, 9L), big = c(266L, 11L, 384L, 451L, 921L,
1061L, 346L, 81L, 214L, 331L, 441L, 1061L), business = c(13L,
2L, 31L, 30L, 32L, 32L, 93L, 7L, 11L, 90L, 67L, 32L), challenge = c(115L,
10L, 136L, 129L, 128L, 171L, 196L, 31L, 107L, 239L, 318L,
221L), company = c(1L, 0L, 3L, 3L, 6L, 13L, 1L, 0L, 3L, 3L,
6L, 13L)), row.names = c("Adidas_2010.txt", "Adidas_2011.txt",
"Adidas_2012.txt", "Adidas_2013.txt", "Adidas_2014.txt", "Adidas_2016.txt",
"Nike_2009.txt", "Nike_2011.txt", "Nike_2012.txt", "Nike_2013.txt",
"Nike_2014.txt", "Nike_2015.txt"), class = "data.frame")
Hi Mary, using @Sotos's data:
library(tidyverse)
new_df <- df %>%
  rownames_to_column(var = "row_name") %>%
  separate(row_name, sep = "_", into = c("name", "year")) %>%
  mutate(year = year %>% str_remove("\\.txt$"))
new_df %>% as_tibble()
# A tibble: 12 x 13
name year abs access allow analysis application approach base big business challenge company
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 Adidas 2010 13 25 26 11 41 132 1 266 13 115 1
2 Adidas 2011 1 3 1 0 0 8 0 11 2 10 0
3 Adidas 2012 29 35 37 22 110 181 7 384 31 136 3
4 Adidas 2013 28 47 38 32 180 184 4 451 30 129 3
5 Adidas 2014 12 42 38 27 159 207 6 921 32 128 6
6 Adidas 2016 30 47 50 47 162 251 9 1061 32 171 13
7 Nike 2009 16 15 17 12 33 177 9 346 93 196 1
8 Nike 2011 10 30 0 3 0 0 0 81 7 31 0
9 Nike 2012 21 22 12 57 199 300 7 214 11 107 3
10 Nike 2013 20 32 30 11 123 321 4 331 90 239 3
11 Nike 2014 33 43 30 33 119 137 6 441 67 318 6
12 Nike 2015 51 42 41 27 102 151 9 1061 32 221 13
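If a company name could itself contain an underscore, splitting on `_` would give more than two pieces. A possible variant, sketched here under the assumption that every row name looks like `<name>_<4-digit year>.txt`, uses `tidyr::extract()` with a regular expression. The data frame `df_small` and the name `Under_Armour` are made up for illustration:

```r
library(tibble)
library(tidyr)

# Made-up stand-in for `df`; one name deliberately contains an underscore.
df_small <- data.frame(
  abs = c(13L, 1L, 16L),
  row.names = c("Adidas_2010.txt", "Nike_2009.txt", "Under_Armour_2012.txt")
)

# Move the row names into a regular column first.
with_names <- rownames_to_column(df_small, var = "row_name")

# Capture everything before the last "_" as the name and the four digits
# before ".txt" as the year; convert = TRUE turns the year into an integer.
new_small <- tidyr::extract(
  with_names, row_name,
  into    = c("name", "year"),
  regex   = "^(.+)_([0-9]{4})\\.txt$",
  convert = TRUE
)

print(new_small)
```

Rows that do not match the pattern get `NA` in both new columns, which makes unexpected file names easy to spot.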
You can also do this with tidyverse functions:
library(tibble)
library(magrittr)
suppressPackageStartupMessages( library(tidyr) )
# example from base dataset
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# convert row names to column
test <- tibble::rownames_to_column(.data = mtcars, var = "ID") %>%
  tibble::as_tibble()
# check the values that are going to be split
test %>% .$ID %>% unique()
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
#> [7] "Duster 360" "Merc 240D" "Merc 230"
#> [10] "Merc 280" "Merc 280C" "Merc 450SE"
#> [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
#> [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
#> [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
#> [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
#> [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
#> [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
#> [31] "Maserati Bora" "Volvo 142E"
# split the ID column into two columns
test %>%
  tidyr::separate(data = ., col = "ID",
                  into = c("Brand", "Model"),
                  sep = " ",
                  extra = "merge"
  )
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
#> # A tibble: 32 x 13
#> Brand Model mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 Mazda RX4 ~ 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 4 Hornet 4 Dr~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3
#> 5 Hornet Spor~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3
#> 6 Valia~ <NA> 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
#> # ... with 22 more rows, and 1 more variable: carb <dbl>
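The warning above appears because `Valiant` contains no space, so there is nothing to put into `Model`. If that is expected, the `fill` argument of `tidyr::separate()` should silence it. A small sketch on the first six rows of `mtcars` (the object names `cars_small` and `split_cars` are my own):

```r
library(tibble)
library(tidyr)

# First six rows of mtcars, with the row names moved into an ID column.
cars_small <- rownames_to_column(head(mtcars), var = "ID")

# extra = "merge" keeps everything after the first space in Model;
# fill = "right" pads missing pieces with NA on the right without a warning.
split_cars <- separate(
  cars_small, col = "ID",
  into  = c("Brand", "Model"),
  sep   = " ",
  extra = "merge",
  fill  = "right"
)

print(split_cars[, c("Brand", "Model")])
```

For the annual-report file names this should not matter, since each of them contains exactly one `_`.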