R: how to extract a variable from row values?

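I have a problem I can't seem to solve: despite trying the tidyverse functions pivot_longer and pivot_wider, I can't extract the year variable, which is stored in the row below the variable names.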

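The image above is a screenshot of the data; the working example is the same, except that the variable names have been translated into English.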

Here is a minimal working example:

library(tidyverse)
library(janitor)

file <- read_delim("C:/Users/Probst/Downloads/testdata.csv", 
                   delim = ";", 
                   escape_double = FALSE, 
                   trim_ws = TRUE,
                   col_types = list(col_double()), 
                   locale = locale(decimal_mark = ",",
                                   grouping_mark = ".")) |>
  clean_names() |> 
  select(-raumeinheit, # not interesting
         -aggregat)    # not interesting

The raw data (testdata.csv) is as follows:

spatialid;Raumeinheit;Aggregat;pop_male;pop_male;pop_male;avg_pop_age; avg_pop_age;avg_pop_age
;;;2017;2018;2019;2017;2018;2019
"01001";"Flensburg, Stadt";"kreisfreie Stadt";"44.086";"44.599";"44.904";"42,17";"42,03";"42,00"
"01002";"Kiel, Stadt";"kreisfreie Stadt";"120.809";"120.566";"120.198";"41,53";"41,59";"41,72"
"02000";"Hamburg, Stadt";"kreisfreie Stadt";"897.207";"902.048";"903.974";"41,67";"41,67";"41,66"

(Using read_delim(file = I(...)) unfortunately did not work, so I could not provide a cleaner working example.)

You can build an equivalent data frame with the following code:

file2 <- tribble(~spatialid, ~pop_male_4, ~pop_male_5, ~pop_male_6, ~avg_pop_age_7, ~avg_pop_age_8, ~avg_pop_age_9,
                NA,     2017, 2018, 2019, 2017, 2018, 2019,
                1001,   44086,  44599,  44904,  42.2, 42.0,    42,
                1002,   120809, 120566, 120198, 41.5, 41.6,  41.7,
                2000,   897207, 902048, 903974, 41.7, 41.7,  41.7)

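pop_male is the male population of each spatial unit, and avg_pop_age is the average age in each spatial unit. The problem with the dataset is that the year variable is stored in the row below the variable names, and I can't seem to extract it.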

I would like to end up with a "tidy" data frame that looks like this:

# rounding errors
desired_result <- 
tribble(~spatial_id, ~year, ~pop_male, ~avg_pop_age,
        1001,       2017,  44086,      42.2,
        1001,       2018,  44599,      42.0,
        1001,       2019,  44904,      42,
        1002,       2017,  120809,     41.5, 
        1002,       2018,  120566,     41.6,
        1002,       2019,  120198,     41.7,
        2000,       2017,  897207,     41.7, 
        2000,       2018,  902048,     41.7,
        2000,       2019,  903974,     41.7)

Any help or hints are greatly appreciated.

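For the first part of the problem, it is probably easiest to rename the columns manually: skip the first row when reading the file, then set the names with colnames():
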
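file <- read.delim("C:/Users/Probst/Downloads/testdata.csv",
                   sep = ";",                  # read.delim() takes sep, not delim
                   skip = 1,                   # drop the first header line (measure names)
                   colClasses = "character")   # keep "44.086" and "42,17" as text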

colnames(file) <-  c("spatialid",
                     "Raumeinheit",
                     "Aggregat",
                     "pop_male.2017",
                     "pop_male.2018",
                     "pop_male.2019",
                     "avg_pop_age.2017",
                     "avg_pop_age.2018",
                     "avg_pop_age.2019")
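
Typing the names by hand is fine for six columns, but the same names can also be assembled from the two header rows. The following is only a sketch in base R: `raw_lines`, `measure`, `year` and `new_names` are illustrative names that do not appear in the original post, and the two header lines are pasted in as text so the snippet runs on its own.

```r
# The two header lines of testdata.csv, copied here as plain text
# (assumption: in practice they would be read from the file itself)
raw_lines <- c(
  "spatialid;Raumeinheit;Aggregat;pop_male;pop_male;pop_male;avg_pop_age; avg_pop_age;avg_pop_age",
  ";;;2017;2018;2019;2017;2018;2019"
)

# Split each line on the semicolon and trim stray whitespace
# (one of the avg_pop_age headers has a leading space)
measure <- trimws(strsplit(raw_lines[1], ";", fixed = TRUE)[[1]])
year    <- trimws(strsplit(raw_lines[2], ";", fixed = TRUE)[[1]])

# Where a year exists, join measure and year with a dot;
# otherwise keep the plain column name
new_names <- ifelse(year == "", measure, paste(measure, year, sep = "."))

print(new_names)
```

With the real file, `raw_lines` could come from readLines() with n = 2 on the CSV path, and `colnames(file) <- new_names` would then replace the hand-typed vector.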

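If you separate the measure from the year with a dot in the column names, you can use the code below. The names_pattern argument takes a simple regular expression that splits each column name into two parts: each (.*) is a capture group (meaning: any character, any number of times), the first for the measure and the second for the year, and \. means that the two parts are separated by a literal dot (written as \\. inside an R string). You can then pivot wider to reach the expected output.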

file |>
  pivot_longer(cols = pop_male.2017:avg_pop_age.2019,
               names_to = c("measure", "year"),
               names_pattern = "(.*)\\.(.*)") |>
  pivot_wider(names_from = measure,
              values_from = value)

Output:

#  spatialid Raumeinheit      Aggregat         year  pop_male avg_pop_age
#  <chr>     <chr>            <chr>            <chr> <chr>    <chr>      
#1 01001     Flensburg, Stadt kreisfreie Stadt 2017  44.086   42,17      
#2 01001     Flensburg, Stadt kreisfreie Stadt 2018  44.599   42,03      
#3 01001     Flensburg, Stadt kreisfreie Stadt 2019  44.904   42,00      
#4 01002     Kiel, Stadt      kreisfreie Stadt 2017  120.809  41,53      
#5 01002     Kiel, Stadt      kreisfreie Stadt 2018  120.566  41,59      
#6 01002     Kiel, Stadt      kreisfreie Stadt 2019  120.198  41,72      
#7 02000     Hamburg, Stadt   kreisfreie Stadt 2017  897.207  41,67      
#8 02000     Hamburg, Stadt   kreisfreie Stadt 2018  902.048  41,67      
#9 02000     Hamburg, Stadt   kreisfreie Stadt 2019  903.974  41,66
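
The columns above are still character, with German number formatting, while the desired result in the question is numeric. As a hedged sketch of one way to close that gap: the special ".value" entry in names_to lets pivot_longer produce one column per measure directly, and readr's parse_number() with a locale handles the "." grouping mark and "," decimal mark. The objects `file_chr`, `german` and `tidy` are illustrative names, and `file_chr` is typed in by hand to mimic the renamed data with the two uninteresting columns dropped.

```r
library(tidyr)
library(dplyr)
library(readr)

# Hand-typed stand-in for the renamed character data (assumption: same
# values as testdata.csv, without Raumeinheit and Aggregat)
file_chr <- tibble::tribble(
  ~spatialid, ~pop_male.2017, ~pop_male.2018, ~pop_male.2019,
  ~avg_pop_age.2017, ~avg_pop_age.2018, ~avg_pop_age.2019,
  "01001", "44.086",  "44.599",  "44.904",  "42,17", "42,03", "42,00",
  "01002", "120.809", "120.566", "120.198", "41,53", "41,59", "41,72",
  "02000", "897.207", "902.048", "903.974", "41,67", "41,67", "41,66"
)

# German conventions: "." groups thousands, "," marks decimals
german <- locale(decimal_mark = ",", grouping_mark = ".")

tidy <- file_chr |>
  pivot_longer(
    cols = -spatialid,
    names_to = c(".value", "year"),  # ".value": the part before the dot becomes the column name
    names_sep = "\\."                # split each name at the literal dot
  ) |>
  mutate(
    year        = as.integer(year),
    pop_male    = parse_number(pop_male, locale = german),
    avg_pop_age = parse_number(avg_pop_age, locale = german)
  )

print(tidy)
```

spatialid is left as character here so that the leading zero in codes such as "01001" survives; converting it with as.integer() would give the 1001-style ids shown in the question.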