R: how to extract a variable from row values?
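I have a problem that I cannot seem to solve. Despite trying the tidyverse functions pivot_longer and pivot_wider, I cannot extract the year variable, which is stored in the row below the variable names.
The image above is a screenshot of the data; the working example is the same, except that the variable names have been translated into English.
Here is a minimal working example: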
library(tidyverse)
library(janitor)
file <- read_delim("C:/Users/Probst/Downloads/testdata.csv",
delim = ";",
escape_double = FALSE,
trim_ws = TRUE,
col_types = list(col_double()),
locale = locale(decimal_mark = ",",
grouping_mark = ".")) |>
clean_names() |>
select(-raumeinheit, # not interesting
-aggregat) # not interesting
The raw data (testdata.csv) is as follows
spatialid;Raumeinheit;Aggregat;pop_male;pop_male;pop_male;avg_pop_age; avg_pop_age;avg_pop_age
;;;2017;2018;2019;2017;2018;2019
"01001";"Flensburg, Stadt";"kreisfreie Stadt";"44.086";"44.599";"44.904";"42,17";"42,03";"42,00"
"01002";"Kiel, Stadt";"kreisfreie Stadt";"120.809";"120.566";"120.198";"41,53";"41,59";"41,72"
"02000";"Hamburg, Stadt";"kreisfreie Stadt";"897.207";"902.048";"903.974";"41,67";"41,67";"41,66"
(read_delim(file=I(...) unfortunately did not work for a cleaner working example.)
You will get the same file, with the following code:
file2 <- tribble(~spatialid, ~pop_male_4, ~pop_male_5, ~pop_male_6, ~avg_pop_age_7, ~avg_pop_age_8, ~avg_pop_age_9,
NA, 2017, 2018, 2019, 2017, 2018, 2019,
1001, 44086, 44599, 44904, 42.2, 42.0, 42,
1002, 120809, 120566, 120198, 41.5, 41.6, 41.7,
2000, 897207, 902048, 903974, 41.7, 41.7, 41.7)
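pop_male is the male population of each spatial unit and avg_pop_age is the average age in each spatial unit.
The problem with the dataset is that the year variable is stored in the row below the variable names, and I cannot seem to extract it.
I would like to end up with a "tidy" data frame that looks like this:
# rounding errors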
desired_result <-
tribble(~spatial_id, ~year, ~pop_male, ~avg_pop_age,
1001, 2017, 44086, 42.2,
1001, 2018, 44599, 42.0,
1001, 2019, 44904, 42,
1002, 2017, 120809, 41.5,
1002, 2018, 120566, 41.6,
1002, 2019, 120198, 41.7,
2000, 2017, 897207, 41.7,
2000, 2018, 902048, 41.7,
2000, 2019, 903974, 41.7)
Any help or hints are much appreciated.
For the first part of the problem, it is probably easiest to skip the first line and then rename the columns manually with colnames():
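# read_delim() from readr (loaded with the tidyverse) takes `delim`; base read.delim() does not
file <- read_delim("C:/Users/Probst/Downloads/testdata.csv",
delim = ";",
col_types = cols(.default = col_character()), # keep every value as text for now
skip = 1)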
colnames(file) <- c("spatialid",
"Raumeinheit",
"Aggregat",
"pop_male.2017",
"pop_male.2018",
"pop_male.2019",
"avg_pop_age.2017",
"avg_pop_age.2018",
"avg_pop_age.2019")
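As a sketch of an alternative to typing the names by hand, the same vector can be built from the two header lines themselves. The two strings below are copied from the raw data in the question; on a real file they could presumably be obtained with readLines(path, n = 2). The names header_lines, measures, years and new_names are only illustrative.

```r
# The first two lines of testdata.csv, copied from the question
header_lines <- c(
  "spatialid;Raumeinheit;Aggregat;pop_male;pop_male;pop_male;avg_pop_age; avg_pop_age;avg_pop_age",
  ";;;2017;2018;2019;2017;2018;2019"
)

# Split each line on ";" and trim the stray space in front of one name
parts <- strsplit(header_lines, ";", fixed = TRUE)
measures <- trimws(parts[[1]])
years <- trimws(parts[[2]])

# Append ".<year>" only where the second line actually holds a year
new_names <- ifelse(years == "", measures, paste(measures, years, sep = "."))
new_names
```

The result can then be assigned with colnames(file) <- new_names in place of the hand-written vector.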
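If you use a dot in the column names to separate the measure from the year, you can use the code below. The names_pattern = argument takes a simple regular expression that says the column name should be split into two parts: (.*) is both the first and the second part (meaning: any character, any number of times), and \. in between means: separated by a literal dot (written as \\. inside an R string). You can then pivot wider to arrive at the expected output.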
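To see what the two capture groups pick up before pivoting, the pattern can be tried on a couple of column names with base R's sub(). This is only a quick check, not part of the solution; cols and pattern are illustrative names.

```r
cols <- c("pop_male.2017", "avg_pop_age.2019")
pattern <- "(.*)\\.(.*)"  # group 1, a literal dot, group 2

# Replace the whole name by the first group (the measure)
measure <- sub(pattern, "\\1", cols)
# Replace the whole name by the second group (the year)
year <- sub(pattern, "\\2", cols)

measure
year
```

The same pattern is what names_pattern receives in the pivot code that follows.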
file |>
pivot_longer(cols = pop_male.2017:avg_pop_age.2019,
names_to = c("measure", "year"),
names_pattern = "(.*)\\.(.*)") |> # backslash doubled: "\." is not a valid escape in an R string
pivot_wider(names_from = measure,
values_from = value)
Output:
# spatialid Raumeinheit Aggregat year pop_male avg_pop_age
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 01001 Flensburg, Stadt kreisfreie Stadt 2017 44.086 42,17
#2 01001 Flensburg, Stadt kreisfreie Stadt 2018 44.599 42,03
#3 01001 Flensburg, Stadt kreisfreie Stadt 2019 44.904 42,00
#4 01002 Kiel, Stadt kreisfreie Stadt 2017 120.809 41,53
#5 01002 Kiel, Stadt kreisfreie Stadt 2018 120.566 41,59
#6 01002 Kiel, Stadt kreisfreie Stadt 2019 120.198 41,72
#7 02000 Hamburg, Stadt kreisfreie Stadt 2017 897.207 41,67
#8 02000 Hamburg, Stadt kreisfreie Stadt 2018 902.048 41,67
#9 02000 Hamburg, Stadt kreisfreie Stadt 2019 903.974 41,66
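All columns in this output are still character, while the desired result in the question is numeric. One possible last step, sketched here on a two-row stand-in for the output above (tidy_chr, tidy_num and de are illustrative names), is to parse the values with readr's parse_number() and the German-style locale that the question already uses.

```r
library(tidyverse)

# Two rows shaped like the pivoted output above, all stored as text
tidy_chr <- tibble(
  spatialid = c("01001", "01001"),
  year = c("2017", "2018"),
  pop_male = c("44.086", "44.599"),
  avg_pop_age = c("42,17", "42,03")
)

# "." groups thousands and "," marks decimals, as in the question's locale
de <- locale(decimal_mark = ",", grouping_mark = ".")

tidy_num <- tidy_chr |>
  mutate(
    year = as.integer(year),
    across(c(pop_male, avg_pop_age), \(x) parse_number(x, locale = de))
  )

tidy_num
```

spatialid is left as text here so that the leading zero is kept; converting it with as.integer() would give the 1001-style ids from the question's desired result.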