用变量和年份分隔列名
Separate column names with variable and year
我正在使用来自 UNCTAD Eora 数据库的数据集 (https://worldmrio.com/unctadgvc/Database_GVC_2018update_rev0323.csv)
它提供了来自数据库的主要指标。
不幸的是,涵盖一组国家 1990-2018 年 5 个指标的数据集包含 147 个变量,因为 变量名称包含年份和指标 :例如DVX1990、DVX1991、...、VA_exp1990、VA_exp1991。变量名称 没有被 space 或任何其他分隔符 分隔。数字直接跟在字母后面。
head(SADC_DVX_VAexp)
V1 DVX1990 DVX1991 DVX1992 DVX1993 DVX1994
1: Angola 416258.300 446696.900 597805.100 499129.600 655503.900
2: Botswana 56364.310 54098.040 58326.030 58459.560 64559.360
3: DR Congo 236784.500 192829.600 191504.100 178502.700 198157.000
4: Lesotho 6222.682 5691.562 7669.519 8340.936 8990.118
5: Madagascar 79830.220 79074.500 91451.380 95090.820 110577.600
6: Malawi 51759.320 57607.130 56463.370 50866.980 50288.080
VA_exp1990 VA_exp1991 VA_exp1992 VA_exp1993 VA_exp1994 VA_exp1995
1: 1258915.00 1392318.00 1870753.00 1608752.00 2085250.00 1521991.00
2: 247510.00 247904.60 260424.40 245404.40 259269.50 303265.70
3: 417597.40 330625.30 316347.40 291127.50 317557.20 395383.60
4: 63759.54 59936.89 68551.54 59214.12 54261.97 55090.95
5: 325254.90 320890.60 370122.30 395407.80 456894.60 541780.80
6: 205316.70 244511.70 233732.50 210306.90 217603.90 255140.10
我想将宽数据集转换成长数据集,如下所示
V1 Year DVX VA_exp FVA GVC DVA
1: Angola 1990 value value value value value
2: Angola 1991 value value value value value
3: Angola 1992 value value value value value
4: ... ... value value value value value
5: Botswana 1990 value value value value value
6: Botswana 1991 value value value value value
但是,到目前为止我发现的函数要么单独处理每个变量,要么 and/or 仅当名称中有 space 或分隔符时才分隔名称。我想我需要应用一些东西来将 last 四位数字与列名称分开,但我不确定使用哪个函数。
如果有任何建议,我将不胜感激!
到目前为止,这是我的代码
# install packages
library(data.table)
library(dplyr)
library(tidyr)
# load the data ----------------------------------------------------
# set working directory
setwd("XXX")
# load cvs data into R
my_path <- file.path("Database_GVC_2018update_rev0323_Main indicators by country.csv")
DVX_VAexp <- fread(my_path)
# select countries of interest
# vector countries of interest
SADCNames <- c("Angola", "Botswana", "Swaziland", "Comoros", "DR Congo", "Lesotho", "Madagascar", "Malawi", "Mauritius", "Mozambique", "Namibia", "Seychelles", "South Africa", "Tanzania", "Zambia","Zimbabwe")
SADC_DVX_VAexp <- DVX_VAexp %>%
filter(DVX_VAexp$V1 %in% SADCNames)
您可以使用 pivot_longer
作为 -
tidyr::pivot_longer(SADC_DVX_VAexp, cols = -V1,
names_to = c('.value', 'year'),
names_pattern = '(.*?)(\d+)')
# V1 year DVX VA_exp
# <chr> <chr> <dbl> <dbl>
# 1 Angola 1990 416258. 1258915
# 2 Angola 1991 446697. 1392318
# 3 Angola 1992 597805. 1870753
# 4 Angola 1993 499130. 1608752
# 5 Angola 1994 655504. 2085250
# 6 Angola 1995 NA 1521991
# 7 Botswana 1990 56364. 247510
# 8 Botswana 1991 54098. 247905.
# 9 Botswana 1992 58326. 260424.
#10 Botswana 1993 58460. 245404.
# … with 26 more rows
数据
如果您在 reproducible format
中提供数据,会更容易提供帮助
SADC_DVX_VAexp <- structure(list(V1 = c("Angola", "Botswana", "DRCongo", "Lesotho",
"Madagascar", "Malawi"), DVX1990 = c(416258.3, 56364.31, 236784.5,
6222.682, 79830.22, 51759.32), DVX1991 = c(446696.9, 54098.04,
192829.6, 5691.562, 79074.5, 57607.13), DVX1992 = c(597805.1,
58326.03, 191504.1, 7669.519, 91451.38, 56463.37), DVX1993 = c(499129.6,
58459.56, 178502.7, 8340.936, 95090.82, 50866.98), DVX1994 = c(655503.9,
64559.36, 198157, 8990.118, 110577.6, 50288.08), VA_exp1990 = c(1258915,
247510, 417597.4, 63759.54, 325254.9, 205316.7), VA_exp1991 = c(1392318,
247904.6, 330625.3, 59936.89, 320890.6, 244511.7), VA_exp1992 = c(1870753,
260424.4, 316347.4, 68551.54, 370122.3, 233732.5), VA_exp1993 = c(1608752,
245404.4, 291127.5, 59214.12, 395407.8, 210306.9), VA_exp1994 = c(2085250,
259269.5, 317557.2, 54261.97, 456894.6, 217603.9), VA_exp1995 = c(1521991,
303265.7, 395383.6, 55090.95, 541780.8, 255140.1)), class = "data.frame", row.names = c(NA, -6L))
我正在使用来自 UNCTAD Eora 数据库的数据集 (https://worldmrio.com/unctadgvc/Database_GVC_2018update_rev0323.csv) 它提供了来自数据库的主要指标。 不幸的是,涵盖一组国家 1990-2018 年 5 个指标的数据集包含 147 个变量,因为 变量名称包含年份和指标 :例如DVX1990、DVX1991、...、VA_exp1990、VA_exp1991。变量名称 没有被 space 或任何其他分隔符 分隔。数字直接跟在字母后面。
head(SADC_DVX_VAexp)
V1 DVX1990 DVX1991 DVX1992 DVX1993 DVX1994
1: Angola 416258.300 446696.900 597805.100 499129.600 655503.900
2: Botswana 56364.310 54098.040 58326.030 58459.560 64559.360
3: DR Congo 236784.500 192829.600 191504.100 178502.700 198157.000
4: Lesotho 6222.682 5691.562 7669.519 8340.936 8990.118
5: Madagascar 79830.220 79074.500 91451.380 95090.820 110577.600
6: Malawi 51759.320 57607.130 56463.370 50866.980 50288.080
VA_exp1990 VA_exp1991 VA_exp1992 VA_exp1993 VA_exp1994 VA_exp1995
1: 1258915.00 1392318.00 1870753.00 1608752.00 2085250.00 1521991.00
2: 247510.00 247904.60 260424.40 245404.40 259269.50 303265.70
3: 417597.40 330625.30 316347.40 291127.50 317557.20 395383.60
4: 63759.54 59936.89 68551.54 59214.12 54261.97 55090.95
5: 325254.90 320890.60 370122.30 395407.80 456894.60 541780.80
6: 205316.70 244511.70 233732.50 210306.90 217603.90 255140.10
我想将宽数据集转换成长数据集,如下所示
V1 Year DVX VA_exp FVA GVC DVA
1: Angola 1990 value value value value value
2: Angola 1991 value value value value value
3: Angola 1992 value value value value value
4: ... ... value value value value value
5: Botswana 1990 value value value value value
6: Botswana 1991 value value value value value
但是,到目前为止我发现的函数要么单独处理每个变量,要么 and/or 仅当名称中有 space 或分隔符时才分隔名称。我想我需要应用一些东西来将 last 四位数字与列名称分开,但我不确定使用哪个函数。
如果有任何建议,我将不胜感激!
到目前为止,这是我的代码
# install packages
library(data.table)
library(dplyr)
library(tidyr)
# load the data ----------------------------------------------------
# set working directory
setwd("XXX")
# load cvs data into R
my_path <- file.path("Database_GVC_2018update_rev0323_Main indicators by country.csv")
DVX_VAexp <- fread(my_path)
# select countries of interest
# vector countries of interest
SADCNames <- c("Angola", "Botswana", "Swaziland", "Comoros", "DR Congo", "Lesotho", "Madagascar", "Malawi", "Mauritius", "Mozambique", "Namibia", "Seychelles", "South Africa", "Tanzania", "Zambia","Zimbabwe")
SADC_DVX_VAexp <- DVX_VAexp %>%
filter(DVX_VAexp$V1 %in% SADCNames)
您可以使用 pivot_longer
作为 -
tidyr::pivot_longer(SADC_DVX_VAexp, cols = -V1,
names_to = c('.value', 'year'),
names_pattern = '(.*?)(\d+)')
# V1 year DVX VA_exp
# <chr> <chr> <dbl> <dbl>
# 1 Angola 1990 416258. 1258915
# 2 Angola 1991 446697. 1392318
# 3 Angola 1992 597805. 1870753
# 4 Angola 1993 499130. 1608752
# 5 Angola 1994 655504. 2085250
# 6 Angola 1995 NA 1521991
# 7 Botswana 1990 56364. 247510
# 8 Botswana 1991 54098. 247905.
# 9 Botswana 1992 58326. 260424.
#10 Botswana 1993 58460. 245404.
# … with 26 more rows
数据
如果您在 reproducible format
中提供数据,会更容易提供帮助SADC_DVX_VAexp <- structure(list(V1 = c("Angola", "Botswana", "DRCongo", "Lesotho",
"Madagascar", "Malawi"), DVX1990 = c(416258.3, 56364.31, 236784.5,
6222.682, 79830.22, 51759.32), DVX1991 = c(446696.9, 54098.04,
192829.6, 5691.562, 79074.5, 57607.13), DVX1992 = c(597805.1,
58326.03, 191504.1, 7669.519, 91451.38, 56463.37), DVX1993 = c(499129.6,
58459.56, 178502.7, 8340.936, 95090.82, 50866.98), DVX1994 = c(655503.9,
64559.36, 198157, 8990.118, 110577.6, 50288.08), VA_exp1990 = c(1258915,
247510, 417597.4, 63759.54, 325254.9, 205316.7), VA_exp1991 = c(1392318,
247904.6, 330625.3, 59936.89, 320890.6, 244511.7), VA_exp1992 = c(1870753,
260424.4, 316347.4, 68551.54, 370122.3, 233732.5), VA_exp1993 = c(1608752,
245404.4, 291127.5, 59214.12, 395407.8, 210306.9), VA_exp1994 = c(2085250,
259269.5, 317557.2, 54261.97, 456894.6, 217603.9), VA_exp1995 = c(1521991,
303265.7, 395383.6, 55090.95, 541780.8, 255140.1)), class = "data.frame", row.names = c(NA, -6L))