Merge two datasets with different structures and add rows
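I have two data frames that I want to merge (expenditure and cabinets). They have different numbers of columns and rows.
The expenditure df has the variable year complete for every country from 1995 to 2019, while the cabinets df is missing some data points, i.e. I do not have the complete series: some countries only have 1996-1997-1999-2004-2005-2007, and other countries have a different structure (because it is based on election dates).
I basically want to add the columns from df cabinets (mainly because I need the variable polarization) to df expenditure, but I cannot solve the problem of the different number of rows.
The first has this structure, with 800 obs and 130 variables: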
```
2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014), country = c("Austria", "Austria", "Austria", "Austria",
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria",
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria",
"Austria", "Austria", "Austria", "Austria"), abv = c("aut", "aut",
"aut", "aut", "aut", "aut", "aut", "aut", "aut", "aut", "aut",
"aut", "aut", "aut", "aut", "aut", "aut", "aut", "aut", "aut"
), country_n = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1), unit = c("million euro", "million euro", "million euro",
"million euro", "million euro", "million euro", "million euro",
"million euro", "million euro", "million euro", "million euro",
"million euro", "million euro", "million euro", "million euro",
"million euro", "million euro", "million euro", "million euro",
"million euro"), election_date = c("17/12/95", NA, NA, NA, "03/10/99",
NA, NA, "24/11/02", NA, NA, NA, "01/10/06", NA, "28/09/08", NA,
NA, NA, NA, "29/09/13", NA), fract_leg = c(0.71, 0.71, 0.71,
0.71, 0.71, 0.71, 0.71, 0.65, 0.65, 0.65, 0.65, 0.7, 0.7, 0.77,
0.77, 0.77, 0.77, 0.77, 0.78, 0.78), enp_leg = c(3.49, 3.49,
3.49, 3.49, 3.41, 3.41, 3.41, 2.88, 2.88, 2.88, 2.88, 3.37, 3.37,
4.27, 4.27, 4.27, 4.27, 4.27, 4.59, 4.59)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
```
The second looks like this, with 375 obs and 7 columns:
```
structure(list(country = c("Austria", "Austria", "Austria", "Austria",
"Austria", "Austria", "Austria", "Austria", "Austria", "Austria",
"Austria", "Austria", "Austria", "Austria", "Belgium", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium",
"Belgium", "Bulgaria"), abv = c("AUT", "AUT", "AUT", "AUT", "AUT",
"AUT", "AUT", "AUT", "AUT", "AUT", "AUT", "AUT", "AUT", "AUT",
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL",
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BGR"), year = c(1996,
1997, 1999, 2000, 2002, 2003, 2005, 2007, 2008, 2013, 2016, 2017,
2019, 2019, 1995, 1999, 2003, 2007, 2007, 2008, 2008, 2009, 2010,
2011, 2014, 2014, 2018, 2019, 2019, 1995), election_date = c("17/12/95",
"17/12/95", "03/10/99", "03/10/99", "24/11/02", "24/11/02", "24/11/02",
"01/10/06", "28/09/08", "29/09/13", "29/09/13", "15/10/17", "15/10/17",
"29/09/19", "21/05/95", "13/06/99", "18/05/03", "10/06/07", "10/06/07",
"10/06/07", "10/06/07", "10/06/07", "13/06/10", "13/06/10", "25/05/14",
"25/05/14", "25/05/14", "26/05/19", "26/05/19", "18/12/94"),
cabinet_name = c("Vranitzky V", "Klima I", "Klima II", "Schuessel I",
"Schuessel II", "Schuessel III", "Schuessel IV", "Gusenbauer",
"Faymann I", "Faymann II", "Kern", "Kurz I", "Bierlein I",
"Bierlein II", "Dehaene II", "Verhofstadt I", "Verhofstadt II",
"Verhofstadt III", "Verhofstadt IV", "Leterme I", "Rompuy",
"Leterme II", "Leterme III", "Di Rupo", "Di Rupo II", "Michel I",
"Michel II", "Michel III", "Wilmes I", "Videnov"), polarization = c("2,744",
"2,744", "2,744", "1,8761", "1,8761", "1,8761", "2,3567",
"2,744", "2,744", "2,744", "2,744", "1,8761", "caretaker",
"caretaker", "2,836", "4,4291", "4,0746", "4,0746", "4,0746",
"4,0746", "4,0746", "4,0746", "3,7582", "4,0746", "4,0746",
"1,2386", "1,2386", "1,2386", "1,2386", "0")), row.names = c(NA,
-30L), class = c("tbl_df", "tbl", "data.frame"))
```
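As far as I understand, you want to keep all rows of expenditure and add the information from cabinets. For that you can use a dplyr join:
library(dplyr)
output <- left_join(expenditure, cabinets, by = c("country", "year"))
This adds the columns of cabinets to expenditure and writes NA into the rows for which the cabinets data frame has no information.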
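One thing to watch for with the sample data above: the cabinets sample contains repeated country-year pairs (for example Austria 2019 appears twice), polarization is stored as text with decimal commas and the value "caretaker", and both data frames carry abv and election_date columns. A plain left join would therefore add rows for the repeated years and return the shared columns with .x/.y suffixes. The following is only a sketch on toy data built from a few of the Austrian rows shown above; the expenditure rows, the name cabinets_clean, and the rule of keeping the last cabinet listed for each year are assumptions made for illustration, not something stated in the question.

```r
library(dplyr)

# Toy expenditure frame: one row per country-year (assumed shape, years 2016-2019)
expenditure <- tibble(
  country = "Austria",
  year    = c(2016, 2017, 2018, 2019)
)

# A few cabinet rows copied from the sample above; note the repeated 2019
cabinets <- tibble(
  country      = "Austria",
  year         = c(2016, 2017, 2019, 2019),
  cabinet_name = c("Kern", "Kurz I", "Bierlein I", "Bierlein II"),
  polarization = c("2,744", "1,8761", "caretaker", "caretaker")
)

cabinets_clean <- cabinets %>%
  mutate(
    # "caretaker" is not a number, so treat it as missing
    polarization = na_if(polarization, "caretaker"),
    # replace the decimal comma with a point, then convert to numeric
    polarization = as.numeric(sub(",", ".", polarization, fixed = TRUE))
  ) %>%
  # keep a single cabinet per country-year (assumption: the last one listed)
  group_by(country, year) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  # carry over only the join keys and the columns that are actually needed
  select(country, year, cabinet_name, polarization)

# Every expenditure row is kept; years without a cabinet entry get NA
output <- left_join(expenditure, cabinets_clean, by = c("country", "year"))
```

With this toy input, output has the same number of rows as expenditure, and 2018 (which has no cabinet entry) gets NA in the added columns. If a different cabinet should represent a year with several entries, only the slice_tail() step needs to change.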