How to convert a variable number of columns per row to multiple rows in R?

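Here is my problem: I have an individual-level database: 1 row = 1 person. For each person there is a unique identifier ("INAMI_key"), individual variables ("code_qualif" in the example below), and one or more addresses filled in across separate columns. The number of addresses is given in the "n_adresses" variable: 1 means one address, 2 means two addresses, and so on. The addresses themselves are stored in the variables "travail_ruex" (street) and "travail_code_postalx" (postal code), where x is the address number. Here is what it looks like: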

   INAMI_key code_qualif n_adresses travail_rue1                       travail_code_postal1 travail_rue2                       travail_code_postal2 travail_rue3                       travail_code_postal3 travail_rue4                       travail_code_postal4 travail_rue5                       travail_code_postal5
 1 30000120  001         02         "RUE VAN ARTEVELDE               " " 1000"              "paul pastur                     " " 6180"              ""                                 ""                   ""                                 ""                   ""                                 ""                  
 2 30000417  001         01         "av Margueritr depasse           " " 6060"              ""                                 ""                   ""                                 ""                   ""                                 ""                   ""                                 ""                  
 3 37603435  007         01         "du Grand Veneur                 " " 1170"              ""                                 ""                   ""                                 ""                   ""                                 ""                   ""                                 ""                  
 4 38300152  007         02         "RUE WAUTERS                92-94" " 6040"              "de châtelet                     " " 6120"              ""                                 ""                   ""                                 ""                   ""                                 ""                  
 5 38707849  001         03         "de Campine                      " " 4000"              "de Chestret                     " " 4000"              "de la Gare                      " " 4020"              ""                                 ""                   ""                                 ""                  
 6 38813856  001         03         "Torhoutste steenweg             " " 8400"              "Lage Kaart                      " " 2930"              "De Vrièrestraat                 " " 8301"              ""                                 ""                   ""                                 ""                  
 7 38811084  001         04         "chaussée de Waterloo            " " 1180"              "avenue Napoléon                 " " 1420"              "rue Léon Théodor                " " 1090"              "avenue de la Basilique          " " 1081"              ""                                 ""                  
 8 39105054  001         04         "EMILE  CLAUS                    " " 1050"              "RUE DU FOYER SCHAERBEEKOIS      " " 1030"              "RUE XAVIER DE BUE               " " 1180"              "BV LAMBERMONT                   " " 1030"              ""                                 ""                  
 9 39117031  001         05         "KERKSTRAAT                      " " 3850"              "Pater Richard van de Wouwerstraa" " 3271"              "Wilderenlaan                    " " 3803"              "Gyzevennestraat                 " " 3560"              "Molenveldstraat                 " " 3500"             
10 31823918  070         05         "Route de l'Etat                 " " 1380"              "Avenue Paul Hymans              " " 1200"              "Avenue WInston Churchill        " " 1180"              "Avenue Winston Churchill        " " 1180"              "avenue hippocrate               " " 1200"             

Here is the code to import this example into R:

df <- structure(list(INAMI_key = c("30000120", "30000417", "37603435", 
"38300152", "38707849", "38813856", "38811084", "39105054", "39117031", 
"31823918"), code_qualif = c("001", "001", "007", "007", "001", 
"001", "001", "001", "001", "070"), n_adresses = c("02", "01", 
"01", "02", "03", "03", "04", "04", "05", "05"), travail_rue1 = c("RUE VAN ARTEVELDE               ", 
"av Margueritr depasse           ", "du Grand Veneur                 ", 
"RUE WAUTERS                92-94", "de Campine                      ", 
"Torhoutste steenweg             ", "chaussée de Waterloo            ", 
"EMILE  CLAUS                    ", "KERKSTRAAT                      ", 
"Route de l'Etat                 "), travail_code_postal1 = c(" 1000", 
" 6060", " 1170", " 6040", " 4000", " 8400", " 1180", " 1050", 
" 3850", " 1380"), travail_rue2 = c("paul pastur                     ", 
"", "", "de châtelet                     ", "de Chestret                     ", 
"Lage Kaart                      ", "avenue Napoléon                 ", 
"RUE DU FOYER SCHAERBEEKOIS      ", "Pater Richard van de Wouwerstraa", 
"Avenue Paul Hymans              "), travail_code_postal2 = c(" 6180", 
"", "", " 6120", " 4000", " 2930", " 1420", " 1030", " 3271", 
" 1200"), travail_rue3 = c("", "", "", "", "de la Gare                      ", 
"De Vrièrestraat                 ", "rue Léon Théodor                ", 
"RUE XAVIER DE BUE               ", "Wilderenlaan                    ", 
"Avenue WInston Churchill        "), travail_code_postal3 = c("", 
"", "", "", " 4020", " 8301", " 1090", " 1180", " 3803", " 1180"
), travail_rue4 = c("", "", "", "", "", "", "avenue de la Basilique          ", 
"BV LAMBERMONT                   ", "Gyzevennestraat                 ", 
"Avenue Winston Churchill        "), travail_code_postal4 = c("", 
"", "", "", "", "", " 1081", " 1030", " 3560", " 1180"), travail_rue5 = c("", 
"", "", "", "", "", "", "", "Molenveldstraat                 ", 
"avenue hippocrate               "), travail_code_postal5 = c("", 
"", "", "", "", "", "", "", " 3500", " 1200")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

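What I would like to do is multiply the rows for each person so that each address appears on its own row, but in the same fields. For example, if a person has 3 addresses, create 3 rows for that person, keeping the individual variables but reorganizing the addresses into columns with the same name: "travail_rue_total" and "travail_code_postal_total" in the example below. If the person has 1 address, create 1 row; if they have 5 addresses, create 5 rows, and so on: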

   INAMI_key code_qualif n_adresses travail_rue_total                travail_code_postal_total
 1  30000120           1          2 RUE VAN ARTEVELDE                                     1000
 2  30000120           1          2 paul pastur                                           6180
 3  30000417           1          1 av Margueritr depasse                                 6060
 4  37603435           7          1 du Grand Veneur                                       1170
 5  38300152           7          2 RUE WAUTERS                92-94                      6040
 6  38300152           7          2 de châtelet                                           6120
 7  38707849           1          3 de Campine                                            4000
 8  38707849           1          3 de Chestret                                           4000
 9  38707849           1          3 de la Gare                                            4020
10  38813856           1          3 Torhoutste steenweg                                   8400
11  38813856           1          3 Lage Kaart                                            2930
12  38813856           1          3 De Vrièrestraat                                       8301
13  38811084           1          4 chaussée de Waterloo                                  1180
14  38811084           1          4 avenue Napoléon                                       1420
15  38811084           1          4 rue Léon Théodor                                      1090
16  38811084           1          4 avenue de la Basilique                                1081
17  39105054           1          4 EMILE  CLAUS                                          1050
18  39105054           1          4 RUE DU FOYER SCHAERBEEKOIS                            1030
19  39105054           1          4 RUE XAVIER DE BUE                                     1180
20  39105054           1          4 BV LAMBERMONT                                         1030
21  39117031           1          5 KERKSTRAAT                                            3850
22  39117031           1          5 Pater Richard van de Wouwerstraa                      3271
23  39117031           1          5 Wilderenlaan                                          3803
24  39117031           1          5 Gyzevennestraat                                       3560
25  39117031           1          5 Molenveldstraat                                       3500
26  31823918          70          5 Route de l'Etat                                       1380
27  31823918          70          5 Avenue Paul Hymans                                    1200
28  31823918          70          5 Avenue WInston Churchill                              1180
29  31823918          70          5 Avenue Winston Churchill                              1180
30  31823918          70          5 avenue hippocrate                                     1200

This is a simplified version of the data. In the full database, I have 40 addresses x 15 variables (street, number, city, postal code, institution...).

Thanks!

You can use pivot_longer to get the data in long format, and filter to drop the empty values.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = starts_with('travail'), 
               names_to = '.value', 
               names_pattern = 'travail_(.*?)\\d+') %>%
  filter(rue != '')

#  INAMI_key code_qualif n_adresses rue                                code_postal
#   <chr>     <chr>       <chr>      <chr>                              <chr>      
# 1 30000120  001         02         "RUE VAN ARTEVELDE               " " 1000"    
# 2 30000120  001         02         "paul pastur                     " " 6180"    
# 3 30000417  001         01         "av Margueritr depasse           " " 6060"    
# 4 37603435  007         01         "du Grand Veneur                 " " 1170"    
# 5 38300152  007         02         "RUE WAUTERS                92-94" " 6040"    
# 6 38300152  007         02         "de châtelet                     " " 6120"    
# 7 38707849  001         03         "de Campine                      " " 4000"    
# 8 38707849  001         03         "de Chestret                     " " 4000"    
# 9 38707849  001         03         "de la Gare                      " " 4020"    
#10 38813856  001         03         "Torhoutste steenweg             " " 8400"    
# … with 20 more rows
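
The output above still has padded strings and generic column names, while the desired table in the question shows trimmed streets, numeric codes, and the "_total" names. Here is a minimal sketch of one way to get closer to that. It assumes a small two-person sample (`df_small`, made up here with the same column layout) and a hypothetical extra column `address_no` that keeps the address index. It is an illustration under those assumptions, not the only way to do it.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-person sample with the same layout as the question's data
df_small <- tibble(
  INAMI_key            = c("30000120", "30000417"),
  code_qualif          = c("001", "001"),
  n_adresses           = c("02", "01"),
  travail_rue1         = c("RUE VAN ARTEVELDE   ", "av Margueritr depasse  "),
  travail_code_postal1 = c(" 1000", " 6060"),
  travail_rue2         = c("paul pastur         ", ""),
  travail_code_postal2 = c(" 6180", "")
)

result <- df_small %>%
  # First capture group becomes the column name (.value),
  # second capture group (the trailing digits) goes to address_no
  pivot_longer(cols = starts_with("travail"),
               names_to = c(".value", "address_no"),
               names_pattern = "travail_(.*?)(\\d+)$") %>%
  # Strip the padding before testing for empty addresses
  mutate(across(c(rue, code_postal), trimws)) %>%
  filter(rue != "") %>%
  # Convert the code-like columns to integers, as in the desired output
  mutate(across(c(code_qualif, n_adresses, code_postal, address_no),
                as.integer)) %>%
  rename(travail_rue_total = rue,
         travail_code_postal_total = code_postal)

result
```

Because the pattern only relies on the "travail_" prefix and the trailing digits, the same call should carry over to the full database (40 addresses x 15 variables), provided every address column follows that naming scheme. Each distinct name between the prefix and the number would become its own column.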