Creating adjacency matrix with dirty dataset

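This is my second time posting this question; I deleted the first one because it lacked reproducibility.

I have consulted previously answered questions (Create adjacency matrix and social network graph, Creating adjacency matrix from raw data for centrality, Clean one column in a long dataset), but I am struggling with the step between cleaning the data and creating the matrix.

Here is a portion of the df I am working with -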

Species     Association              
1 RC          SKS/BW                   
2 BW          Sykes, rc                
3 SKS         Babo/bw                  
4 RC          baboon, mangabey         
5 Mang        red colobus, bw, sykes   
6 SKS         babo/red duiker

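I am working towards a simple social network matrix that answers "who associates with whom, and how often".

To clean the data, I selected the columns I need (Species and Association) and created a column indicating the specific site where the data were collected: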

df.clean <-  mutate(df, Association=fct_collapse(Association, 
  BW=c("SKS/BW" ,"Babo/bw", "red colobus, bw, sykes"), 
  RC=c("Sykes, rc" ,"red colobus, bw, sykes"), 
  SKS=c("SKS/BW", "Sykes, rc", "red colobus, bw, sykes"), 
  Mang=c("baboon, mangabey"), 
  BABO=c("Babo/bw", "baboon, mangabey", "babo/red duiker"), 
  RD=c("babo/red duiker"))) %>% 
select(Species, Association) %>% 
add_column(Site = "Protected") %>% 
filter(Species!= "RD", Association!= "RD") %>% 
mutate(Species = factor(as.character(Species)))

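However, when I look at the Association column after this step, I see only a single species value per row throughout the column (i.e. bw instead of bw, rc).

I assume I broke my dataset by using the fct_collapse() function while cleaning? I am looking for an output data frame like this -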

Species     Association              Site
1 RC          SKS, BW                  Protected
2 BW          SKS, RC                  Protected
3 SKS         BABO, BW                 Protected
4 RC          BABO, Mang               Protected
5 Mang        RC, BW, SKS              Protected
6 SKS         BABO                     Protected

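This brings me to my first question: what is the best way to clean dirty data like this while retaining multiple values per cell in a column? I am trying to create a data frame like the example above, on the assumption that I need to code both the Species and Association columns as numeric values to create my matrix. Will this work as written, or do I need to pull the data out of the column and create new columns?

I am relatively new to R, so please let me know if I am not making sense. Any advice is greatly appreciated, and I apologize for any confusion.

After running the code from the kind commenter, I ran into an encoding problem with the Association column. Everything works fine, except that wherever there is "R/c monkeys" or "B/W colobus" I get NA for the association, basically any time there is a "/" inside an association's name. Here is the dput output of a sample data frame for troubleshooting: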

structure(list(Species = structure(c(2L, 4L, 5L, 2L, 4L, 1L, 
5L, 4L, 5L, 4L), .Label = c("BABO", "BW", "Mang", "RC", "SKS"
), class = "factor"), Association = c("r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(12, 
12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

The dput data looks like -

Species    Association                  Year
<fctr>     <chr>                        <dbl>
BW         r/c monkeys                  12
RC         b/w colobus                  12
SKS        b/w colobus/R/c monkeys      12
BW         sykes/R/c monkeys            12
RC         sykes/b/w colobus            12
BABO       .                            12
SKS        .                            12
RC         .                            12
SKS        r/c monkeys                  12
RC         sykes monkeys                12

Ideal output -

Species    Association                  Year
<fctr>     <chr>                        <dbl>
BW         RC                           12
RC         BW                           12
SKS        BW, RC                       12
BW         SKS, RC                      12
RC         SKS, BW                      12
BABO       NA                           12
SKS        NA                           12
RC         NA                           12
SKS        RC                           12
RC         SKS                          12

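Use strsplit() and toString(). You just need a regex that covers all cases, e.g. '(?<=\w{2})\/|,\s', which splits on a "/" that is preceded by two word characters, or on a comma followed by whitespace. Enjoy the playground*.

*Note that only a single escape \ is needed there, whereas R needs a double escape \\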

regex <- '(?<=\\w{2})\\/|,\\s'
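As a quick check of what this pattern does, here is a minimal sketch that applies it to a few of the association strings from the question, uppercased first. The variable names `x` and `parts` are only for illustration.

```r
# Split on "/" only when two word characters precede it, or on ", ".
regex <- '(?<=\\w{2})\\/|,\\s'

# A few association strings taken from the question, uppercased.
x <- toupper(c("SKS/BW", "Sykes, rc", "b/w colobus/R/c monkeys", "r/c monkeys"))

# perl = TRUE is required because the pattern uses a lookbehind.
parts <- strsplit(x, regex, perl = TRUE)
parts
# [[1]] "SKS" "BW"
# [[2]] "SYKES" "RC"
# [[3]] "B/W COLOBUS" "R/C MONKEYS"
# [[4]] "R/C MONKEYS"
```

The slashes inside "B/W" and "R/C" survive because they are not preceded by two word characters, which is what keeps those names intact instead of turning them into NA later.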

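To fix the different versions of the names, we first use toupper(), which already removes some of the pain. Then use a dictionary, which you can neatly paste into the script using read.table():
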
dict <- read.table(header=TRUE, text='
            from  to
             "."  .
   "B/W COLOBUS"  BW
          "BABO"  BABO
        "BABOON"  BABO
            "BW"  BW
      "MANGABEY"  MANG
   "R/C MONKEYS"  RC
            "RC"  RC
   "RED COLOBUS"  COLO
    "RED DUIKER"  DUIK
           "SKS"  SKS
         "SYKES"  SKS
 "SYKES MONKEYS"  SKS
') 
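The lookup that such a dictionary enables is a plain match() against the `from` column. The sketch below uses a cut-down dictionary and includes one made-up name ("GORILLA", not in the original data) to show that anything missing from the dictionary comes back as NA, so those are the names still to be added.

```r
# Cut-down dictionary for illustration; the real one is defined above.
dict <- data.frame(
  from = c("SYKES", "SKS", "R/C MONKEYS", "RC"),
  to   = c("SKS",   "SKS", "RC",          "RC"),
  stringsAsFactors = FALSE
)

# "GORILLA" is a hypothetical name that is not in the dictionary.
x <- c("SYKES", "R/C MONKEYS", "GORILLA")

# match() returns the row of each name in dict$from, or NA if absent.
codes <- dict$to[match(x, dict$from)]
codes
# "SKS" "RC" NA

# Names that still need a dictionary entry.
x[is.na(codes)]
# "GORILLA"
```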

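You may find this helpful for building the dictionary; it prints the unique names found in the data, ready to paste in: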

spf <- '"%s"'
# spf <- '%s'  ## for the playground (see above)

data.frame(from=
             sprintf(spf, 
                     sort(unique(unlist(
                       strsplit(toupper(dat$Association), regex, perl=TRUE)
                     )))
             )
) |> print(row.names=FALSE)

Then strsplit, replace the names using the dictionary, and clean up a little:

res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
  lapply(\(x) dict[match(x, dict$from), ]$to) |>
  sapply(toString) |>
  {\(.) replace(., . == ".", NA)}() |>
  data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
  setNames(c('association', 'site', 'species', 'year')) |>
  subset(select=c(3, 1, 2, 4))

res
#    species   association      site year
# 1       RC       SKS, BW Protected   NA
# 2       BW       SKS, RC Protected   NA
# 3      SKS      BABO, BW Protected   NA
# 4       RC    BABO, MANG Protected   NA
# 5     MANG COLO, BW, SKS Protected   NA
# 6      SKS    BABO, DUIK Protected   NA
# 7       BW            RC Protected   12
# 8       RC            BW Protected   12
# 9      SKS        BW, RC Protected   12
# 10      BW       SKS, RC Protected   12
# 11      RC       SKS, BW Protected   12
# 12    BABO          <NA> Protected   12
# 13     SKS          <NA> Protected   12
# 14      RC          <NA> Protected   12
# 15     SKS            RC Protected   12
# 16      RC           SKS Protected   12
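The result above is the cleaned data frame; the adjacency matrix the question asks for is one further step. The following is only a sketch of one way to do it in base R, not part of the original answer. It assumes a data frame shaped like `res`; `res_small` is a hypothetical four-row stand-in copied from the printed output so that the block runs on its own.

```r
# Hypothetical stand-in for `res`: a few rows copied from its printed output.
res_small <- data.frame(
  species     = c("RC", "BW", "SKS", "BABO"),
  association = c("SKS, BW", "SKS, RC", "BABO, BW", NA),
  stringsAsFactors = FALSE
)

# Split each cell into its partners; an NA cell stays a single NA.
partners <- strsplit(res_small$association, ", ", fixed = TRUE)

# Edge list: one row per (observed species, associated species) pair.
edges <- data.frame(
  from = rep(res_small$species, lengths(partners)),
  to   = unlist(partners),
  stringsAsFactors = FALSE
)
edges <- edges[!is.na(edges$to), ]  # drop observations with no association

# Use one shared set of levels so the resulting matrix is square.
lev <- sort(unique(c(res_small$species, edges$to)))
adj <- table(from = factor(edges$from, levels = lev),
             to   = factor(edges$to,   levels = lev))
adj
```

Each cell counts how often the row species was recorded with the column species, so no numeric recoding of the names is needed; factors with shared levels are enough. If direction does not matter for the analysis, `adj + t(adj)` would give symmetric counts.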

Note: R >= 4.1 used.
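The pipeline relies on the native pipe |> and the \(x) lambda shorthand, both introduced in R 4.1. For older versions, the same steps can be written with intermediate variables and function(x). The sketch below uses a cut-down dictionary and three sample strings (`assoc` is an illustrative name) so that it runs on its own.

```r
regex <- '(?<=\\w{2})\\/|,\\s'

# Cut-down dictionary and sample input, for illustration only.
dict <- data.frame(
  from = c("SKS", "BW", "SYKES", "RC", "."),
  to   = c("SKS", "BW", "SKS",   "RC", "."),
  stringsAsFactors = FALSE
)
assoc <- c("SKS/BW", "Sykes, rc", ".")

# Same steps as the piped version, one at a time.
parts <- strsplit(toupper(assoc), regex, perl = TRUE)             # split names
codes <- lapply(parts, function(x) dict$to[match(x, dict$from)])  # translate
out   <- sapply(codes, toString)                                  # re-join
out[out == "."] <- NA                                             # "." means no association
out
# "SKS, BW" "SKS, RC" NA
```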


Data:

dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS", 
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey", 
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus", 
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus", 
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(NA, 
NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1", 
"2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61", 
"7", "8", "9", "10"), class = "data.frame")