Creating an adjacency matrix with a dirty dataset
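This is my second time posting this question; I deleted the first one because it was not reproducible.
I have consulted previously answered questions (Create adjacency matrix and social network graph, Create adjacency matrix from raw data for centrality, Cleaning one column in a long dataset), but I am stuck between cleaning the data and creating the matrix.
Here is part of the df I am working with: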
Species Association
1 RC SKS/BW
2 BW Sykes, rc
3 SKS Babo/bw
4 RC baboon, mangabey
5 Mang red colobus, bw, sykes
6 SKS babo/red duiker
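I am trying to create a simple social network matrix to answer "who associates with whom, and at what frequency".
To clean the data, I selected the columns I need (Species and Association) and created a column indicating the specific site where the data was collected: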
library(tidyverse)  # dplyr, forcats, tibble

df.clean <- mutate(df, Association=fct_collapse(Association,
    BW=c("SKS/BW" ,"Babo/bw", "red colobus, bw, sykes"),
    RC=c("Sykes, rc" ,"red colobus, bw, sykes"),
    SKS=c("SKS/BW", "Sykes, rc", "red colobus, bw, sykes"),
    Mang=c("baboon, mangabey"),
    BABO=c("Babo/bw", "baboon, mangabey", "babo/red duiker"),
    RD=c("babo/red duiker"))) %>%
  select(Species, Association) %>%
  add_column(Site = "Protected") %>%
  filter(Species!= "RD", Association!= "RD") %>%
  mutate(Species = factor(as.character(Species)))
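However, when I look at the Association column after this step, I see only one species value per row throughout the column (i.e. bw rather than bw, rc).
I assume I am breaking my dataset by using fct_collapse() during cleaning? I am looking for an output data frame like this: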
Species Association Site
1 RC SKS, BW Protected
2 BW SKS, RC Protected
3 SKS BABO, BW Protected
4 RC BABO, Mang Protected
5 Mang RC, BW, SKS Protected
6 SKS BABO Protected
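This brings me to my first question: what is the best way to clean dirty data like this while keeping multiple values in a column? I am trying to create a data frame like the example above, on the assumption that I will need to code both the Species and Association columns as numeric values to create my matrix. Would that work as written, or do I need to extract the data from the column and create new columns?
I am relatively new to R, so please let me know if I am not making sense. Any advice is much appreciated, and I apologize for any confusion.
After running the code from a kind commenter, I ran into a coding problem with the Association column. Everything runs fine, except that whenever there is "R/c monkeys" or "B/W colobus" I get NA for the association, basically any time there is a "/" inside the name of an associate. Here is the dput output of a sample data frame for troubleshooting: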
structure(list(Species = structure(c(2L, 4L, 5L, 2L, 4L, 1L,
5L, 4L, 5L, 4L), .Label = c("BABO", "BW", "Mang", "RC", "SKS"
), class = "factor"), Association = c("r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(12,
12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The dput output looks like this:
Species Association Year
<fctr> <chr> <dbl>
BW r/c monkeys 12
RC b/w colobus 12
SKS b/w colobus/R/c monkeys 12
BW sykes/R/c monkeys 12
RC sykes/b/w colobus 12
BABO . 12
SKS . 12
RC . 12
SKS r/c monkeys 12
RC sykes monkeys 12
Ideal output:
Species Association Year
<fctr> <chr> <dbl>
BW RC 12
RC BW 12
SKS BW, RC 12
BW SKS, RC 12
RC SKS, BW 12
BABO NA 12
SKS NA 12
RC NA 12
SKS RC 12
RC SKS 12
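Use strsplit() and toString(). Just use a regular expression that covers all cases, e.g. '(?<=\w{2})\/|,\s'. Enjoy the playground*.
*Note that only single escapes \ are needed there, whereas R needs double escapes \\.
regex <- '(?<=\\w{2})\\/|,\\s'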
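To see what the pattern actually splits on, here is a minimal sketch using three of the raw strings from the question (the names `split_rx`, `raw` and `parts` are only illustrative). A "/" counts as a separator only when two word characters precede it, so the slash inside "b/w" or "R/c" is left alone.

```r
# Same pattern as above, written with R's doubled backslashes
split_rx <- '(?<=\\w{2})\\/|,\\s'

raw <- c("SKS/BW", "Sykes, rc", "b/w colobus/R/c monkeys")

# perl = TRUE is required because the pattern uses a lookbehind
parts <- strsplit(raw, split_rx, perl = TRUE)
parts

# toString() glues each vector of pieces back into one comma-separated string
sapply(parts, toString)
```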
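To fix the differing versions of the names, we first apply toupper(), which already takes away some of the pain. Then use a dictionary, which you can conveniently paste into your script using read.table():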
dict <- read.table(header=TRUE, text='
from to
"." .
"B/W COLOBUS" BW
"BABO" BABO
"BABOON" BABO
"BW" BW
"MANGABEY" MANG
"R/C MONKEYS" RC
"RC" RC
"RED COLOBUS" COLO
"RED DUIKER" DUIK
"SKS" SKS
"SYKES" SKS
"SYKES MONKEYS" SKS
')
You may find the following helpful for building it:
spf <- '"%s"'
# spf <- '%s' ## for the playground (see above)
data.frame(from=
sprintf(spf,
sort(unique(unlist(
strsplit(toupper(dat$Association), regex, perl=TRUE)
)))
)
) |> print(row.names=FALSE)
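Before applying the dictionary, it can help to confirm that every token has an entry, because match() returns NA for anything missing and that NA would quietly end up in the result. A small sketch, assuming a cut-down hypothetical dictionary `dict_demo` and a made-up token vector `tokens` (neither is part of the data above):

```r
dict_demo <- data.frame(
  from = c("SYKES", "SYKES MONKEYS", "B/W COLOBUS"),
  to   = c("SKS", "SKS", "BW"),
  stringsAsFactors = FALSE
)

# The last token is invented and deliberately has no dictionary entry
tokens <- c("SYKES", "B/W COLOBUS", "BLUE MONKEY")

idx <- match(tokens, dict_demo$from)  # row of each token in `from`, NA if absent
dict_demo$to[idx]                     # translated labels, NA where unmapped
tokens[is.na(idx)]                    # tokens still missing from the dictionary
```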
Then strsplit, replace the names using the dictionary, and clean up a bit:
res <- strsplit(toupper(dat$Association), regex, perl=TRUE) |>
lapply(\(x) dict[match(x, dict$from), ]$to) |>
sapply(toString) |>
{\(.) replace(., . == ".", NA)}() |>
data.frame('Protected', as.factor(toupper(dat$Species)), dat$year) |>
setNames(c('association', 'site', 'species', 'year')) |>
subset(select=c(3, 1, 2, 4))
res
# species association site year
# 1 RC SKS, BW Protected NA
# 2 BW SKS, RC Protected NA
# 3 SKS BABO, BW Protected NA
# 4 RC BABO, MANG Protected NA
# 5 MANG COLO, BW, SKS Protected NA
# 6 SKS BABO, DUIK Protected NA
# 7 BW RC Protected 12
# 8 RC BW Protected 12
# 9 SKS BW, RC Protected 12
# 10 BW SKS, RC Protected 12
# 11 RC SKS, BW Protected 12
# 12 BABO <NA> Protected 12
# 13 SKS <NA> Protected 12
# 14 RC <NA> Protected 12
# 15 SKS RC Protected 12
# 16 RC SKS Protected 12
Note: R >= 4.1 used.
Data:
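With the associations cleaned, the "who associates with whom, and how often" matrix from the question can be tabulated in base R. The following is only a sketch of one possible approach, not part of the pipeline above: `res_demo` is a hypothetical four-row stand-in for `res`, and the names `edges`, `lv`, `adj` and `sym` are illustrative.

```r
# Hypothetical stand-in for the first rows of the cleaned result
res_demo <- data.frame(
  species     = c("RC", "BW", "SKS", "RC"),
  association = c("SKS, BW", "SKS, RC", "BABO, BW", NA),
  stringsAsFactors = FALSE
)

# Split each association string into its partners; skip rows with no association
keep     <- !is.na(res_demo$association)
partners <- strsplit(res_demo$association[keep], ", ", fixed = TRUE)

# One row per (species, partner) pair
edges <- data.frame(
  from = rep(res_demo$species[keep], lengths(partners)),
  to   = unlist(partners),
  stringsAsFactors = FALSE
)

# Shared factor levels on both sides make the table square
lv  <- sort(unique(c(edges$from, edges$to)))
adj <- table(factor(edges$from, lv), factor(edges$to, lv))
adj

# Undirected version: add the counts from both directions
sym <- adj + t(adj)
sym
```

Here `adj` counts directed mentions (row species recorded with column species), while `sym` sums both directions for the case where an association should be treated as undirected.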
dat <- structure(list(Species = c("RC", "BW", "SKS", "RC", "Mang", "SKS",
"BW", "RC", "SKS", "BW", "RC", "BABO", "SKS", "RC", "SKS", "RC"
), Association = c("SKS/BW", "Sykes, rc", "Babo/bw", "baboon, mangabey",
"red colobus, bw, sykes", "babo/red duiker", "r/c monkeys", "b/w colobus",
"b/w colobus/R/c monkeys", "sykes/R/c monkeys", "sykes/b/w colobus",
".", ".", ".", "r/c monkeys", "sykes monkeys"), year = c(NA,
NA, NA, NA, NA, NA, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12)), row.names = c("1",
"2", "3", "4", "5", "6", "11", "21", "31", "41", "51", "61",
"7", "8", "9", "10"), class = "data.frame")