Finding common links in a matrix and classification by common intersection
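Suppose I have a distance cost matrix in which the cost of destination and the cost of origin both need to be below a certain threshold (say, US$ 100) for two localities to share a link. My difficulty lies in obtaining a common set after classifying these localities: A1 links (cost of destination and of origin both below the threshold) with A2, A3 and A4; A2 links with A1 and A4; A4 links with A1 and A2. So A1, A2 and A4 would be classified in the same group, as the group with the highest frequency of links among themselves. Below I set up a matrix as an example: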
      A1   A2   A3   A4   A5   A6   A7
A1     0   90   90   90  100  100  100
A2    80    0   90   90   90  110  100
A3    80  110    0   90  120  110   90
A4    90   90  110    0   90  100   90
A5   110  110  110  110    0   90   80
A6   120  130  135  100   90    0   90
A7   105  110  120   90   90   90    0
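I am programming this in Stata, and I have not stored the matrix above in matrix form, as in mata. The column listing the letter A plus a number is a variable holding the row names of the matrix, and the remaining columns are named after each locality (A1 and so on).
I have produced the list of links between each pair of localities with the code below, which is perhaps rather brute force because I was in a hurry: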
clear all
set more off
//inputting matrix
input A1 A2 A3 A4 A5 A6 A7
0 90 90 90 100 100 100
80 0 90 90 90 110 100
80 110 0 90 120 110 90
90 90 110 0 90 100 90
110 110 110 110 0 90 80
120 130 135 100 90 0 90
105 110 120 90 90 90 0
end
//generate row variable
gen locality=""
forv i=1/7{
replace locality="A`i'" in `i'
}
*
order locality, first
//generating who gets below the threshold of 100
forv i=1/7{
gen r_`i'=0
replace r_`i'=1 if A`i'<100 & A`i'!=0
}
*
//checking if both ways (origin and destination) are below the threshold
forv i=1/7{
gen check_`i'=.
forv j=1/7{
local v=r_`i'[`j']
local vv=r_`j'[`i']
replace check_`i'=`v'+`vv' in `j'
}
*
}
*
//creating list of links
gen locality_x=""
forv i=1/7{
preserve
local name = locality[`i']
keep if check_`i'==2
replace locality_x="`name'"
keep locality locality_x
save "C:\Users\user\Desktop\temp_`i'", replace
restore
}
*
use "C:\Users\user\Desktop\temp_1", clear
forv i=2/7{
append using "C:\Users\user\Desktop\temp_`i'"
}
*
//now locality_x lists whether A.1 has links with A.2, A.3 etc. and so on.
//the difficulty lies in finding a common intersection between the groups.
which returns the following listing:
locality_x locality
A1 A2
A1 A3
A1 A4
A2 A1
A2 A4
A3 A1
A4 A1
A4 A2
A4 A7
A5 A6
A5 A7
A6 A5
A6 A7
A7 A4
A7 A5
A7 A6
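I am trying to get familiar with set intersection, but I have no clue how to do this in Stata. I want something where I can change the threshold and find the common set again. I would be thankful if you could produce a solution in R, since I can program a bit in it.
A similar way of obtaining the list in R (as @user2957945 gave in the answer below):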
m <- structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
# get values less than threshold
id = m < 100
# make sure both values are less than threshold, and don't include diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
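I am also adding the "graph theory" tag, since I believe this is not exactly an intersection problem, in which I could turn the list into vectors and use the intersect function in R. The code needs to produce a new id such that some localities end up in the same new id (group). As in the example above, if the set of A.1 has A.2 and A.4, A.2 has A.1 and A.4, and A.4 has A.1 and A.2, these three localities must be in the same id (group). In other words, I need the largest intersection grouping for each locality. I understand that a different matrix may cause problems, for example if A.1 has A.2 and A.6, A.2 has A.1 and A.6, and A.6 has A.1 and A.2 (but A.6 does not have A.4, still considering the first example above). In that situation I welcome either adding A.6 to the grouping or some other arbitrary solution, in which the code just groups the first set together, removes A.1, A.2 and A.4 from the listing, and leaves A.6 with no new grouping.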
In R you can do
# get values less than threshold
id = m < 100
# make sure both values are less than threshold, and don't include diagonal
m_new = (id + t(id) == 2) & m !=0
# melt data and subset to keep TRUE values (TRUE if both less than threshold and not on diagonal)
result = subset(reshape2::melt(m_new), value)
# reorder to match question results , if needed
result[order(result[[1]], result[[2]]), 1:2]
Var1 Var2
8 A1 A2
15 A1 A3
22 A1 A4
2 A2 A1
23 A2 A4
3 A3 A1
4 A4 A1
11 A4 A2
46 A4 A7
40 A5 A6
47 A5 A7
34 A6 A5
48 A6 A7
28 A7 A4
35 A7 A5
42 A7 A6
Here m is the matrix from the question:
m <- structure(c(0L, 80L, 80L, 90L, 110L, 120L, 105L, 90L, 0L, 110L,
90L, 110L, 130L, 110L, 90L, 90L, 0L, 110L, 110L, 135L, 120L,
90L, 90L, 90L, 0L, 110L, 100L, 90L, 100L, 90L, 120L, 90L, 0L,
90L, 90L, 100L, 110L, 110L, 100L, 90L, 0L, 90L, 100L, 100L, 90L,
90L, 80L, 90L, 0L), .Dim = c(7L, 7L), .Dimnames = list(c("A1",
"A2", "A3", "A4", "A5", "A6", "A7"), c("A1", "A2", "A3", "A4",
"A5", "A6", "A7")))
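Since the question asks for a threshold that can be changed, the steps above could be wrapped in a small function. This is only a sketch in base R (it avoids reshape2); the function name link_list and the column names from and to are made up for illustration, and the matrix is retyped from the table in the question.

```r
# Example matrix, retyped from the table in the question (rows = origin).
m <- matrix(c(
    0,  90,  90,  90, 100, 100, 100,
   80,   0,  90,  90,  90, 110, 100,
   80, 110,   0,  90, 120, 110,  90,
   90,  90, 110,   0,  90, 100,  90,
  110, 110, 110, 110,   0,  90,  80,
  120, 130, 135, 100,  90,   0,  90,
  105, 110, 120,  90,  90,  90,   0
), nrow = 7, byrow = TRUE,
dimnames = list(paste0("A", 1:7), paste0("A", 1:7)))

# Hypothetical helper: list every ordered pair whose cost is below
# the threshold in both directions.
link_list <- function(m, threshold) {
  linked <- m < threshold & t(m) < threshold
  diag(linked) <- FALSE                      # a locality does not link to itself
  idx <- which(linked, arr.ind = TRUE)       # row/col positions of TRUE cells
  out <- data.frame(from = rownames(m)[idx[, "row"]],
                    to   = colnames(m)[idx[, "col"]],
                    stringsAsFactors = FALSE)
  out[order(out$from, out$to), ]
}

links <- link_list(m, threshold = 100)
nrow(links)  # 16 rows, the same pairs as the listing in the question
```

Calling link_list() again with a different threshold gives the list of links for that threshold without touching the rest of the code.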
Assuming that what you want is the largest complete subgraph, you can use the igraph package:
# Load necessary libraries
library(igraph)
# Define global parameters
threshold <- 100
# Compute the adjacency matrix
# (distances in both directions need to be smaller than the threshold)
am <- m < threshold & t(m) < threshold
# Make an undirected graph given the adjacency matrix
# (we set diag to FALSE so as not to draw links from a vertex to itself)
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)
# Find all the largest complete subgraphs
lc <- largest_cliques(gr)
# Output the list of complete subgraphs as a list of vertex names
lapply(lc, (function (e) e$name))
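The question also asks for a group id per locality, where the first set is grouped, removed from the listing, and the rest is handled afterwards. One possible reading of that, sketched below, is to take a largest clique, give its members an id, delete them from the graph, and repeat. This is an assumption about what is wanted, not an established method: when several largest cliques tie, the sketch simply takes the first one igraph returns, and the variable names (group, g, id, members) are illustrative.

```r
library(igraph)

# Example matrix, retyped from the table in the question (rows = origin).
m <- matrix(c(
    0,  90,  90,  90, 100, 100, 100,
   80,   0,  90,  90,  90, 110, 100,
   80, 110,   0,  90, 120, 110,  90,
   90,  90, 110,   0,  90, 100,  90,
  110, 110, 110, 110,   0,  90,  80,
  120, 130, 135, 100,  90,   0,  90,
  105, 110, 120,  90,  90,  90,   0
), nrow = 7, byrow = TRUE,
dimnames = list(paste0("A", 1:7), paste0("A", 1:7)))

threshold <- 100
# 0/1 adjacency matrix: linked when both directions are below the threshold
am <- (m < threshold & t(m) < threshold) * 1
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)

# Greedy peeling: assign an id to a largest clique, remove it, repeat.
group <- setNames(integer(vcount(gr)), V(gr)$name)
g <- gr
id <- 0L
while (vcount(g) > 0) {
  members <- largest_cliques(g)[[1]]$name   # ties: the first one is taken
  id <- id + 1L
  group[members] <- id
  g <- delete_vertices(g, members)
}

# Localities listed by group id
split(names(group), group)
```

With the example matrix this puts A1, A2 and A4 together, A5, A6 and A7 together, and leaves A3 in a group of its own; which of the two triples gets id 1 depends on the order in which igraph returns the tied cliques.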
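As far as I know, there is no similar functionality available in Stata. However, if you were looking for the largest connected subgraph (which is the whole graph in your case), you could use the clustering commands in Stata (i.e. clustermat).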
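For comparison, the connected-subgraph reading can be checked in R as well. The sketch below uses components() from igraph on the same example matrix; the variable names are illustrative, and it is meant only to show the difference from the clique-based grouping, not to reproduce what clustermat does.

```r
library(igraph)

# Example matrix, retyped from the table in the question (rows = origin).
m <- matrix(c(
    0,  90,  90,  90, 100, 100, 100,
   80,   0,  90,  90,  90, 110, 100,
   80, 110,   0,  90, 120, 110,  90,
   90,  90, 110,   0,  90, 100,  90,
  110, 110, 110, 110,   0,  90,  80,
  120, 130, 135, 100,  90,   0,  90,
  105, 110, 120,  90,  90,  90,   0
), nrow = 7, byrow = TRUE,
dimnames = list(paste0("A", 1:7), paste0("A", 1:7)))

threshold <- 100
am <- (m < threshold & t(m) < threshold) * 1
gr <- graph_from_adjacency_matrix(am, mode = "undirected", diag = FALSE)

# Connected components: localities reachable from each other through links
comp <- components(gr)
comp$no          # number of components
comp$membership  # component id of each locality
```

At a threshold of 100 all seven localities fall into a single component, because A4 and A7 link the A1/A2/A3/A4 side to the A5/A6/A7 side, so this grouping is much coarser than the one based on cliques.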