矢量化和循环版本 return 不同的答案
Vectorized and looped versions return different answers
我是运行产品扩散模拟研究。模拟从节点网络开始,并为少量初始节点播种产品。播种阶段之后的扩散受概率规则支配,这取决于采用该产品的节点的邻居数量。我用 R 写了这个模型的两个版本——一个是矢量化的,另一个是循环的。这个想法可能用代码更好地表达。
library(igraph)
set.seed(20130810)
g <- sample_smallworld(dim = 1, size = 1000, nei = 12, p = 0.6)
n.nodes <- length(V(g))
nbr.influence <- rnorm(n = n.nodes, mean = 0.18, sd = 0.01)
# Diffusion simulation with loops
nodes.status <- rep.int(0, n.nodes)
seed <- sample(V(g), size = as.integer(0.005*n.nodes))
nodes.status[seed] <- 1
cat("Number of nodes seeded (loop version): ", sum(nodes.status), "\n")
for (node in V(g)) {
if (nodes.status[node] != 1) {
n.active.nbrs = 0
for (nbr in neighbors(g, node)) {
if (nodes.status[nbr] == 1) n.active.nbrs <- n.active.nbrs + 1
}
prob.change <- 1 - (1 - nbr.influence[node])^n.active.nbrs
if (runif(n = 1) < prob.change) nodes.status[node] = 1
}
}
cat("Number of nodes engaged after one iteration (loop version): ",
sum(nodes.status), "\n")
# Vectorized diffusion simulation
A <- get.adjacency(g)
nodes.status <- rep.int(0, n.nodes)
seed <- sample(V(g), size = as.integer(0.005*n.nodes))
nodes.status[seed] <- 1
cat("Number of nodes seeded (vectorized version): ", sum(nodes.status), "\n")
# use the adjacency matrix to count number of active neighbors for each node
n.active.neighbours <- as.vector(A %*% nodes.status)
# build the activation probability vector
prob.change <- 1 - (1 - nbr.influence)^n.active.neighbours
# see which of the nodes are ready to activate
vuln.nodes <- runif(n = n.nodes) < prob.change
# activate those nodes which are ready
nodes.status[vuln.nodes > nodes.status] <- 1
cat("Number of nodes engaged after one iteration (vectorized version): ",
sum(nodes.status), "\n")
运行 此代码给出以下输出
Number of nodes seeded (loop version): 5
Number of nodes engaged after one iteration (loop version): 380
Number of nodes seeded (vectorized version): 5
Number of nodes engaged after one iteration (vectorized version): 32
两个版本的逻辑是一样的(即扩散是按相同的概率规则),但最终的答案却大相径庭。这段代码的错误在哪里?
您的 for 循环和 "vectorised" 版本正在做两件截然不同的事情。在 for
循环的所有 1000 次迭代中,您更新(总是增加,或至少不减少)激活概率向量。在向量化版本中,所有 1000 次迭代都已完成 "at the same time",因此所有概率都使用相同的向量计算。
例如,在 for 循环的第一次迭代中,您计算第一个节点被更新的概率。如果是这样,那么它会影响第二个节点更新的概率。在矢量化版本中,无论第一个节点是否更新,第二个节点更新的概率都不会受到影响。
如果要使两者相同,则必须使 for
循环在计算新状态时保留所有其他节点的原始状态。这是一个例子:
cat("Number of nodes seeded (loop version): ", sum(nodes.status), "\n")
new.nodes.status <- nodes.status # copy vector to preserve original state.
for (node in V(g)) {
if (nodes.status[node] != 1) {
n.active.nbrs = 0
for (nbr in neighbors(g, node)) {
if (nodes.status[nbr] == 1) n.active.nbrs <- n.active.nbrs + 1
}
prob.change <- 1 - (1 - nbr.influence[node])^n.active.nbrs
if (runif(n = 1) < prob.change) new.nodes.status[node] = 1 # Only update new.
}
}
cat("Number of nodes engaged after one iteration (loop version): ",
sum(new.nodes.status), "\n")
这让我很感动 32
。但是你也应该在每次 运行 得到完全相同的东西之前设置你的种子。
我是运行产品扩散模拟研究。模拟从节点网络开始,并为少量初始节点播种产品。播种阶段之后的扩散受概率规则支配,这取决于采用该产品的节点的邻居数量。我用 R 写了这个模型的两个版本——一个是矢量化的,另一个是循环的。这个想法可能用代码更好地表达。
library(igraph)
set.seed(20130810)
g <- sample_smallworld(dim = 1, size = 1000, nei = 12, p = 0.6)
n.nodes <- length(V(g))
nbr.influence <- rnorm(n = n.nodes, mean = 0.18, sd = 0.01)
# Diffusion simulation with loops
nodes.status <- rep.int(0, n.nodes)
seed <- sample(V(g), size = as.integer(0.005*n.nodes))
nodes.status[seed] <- 1
cat("Number of nodes seeded (loop version): ", sum(nodes.status), "\n")
for (node in V(g)) {
if (nodes.status[node] != 1) {
n.active.nbrs = 0
for (nbr in neighbors(g, node)) {
if (nodes.status[nbr] == 1) n.active.nbrs <- n.active.nbrs + 1
}
prob.change <- 1 - (1 - nbr.influence[node])^n.active.nbrs
if (runif(n = 1) < prob.change) nodes.status[node] = 1
}
}
cat("Number of nodes engaged after one iteration (loop version): ",
sum(nodes.status), "\n")
# Vectorized diffusion simulation
A <- get.adjacency(g)
nodes.status <- rep.int(0, n.nodes)
seed <- sample(V(g), size = as.integer(0.005*n.nodes))
nodes.status[seed] <- 1
cat("Number of nodes seeded (vectorized version): ", sum(nodes.status), "\n")
# use the adjacency matrix to count number of active neighbors for each node
n.active.neighbours <- as.vector(A %*% nodes.status)
# build the activation probability vector
prob.change <- 1 - (1 - nbr.influence)^n.active.neighbours
# see which of the nodes are ready to activate
vuln.nodes <- runif(n = n.nodes) < prob.change
# activate those nodes which are ready
nodes.status[vuln.nodes > nodes.status] <- 1
cat("Number of nodes engaged after one iteration (vectorized version): ",
sum(nodes.status), "\n")
运行 此代码给出以下输出
Number of nodes seeded (loop version): 5
Number of nodes engaged after one iteration (loop version): 380
Number of nodes seeded (vectorized version): 5
Number of nodes engaged after one iteration (vectorized version): 32
两个版本的逻辑是一样的(即扩散是按相同的概率规则),但最终的答案却大相径庭。这段代码的错误在哪里?
您的 for 循环和 "vectorised" 版本正在做两件截然不同的事情。在 for
循环的所有 1000 次迭代中,您更新(总是增加,或至少不减少)激活概率向量。在向量化版本中,所有 1000 次迭代都已完成 "at the same time",因此所有概率都使用相同的向量计算。
例如,在 for 循环的第一次迭代中,您计算第一个节点被更新的概率。如果是这样,那么它会影响第二个节点更新的概率。在矢量化版本中,无论第一个节点是否更新,第二个节点更新的概率都不会受到影响。
如果要使两者相同,则必须使 for
循环在计算新状态时保留所有其他节点的原始状态。这是一个例子:
cat("Number of nodes seeded (loop version): ", sum(nodes.status), "\n")
new.nodes.status <- nodes.status # copy vector to preserve original state.
for (node in V(g)) {
if (nodes.status[node] != 1) {
n.active.nbrs = 0
for (nbr in neighbors(g, node)) {
if (nodes.status[nbr] == 1) n.active.nbrs <- n.active.nbrs + 1
}
prob.change <- 1 - (1 - nbr.influence[node])^n.active.nbrs
if (runif(n = 1) < prob.change) new.nodes.status[node] = 1 # Only update new.
}
}
cat("Number of nodes engaged after one iteration (loop version): ",
sum(new.nodes.status), "\n")
这让我很感动 32
。但是你也应该在每次 运行 得到完全相同的东西之前设置你的种子。