Creating origin-destination matrices with R
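My data frame consists of individuals and the city where each of them lived at a given point in time. I would like to generate one origin-destination matrix per year that records the number of moves from one city to another. I would like to know:
- How can I automatically generate the origin-destination table for every year in my dataset?
- How can I generate all tables in the same 5x5 format, 5 being the number of cities in my example?
- Is there more efficient code than what I propose below? I intend to run it on a very large dataset.
Consider the following example: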
# An example data frame
id=sample(1:5,50,TRUE)
year=sample(2005:2010,50,TRUE)
city=sample(paste0("City",1:5),50,TRUE)
# data.frame() keeps id and year numeric; cbind() would coerce every column to character
df=data.frame(id,year,city,stringsAsFactors=FALSE)
df=df[order(df$id,df$year),]
rm(id,year,city)
My best attempt:
# Creating variables
# Note: on the last row, i+1 is out of range, so destination, move and year_move are NA there
for(i in 1:length(df$id)){
  df$origin[i]=df$city[i]
  df$destination[i]=df$city[i+1]
  # Check whether a move has taken place and whether it is the same person
  df$move[i]=ifelse(df$origin[i]!=df$destination[i] & df$id[i]==df$id[i+1],1,0)
  # I assume the person moved exactly halfway between the two dates at which their location was recorded
  df$year_move[i]=ceiling((df$year[i]+df$year[i+1])/2)
}
df=df[df$move!=0,c("origin","destination","year_move")]
Creating the origin-destination table for 2007:
yr07=df[df$year_move==2007,]
table(yr07$origin,yr07$destination)
Result:
        City1 City2 City3 City5
  City1     0     0     1     2
  City2     2     0     0     0
  City5     1     1     0     0
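You can split the data by id, perform the necessary computation on each id-specific data frame to get all of that person's moves, and then recombine the results: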
spl <- split(df, df$id)
move.spl <- lapply(spl, function(x) {
ret <- data.frame(from=head(x$city, -1), to=tail(x$city, -1),
year=ceiling((head(x$year, -1)+tail(x$year, -1))/2),
stringsAsFactors=FALSE)
ret[ret$from != ret$to,]
})
(moves <- do.call(rbind, move.spl))
# from to year
# 1.1 City4 City2 2007
# 1.2 City2 City1 2008
# 1.3 City1 City5 2009
# 1.4 City5 City4 2009
# 1.5 City4 City2 2009
# ...
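Because this code uses vectorized computations for each id, it should be much faster than looping over every row of the data frame as in the code you provided.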
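If the per-id split itself becomes a bottleneck on a very large dataset, the same pairing of consecutive rows can be done in one pass over the whole data frame. The following is only a sketch on hypothetical toy data (the values are made up for illustration); it assumes the data frame is already sorted by id and then year and has at least two rows.

```r
# Hypothetical toy data in the same shape as the question's df
df <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  year = c(2005, 2007, 2008, 2006, 2009),
  city = c("City1", "City2", "City2", "City3", "City1"),
  stringsAsFactors = FALSE
)
df <- df[order(df$id, df$year), ]

n   <- nrow(df)
cur <- seq_len(n - 1)  # each row except the last
nxt <- cur + 1         # the row that follows it

# A pair of consecutive rows is a move if it is the same person and the city changed
is_move <- df$id[cur] == df$id[nxt] & df$city[cur] != df$city[nxt]

moves <- data.frame(
  from = df$city[cur][is_move],
  to   = df$city[nxt][is_move],
  year = ceiling((df$year[cur] + df$year[nxt])[is_move] / 2),
  stringsAsFactors = FALSE
)
print(moves)
```

Comparing ids across adjacent rows replaces the split/rbind step, so no per-person data frames are created.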
Now you can use split and table to get the 5x5 move matrix for each year:
moves$from <- factor(moves$from)
moves$to <- factor(moves$to)
lapply(split(moves, moves$year), function(x) table(x$from, x$to))
# $`2005`
#
# City1 City2 City3 City4 City5
# City1 0 0 0 0 1
# City2 0 0 0 0 0
# City3 0 0 0 0 0
# City4 0 0 0 0 0
# City5 0 0 1 0 0
#
# $`2006`
#
# City1 City2 City3 City4 City5
# City1 0 0 0 1 0
# City2 0 0 0 0 0
# City3 1 0 0 1 0
# City4 0 0 0 0 0
# City5 2 0 0 0 0
# ...
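One caveat: factor(moves$from) only creates levels for cities that actually occur as an origin somewhere in moves, and likewise for moves$to. If a city never appears on one side, the tables would lose a row or column. A possible safeguard, sketched here on hypothetical data and assuming the full list of cities is known in advance, is to set the levels explicitly:

```r
# Hypothetical moves in the same shape as the `moves` data frame above
moves <- data.frame(
  from = c("City1", "City5", "City3"),
  to   = c("City5", "City3", "City1"),
  year = c(2005, 2005, 2006),
  stringsAsFactors = FALSE
)

# Assumption: the complete set of cities is known up front
cities <- paste0("City", 1:5)
moves$from <- factor(moves$from, levels = cities)
moves$to   <- factor(moves$to,   levels = cities)

# One table per year; unused levels are kept, so every table is 5x5
od <- lapply(split(moves, moves$year),
             function(x) table(from = x$from, to = x$to))
print(od)
```

Here City2 and City4 never occur at all, yet each table still has five rows and five columns because table() counts over all factor levels.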
You can do this with dcast from reshape2 and a loop.
library(reshape2)
# write function
write_matrices <- function(year){
  # count the moves for each origin/destination pair in the given year
  mat <- dcast(subset(df, df$year_move == year), origin ~ destination,
               value.var = "year_move", fun.aggregate = length)
  print(year)
  print(mat)
}
# get the unique list of years (there was an NA in there, which is why this is longer than it needs to be)
years <- unique(df$year_move[!is.na(df$year_move)])
# loop through and get results
for (year in years){
  write_matrices(year)
}
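The only thing this does not address is the requirement that every matrix be 5x5: if not all 5 cities appear in a given year, only the cities present in that year are shown.
You can fix this by adding a step that first converts your observations into a frequency table, so that the missing combinations are included with a count of zero.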
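One way that frequency-table step might look in base R is sketched below. The data are hypothetical, the column names follow the question's origin, destination and year_move, and the list of cities is assumed to be known in advance; it is not necessarily what the answer's author had in mind.

```r
# Hypothetical moves using the question's column names
df <- data.frame(
  origin      = c("City1", "City2", "City1"),
  destination = c("City3", "City1", "City3"),
  year_move   = c(2007, 2007, 2008),
  stringsAsFactors = FALSE
)
cities <- paste0("City", 1:5)  # assumed full list of cities

# Long frequency table: one row per origin/destination/year combination,
# including combinations that never occur (n = 0)
freq <- as.data.frame(
  table(origin      = factor(df$origin, levels = cities),
        destination = factor(df$destination, levels = cities),
        year_move   = df$year_move),
  responseName = "n"
)

# Reshape each year's rows back into a 5x5 matrix of counts
mats <- lapply(split(freq, freq$year_move),
               function(x) xtabs(n ~ origin + destination, data = x))
print(mats)
```

Because the zero-count combinations are already rows in freq, any later reshaping step sees all 5 cities for every year.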