How to convert columns into adjacency matrices and node tables stored in a list, by complex conditions [R]

I have a data frame that contains one day of travel information for households.

df <- data.frame(
hid=c("10001","10001","10001","10001"),
mid=c(1,2,3,4),
thc=c("010","01010","0","02030"),
mdc=c("000","01010","0","02020"),
thc1=c(0,0,0,0),
thc2=c(1,1,NA,2),
thc3=c(0,0,NA,0),
thc4=c(NA,1,NA,3),
thc5=c(NA,0,NA,0),
mdc1=c(0,0,0,0),
mdc2=c(0,1,NA,2),
mdc3=c(0,0,NA,0),
mdc4=c(NA,1,NA,2),
mdc5=c(NA,0,NA,0)
)

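hid: household id (the actual data frame contains many more households)
mid: household member number
thc: a string giving the order of the member's activities during the day;
0 = inside the house, any other digit (1, 2, ...) = the unique ID of a place s/he visited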

So a code of 01020 means s/he went from home (0) to place 1, returned home (0), went from home to another place 2, and then returned home (0), all within one day.
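The encoding described above can be unpacked into origin/destination pairs with a few lines of base R. This is only a minimal sketch: it assumes every place ID is a single digit, as in the sample data, and the names `stops` and `trips` are illustrative.

```r
# Assumption: every place ID is a single digit (0-9), as in the sample data.
thc <- "01020"
stops <- as.integer(strsplit(thc, "")[[1]])

# Pair each stop with the one that follows it: one row per trip.
trips <- data.frame(
  from = head(stops, -1),
  to   = tail(stops, -1)
)
print(trips)
```

Each row of `trips` is one movement, which is the information an adjacency matrix counts.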

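The IDs in thc are also split into separate columns, thc1, thc2, thc3, thc4, thc5; the number of thc columns is set by the longest movement sequence in the household.
If one member's code has length 5 and another member's has length 3, the other member's thc4 and thc5 are filled with NA.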
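For reference, the NA padding described here could be reproduced from the thc strings roughly as follows. This is a sketch under the assumption of single-digit place IDs; `stops`, `width`, and `padded` are illustrative names, not columns of the real data.

```r
# Assumption: single-digit place IDs; the strings mirror the sample `thc` column.
thc <- c("010", "01010", "0", "02030")
stops <- lapply(strsplit(thc, ""), as.integer)

# The longest sequence in the household decides how many columns are needed.
width <- max(lengths(stops))

# `length<-` pads a shorter vector with NA up to the requested length.
padded <- t(vapply(stops, function(s) {
  length(s) <- width
  s
}, integer(width)))
colnames(padded) <- paste0("thc", seq_len(width))
print(padded)
```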

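mdc: a variable giving the attribute of the activity carried out at each place. For example, 1 = work, 2 = school. It is also split into the subsequent columns (mdc1 to mdc5).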

What I want to obtain is a list containing the adjacency matrix and node list used in network analysis (e.g., igraph), built from the information in df.

This is the desired result:

# Desired list
[1] # This is the first element, grouped by `hid`.
    # The actual data frame has around 40,000
    # households, each with a different `hid`.

$hid # `hid` of each record
[1]10001
[2]10001
[3]10001
[4]10001

$mid # `mid` of each record
[1]1
[2]2
[3]3
[4]4

$trip # `adjacency matrix` of each `mid`
      # head of line indicates destination area id
      # leftmost column indicates origin area id
      # for example of [1], 'mid'=1 took 1 trip from 0 to 1 and 1 trip from 1 to 0
[1] # It represents `mid`=1
  0 1
0 0 1
1 1 0
[2] # It represents `mid`=2 
  0 1
0 0 2
1 2 0
[3]
  0
0 0
[4]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0

$node # Attribute of each area defined in `mdc`
      # For instance, for `mid`=4 with mdc `02020`, s/he did activity `2`
      # in area ids `2` and `3`, as indicated in `thc` and `thc1-4`.
      # The number does not indicate "how many times s/he did an activity in the area"
      # but "what s/he did in the area"
area mdc1 mdc2 mdc3 mdc4
   0   0    0    0     0
   1   0    1   NA    NA
   2  NA   NA   NA     2
   3  NA   NA   NA     2

[2] # Next element continues same information of other hid
    # Thus, from `hid` to `mdc` are one set of attributes of one element

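With my current knowledge of lists and data transformation, converting df into the desired list is quite complicated. For example, to create the adjacency matrix, I need to look at each entry in thc (or thc1-5) together with the entries before and after it. For node, I also need to get the largest area id from mdc (or mdc1-5) and store the information.
I would appreciate any suggestions on how to get started.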

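I prefer to use the tidyverse, purrr, and related packages, but I have not used purrr for list operations. I have used the former for data manipulation but am not familiar with list operations.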

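After this step, I will visualize the movement and activity patterns of each household (not each member) in igraph or other packages (e.g., ggnetwork, networkD3) to find emerging patterns from the distribution of each pattern.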

Here are two helper functions that build the adjacency matrix and the activity matrix.

## Build the adjacency matrix (details in the comments)

build_adj_mat <- function(thc_) {
  # Split the code string into numeric area ids; as.character() handles both
  # factors (the default before R 4.0) and character strings (R >= 4.0)
  if (!is.numeric(thc_)) {
    thc_ <- as.numeric(unlist(strsplit(as.character(thc_), "")))
  }

  # Create a matrix with the correct dimensions, and give names
  mat <- matrix(0, nrow = max(thc_) + 1, ncol = max(thc_) + 1)
  rownames(mat) <- colnames(mat) <- 0:max(thc_)

  # Count one trip for each consecutive pair of areas;
  # seq_len() gives an empty loop when there is only one stop
  for (i in seq_len(length(thc_) - 1)) {
    from <- thc_[i] + 1
    to <- thc_[i + 1] + 1
    mat[from, to] <- mat[from, to] + 1
  }
  return(mat)
}
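
As an alternative to the loop in build_adj_mat, the same counts can be obtained by cross-tabulating origins against destinations with table(). The helper name `build_adj_mat_tbl` is hypothetical, and the sketch assumes single-digit place IDs with 0 (home) always being a valid area.

```r
# Hypothetical alternative helper; assumes single-digit place IDs.
build_adj_mat_tbl <- function(thc) {
  stops <- as.integer(strsplit(as.character(thc), "")[[1]])
  areas <- 0:max(stops)
  # Fixing the factor levels keeps unvisited areas as all-zero rows/columns.
  from <- factor(head(stops, -1), levels = areas)
  to   <- factor(tail(stops, -1), levels = areas)
  # table() counts each (from, to) pair; unclass() leaves a plain matrix.
  mat <- unclass(table(from, to))
  names(dimnames(mat)) <- NULL
  mat
}

print(build_adj_mat_tbl("02030"))
```

Rows are origins and columns are destinations, matching the layout of the loop-based version.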


## Build the activity matrix / node

build_node_df <- function(df_) {
  # Split a code string into numeric ids (works for factors and characters)
  split_code <- function(x) as.numeric(unlist(strsplit(as.character(x), "")))
  # Get the largest area id visited by anyone in the household
  max_len <- max(split_code(df_$thc))
  # Build one member's activity vector, indexed by area id
  build_act_mat <- function(loc_, act_, max_area = max_len) {
    loc_ <- split_code(loc_)
    act_ <- split_code(act_)
    area <- rep(NA, max_area + 1)
    for (i in seq_along(loc_)) {
      area[loc_[i] + 1] <- act_[i]
    }
    return(area)
  }
  # Call the function for each member; SIMPLIFY = FALSE keeps a list even
  # when every member stayed home (max_len = 0)
  out <- mapply(build_act_mat, df_$thc, df_$mdc,
                SIMPLIFY = FALSE, USE.NAMES = FALSE)
  # cbind the output with the areas
  out <- data.frame(area = 0:max_len, do.call(cbind, out))
  # Assign proper column names
  colnames(out) <- c("area", paste("mid_", df_$mid, sep = ""))
  return(out)
}

Then a function that applies these helpers to df and adds the hid and mid you asked for to the output:

build_list <- function(dfo) {
  hid_ <- as.numeric(as.character(dfo$hid))
  mid_ <- as.numeric(as.character(dfo$mid))
  trip_ <- lapply(dfo$thc, build_adj_mat)
  node_ <- build_node_df(dfo)

  return(list(
    hid = hid_,
    mid = mid_,
    trip = trip_,
    node = node_)
    )
}

Output:

> build_list(df)
$hid
[1] 10001 10001 10001 10001

$mid
[1] 1 2 3 4

$trip
$trip[[1]]
  0 1
0 0 1
1 1 0

$trip[[2]]
  0 1
0 0 2
1 2 0

$trip[[3]]
  0
0 0

$trip[[4]]
  0 1 2 3
0 0 0 1 1
1 0 0 0 0
2 1 0 0 0
3 1 0 0 0


$node
  area mid_1 mid_2 mid_3 mid_4
1    0     0     0     0     0
2    1     0     1    NA    NA
3    2    NA    NA    NA     2
4    3    NA    NA    NA     2
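
Since the goal is a household-level (not member-level) view, the per-member matrices in `$trip` could be summed into one matrix. The sketch below is not part of the answer's functions: `combine_trips` is a hypothetical helper, and it assumes that area IDs mean the same place for every member of a household.

```r
# Hypothetical helper: sum per-member adjacency matrices into one
# household-level matrix. Assumes area IDs are shared within a household.
combine_trips <- function(trips) {
  n <- max(vapply(trips, nrow, integer(1)))
  total <- matrix(0, n, n, dimnames = list(0:(n - 1), 0:(n - 1)))
  for (m in trips) {
    idx <- seq_len(nrow(m))
    # Smaller matrices are added into the top-left corner of the total.
    total[idx, idx] <- total[idx, idx] + m
  }
  total
}

# Two small member matrices shaped like the `$trip` elements above.
m1 <- matrix(c(0, 1, 1, 0), 2, dimnames = list(0:1, 0:1))
m4 <- matrix(0, 4, 4, dimnames = list(0:3, 0:3))
m4[cbind(c(1, 3, 1, 4), c(3, 1, 4, 1))] <- 1  # trips 0->2, 2->0, 0->3, 3->0

household <- combine_trips(list(m1, m4))
print(household)
```

If igraph is installed, a matrix like `household` can be passed to `igraph::graph_from_adjacency_matrix(household, mode = "directed", weighted = TRUE)` to obtain a weighted directed graph.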

I'm sure there is a way to make this work with dplyr, but it is probably easier to use split from base R. Using this slightly modified data frame:

df2 <- data.frame(
  hid = c("10001", "10002", "10002", "10003"),
  mid = c(1, 2, 3, 4),
  thc = c("010", "01010", "0", "02030"),
  mdc = c("000", "01010", "0", "02020")
)

Now split the new data frame into a list, one piece per hid, and use lapply to apply the build_list function to each piece:

split_df2 <- split(df2, df2$hid)
names(split_df2) <- paste("hid_", names(split_df2), sep = "")
lapply(split_df2, build_list)

Output:

$hid_10001
$hid_10001$hid
[1] 10001

$hid_10001$mid
[1] 1

$hid_10001$trip
$hid_10001$trip[[1]]
  0 1
0 0 1
1 1 0


$hid_10001$node
  area mid_1
1    0     0
2    1     0


$hid_10002
$hid_10002$hid
[1] 10002 10002

$hid_10002$mid
[1] 2 3

$hid_10002$trip
$hid_10002$trip[[1]]
  0 1
0 0 2
1 2 0
...
...

Hope this points you in the right direction!