How to transform a column with multiple numbers into a series of dummy variables?

Question

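I need to transform some data for import into a relational database. In the database entity, there is a series of 21 yes/no variables. In the current format, there is just one column containing a series of numbers separated by spaces, where each number corresponds to a "Yes" for that variable.

For example, the column might read "3 7 12 20", which corresponds to "Yes" for variables 3, 7, 12, and 20, and "No" for all the others.

I need to convert that column into dummy-variable format. I know I can use the "Text to Columns" tool in Excel to split the numbers in the column, but that is as far as I have gotten. How do I tell the software that a given number corresponds to a given column?

I would prefer to do this in Excel, but I am also becoming proficient in SQL and Stata.

Thanks!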

Answer 1

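Here is one way to do it in Excel. If the current data is in column A starting at A2, and the numbers 1 to 21 are in B1:V1, enter the formula below in B2 and fill down and to the right as needed: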

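=OR(NOT(ISERROR(FIND(" " & B$1 & " ",$A2))),LEFT($A2,LEN(B$1)+1)=TEXT(B$1,"@") & " ",RIGHT($A2,LEN(B$1)+1)=" " & TEXT(B$1,"@"),TRIM($A2)=TEXT(B$1,"@"))

This tests for one of four conditions:

The value we are looking for (that is, the value in row 1 of the relevant column), with a space on either side, can be found in the cell in column A (FIND(" " & B$1 & " ",$A2) is not an error); or
The value we are looking for, followed by a trailing space (TEXT(B$1,"@") & " "), is the first thing in the cell in column A (LEFT($A2,LEN(B$1)+1)); or
The value we are looking for, preceded by a leading space, is the last thing in the cell in column A; or
The value we are looking for is the only value in the cell in column A.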
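The four cases above are needed because a plain substring search would wrongly find "2" inside "12" or "20". As a rough illustration only, and in Python rather than Excel, the sketch below shows that padding the cell itself with a space on each side collapses the four cases into a single search. The function name `has_flag` is made up for this example.

```python
def has_flag(cell: str, n: int) -> bool:
    """Return True if the number n appears as a whole token in cell."""
    # Normalise whitespace, then pad the cell with one space on each side so
    # the first, last, and only-value cases all become " n " searches.
    padded = " " + " ".join(cell.split()) + " "
    return f" {n} " in padded


cell = "3 7 12 20"

# A naive substring test gives a false positive: "2" occurs inside "12".
naive_hit = "2" in cell

# One True/False entry per variable 1..21.
row = [has_flag(cell, n) for n in range(1, 22)]
```

The same padding trick should carry over to the worksheet as something like =ISNUMBER(FIND(" " & B$1 & " ", " " & TRIM($A2) & " ")), assuming the same layout as described above (data in column A, the numbers 1 to 21 in row 1).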

Answer 2

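Although you didn't mention it, I'd like to offer a solution in R. It assumes the source data is in the first column of the first sheet of currentFormat.xlsx, starting at row 2: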

# Load the package needed to read and write Excel files
library(xlsx)

# Read the first column of the first sheet, skipping the header row
input <- read.xlsx(file = "currentFormat.xlsx", sheetIndex = 1, startRow = 2,
                   header = FALSE, colIndex = 1, stringsAsFactors = FALSE)

# Number of individuals/observations/rows
N <- nrow(input)

# Prepare the output matrix: one row per observation, one column per variable
output <- matrix(0, ncol = 21, nrow = N)

# Get the 'Yes' variable numbers for each row. lapply always returns a list,
# whereas apply would collapse to a matrix when every row has the same count.
yes <- lapply(strsplit(trimws(as.character(input[[1]])), split = " ", fixed = TRUE), as.numeric)

# Fill the output matrix
for (i in seq_len(N)) {
  output[i, yes[[i]]] <- 1
}

# Write the output spreadsheet
write.xlsx(x = as.data.frame(output), file = "dummyData.xlsx", sheetName = "Output", row.names = TRUE)

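The code isn't very pretty, but it does what you asked for (I guess):

How to transform a column with multiple numbers into a series of dummy variables?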
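For readers who don't use R, the same split-and-fill idea can be sketched in plain Python. This is only an illustration under assumptions: the function name `to_dummies` and the sample cells are invented for the example, and reading from or writing to a spreadsheet is left out.

```python
def to_dummies(cells, n_vars=21):
    """Turn space-separated number strings into rows of 0/1 flags."""
    rows = []
    for cell in cells:
        row = [0] * n_vars
        # split() with no argument tolerates repeated or stray spaces
        for token in cell.split():
            k = int(token)
            if not 1 <= k <= n_vars:
                raise ValueError(f"variable number out of range: {k}")
            row[k - 1] = 1  # variable k lives at zero-based index k - 1
        rows.append(row)
    return rows


# Hypothetical sample data: one string per observation.
cells = ["3 7 12 20", "1", ""]
dummies = to_dummies(cells)
```

Each inner list has 21 entries, so it maps directly onto the 21 yes/no columns of the database entity; an empty cell yields a row of all zeros.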


database

excel

transformation