Disaggregate one row of data into multiple rows

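Good afternoon!

I have a problem with my dataset. I am analyzing a Google AdWords export, and I want to fit a logistic regression model to the data to determine whether an experiment I ran affects conversions.

The problem is that the data are aggregated, and logistic regression needs a binary dependent variable. So instead of a single data point with (for example) 10 impressions, 5 clicks, and 2 conversions, I want 10 data points, of which 5 were clicked and 2 converted.

So I want to go from a data frame that looks like this (very simplified):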

| Keyword      | Impressions | Clicks     | Conversions |
| SampleName   |      10     |      5     |     2       |

to this:

| Keyword      | Clicked     | Converted   |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |

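How can I do this for a very large dataset? I have looked everywhere but can't seem to find a solution. I would prefer to do this in R, but I also have Excel and Stata installed.

Thanks in advance!

EDIT: Here is some code for the data frame (extended with an extra row and column). I am quite new to R and to this platform. It may not be the cleanest way to code it, but here it is: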

Key <- c("Sample1", "Sample2")
Imp <- c(10, 6)
Cli <- c(5, 3)
Con <- c(2, 1)
CPC <- c(0.26, 0.15)
df1 <- data.frame(Key, Imp, Cli, Con, CPC)
colnames(df1) <- c("Keyword", "Impressions", "Clicks", "Conversions", "CostPerClick")

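Additionally, I am now running into the problem that the average cost per click needs to be repeated for each clicked row, because a price is paid for every click. So in the end I need a data frame that looks like this: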

| Keyword   | Clicked     | Converted   |     CPC     |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample2   |      1      |      1      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |

EDIT 2 (solved)

akrun's solution appears to be correct when tested on the sample dataset, but when I try it on my actual dataset it gives the following error:

> result <- setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
+  Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
+  CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) : 
  invalid 'times' argument

The keywords contain no duplicates and the data contain no NA values:

> length(unique(df1$Keyword))
[1] 186145
> nrow(df1)
[1] 186145
> nrow(df1[complete.cases(df1),]) == nrow(df1)
[1] TRUE

Summary of the data:

> summary(df1)
   Keyword           Impressions          Clicks        Conversions       CostPerClick  
 Length:186145      Min.   :   1.00   Min.   : 1.000   Min.   :0.00000   Min.   :0.010  
 Class :character   1st Qu.:   7.00   1st Qu.: 1.000   1st Qu.:0.00000   1st Qu.:0.130  
 Mode  :character   Median :  16.00   Median : 1.000   Median :0.00000   Median :0.210  
                    Mean   :  32.93   Mean   : 2.167   Mean   :0.03368   Mean   :0.246  
                    3rd Qu.:  39.00   3rd Qu.: 2.000   3rd Qu.:0.00000   3rd Qu.:0.320  
                    Max.   :1521.00   Max.   :91.000   Max.   :4.00000   Max.   :3.680 
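The summary above does not show whether every row is internally consistent. As a hedged sketch only: `rep()` is known to raise `invalid 'times' argument` when any entry of `times` is negative or `NA`, which would happen for a row where `Clicks` or `Conversions` exceeds `Impressions`. The thread does not say that this was the actual cause here. The `toy` data frame below is made up for illustration; the same check could be run on `df1`.

```r
# Toy data (hypothetical): the second row has more clicks than impressions
toy <- data.frame(
  Keyword     = c("ok", "bad"),
  Impressions = c(10, 3),
  Clicks      = c(5, 4),
  Conversions = c(2, 1)
)

# Flag rows whose counts would give rep() a negative or missing 'times' value
bad <- with(toy,
  is.na(Impressions) | is.na(Clicks) | is.na(Conversions) |
    Clicks < 0 | Conversions < 0 |
    Clicks > Impressions | Conversions > Impressions
)

# Show the offending rows, if any
toy[bad, ]
```

If this returns zero rows on the real data, the cause lies elsewhere.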

Try

library(data.table)
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))) , Keyword]
#       Keyword Clicked Converted
# 1: SampleName       1         1
# 2: SampleName       1         1
# 3: SampleName       1         0
# 4: SampleName       1         0
# 5: SampleName       1         0
# 6: SampleName       0         0
# 7: SampleName       0         0
# 8: SampleName       0         0
# 9: SampleName       0         0
#10: SampleName       0         0
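The call above relies on `rep()` accepting a `times` vector of the same length as its first argument, so each element is repeated its own number of times. A minimal illustration for a single keyword, using the numbers from the question (10 impressions, 5 clicks, 2 conversions); the variable names are for illustration only:

```r
# times has the same length as x, so each element gets its own repeat count
clicked   <- rep(c(1, 0), times = c(5, 10 - 5))  # 5 ones, then 5 zeros
converted <- rep(c(1, 0), times = c(2, 10 - 2))  # 2 ones, then 8 zeros

# Both vectors have one entry per impression, so they line up row by row
cbind(Clicked = clicked, Converted = converted)
```

Because the ones come first in both vectors, every converted row is also a clicked row, which matches the desired output.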

Update

Using the updated dataset from the OP's post:
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
 CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
#    Keyword Clicked Converted  CPC
# 1: Sample1       1         1 0.26
# 2: Sample1       1         1 0.26
# 3: Sample1       1         0 0.26
# 4: Sample1       1         0 0.26
# 5: Sample1       1         0 0.26
# 6: Sample1       0         0 0.00
# 7: Sample1       0         0 0.00
# 8: Sample1       0         0 0.00
# 9: Sample1       0         0 0.00
#10: Sample1       0         0 0.00
#11: Sample2       1         1 0.15
#12: Sample2       1         0 0.15
#13: Sample2       1         0 0.15
#14: Sample2       0         0 0.00
#15: Sample2       0         0 0.00
#16: Sample2       0         0 0.00
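For a table with many keywords, a fully vectorized base R variant might be worth trying, since it avoids one `rep()` call per group. This is a sketch under the same assumptions as above (non-negative counts, `Clicks` and `Conversions` no larger than `Impressions`); `expand_rows` is a hypothetical helper name, not part of the original answer or of any package.

```r
# Sample data from the question
df1 <- data.frame(
  Keyword      = c("Sample1", "Sample2"),
  Impressions  = c(10, 6),
  Clicks       = c(5, 3),
  Conversions  = c(2, 1),
  CostPerClick = c(0.26, 0.15)
)

# Hypothetical helper: one output row per impression
expand_rows <- function(df) {
  idx <- rep(seq_len(nrow(df)), times = df$Impressions)  # source row of each output row
  pos <- sequence(df$Impressions)                        # 1..Impressions within each keyword
  clicked <- as.integer(pos <= df$Clicks[idx])           # first Clicks rows are clicked
  data.frame(
    Keyword   = df$Keyword[idx],
    Clicked   = clicked,
    Converted = as.integer(pos <= df$Conversions[idx]),  # first Conversions rows convert
    CPC       = ifelse(clicked == 1L, df$CostPerClick[idx], 0)
  )
}

res <- expand_rows(df1)

# Sanity check: aggregating back should reproduce the original counts
aggregate(cbind(Clicked, Converted) ~ Keyword, data = res, FUN = sum)
```

Aggregating the result back by keyword is a cheap way to confirm that no clicks or conversions were lost in the expansion.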

Data

 df1 <- structure(list(Keyword = "SampleName", Impressions = 10L, 
 Clicks = 5L, 
 Conversions = 2L), .Names = c("Keyword", "Impressions", "Clicks", 
 "Conversions"), class = "data.frame", row.names = c(NA, -1L))