Disaggregate one row of data into multiple rows
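Good afternoon!
I am having some trouble with my dataset. I am analyzing a Google AdWords export and want to fit a logistic regression model to the data to determine whether an experiment I ran affects conversions.
The problem is that the data are aggregated, and to run a logistic regression the dependent variable needs to be binary. So instead of a single data point with (for example) 10 impressions, 5 clicks and 2 conversions, I want 10 data points, of which 5 are clicked and 2 are converted.
So I want to go from a data frame that looks like this (heavily simplified):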
| Keyword | Impressions | Clicks | Conversions |
| SampleName | 10 | 5 | 2 |
to this:
| Keyword | Clicked | Converted |
| SampleName | 1 | 1 |
| SampleName | 1 | 1 |
| SampleName | 1 | 0 |
| SampleName | 1 | 0 |
| SampleName | 1 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
| SampleName | 0 | 0 |
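How can I do this for a very large dataset? I have looked everywhere but can't seem to find a solution. I would prefer to do this in R, but I also have Excel and Stata installed.
Thanks in advance!
Edit
Here is some code for the data frame (extended with extra rows and columns). I am new to R and to this platform. This may not be the cleanest way to code it, but here it is: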
Key <- c("Sample1", "Sample2")
Imp <- c(10, 6)
Cli <- c(5, 3)
Con <- c(2, 1)
CPC <- c(0.26, 0.15)
df1 <- data.frame(Key, Imp, Cli, Con, CPC)
colnames(df1) <- c("Keyword", "Impressions", "Clicks", "Conversions", "CostPerClick")
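In addition, I have now run into the problem that the average cost per click needs to be repeated for every click, since each click has to be paid for. So in the end I need a data frame that looks like this: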
| Keyword | Clicked | Converted | CPC |
| Sample1 | 1 | 1 | 0.26 |
| Sample1 | 1 | 1 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 1 | 0 | 0.26 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample1 | 0 | 0 | 0.00 |
| Sample2 | 1 | 1 | 0.15 |
| Sample2 | 1 | 0 | 0.15 |
| Sample2 | 1 | 0 | 0.15 |
| Sample2 | 0 | 0 | 0.00 |
| Sample2 | 0 | 0 | 0.00 |
| Sample2 | 0 | 0 | 0.00 |
Edit 2 (solved)
akrun's solution appears to be correct when tested on the sample dataset, but when I try it on my actual dataset it gives the following error:
> result <- setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
+ Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)),
+ CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) :
invalid 'times' argument
The keywords contain no duplicates and the data contain no NAs:
> length(unique(df1$Keyword))
[1] 186145
> nrow(df1)
[1] 186145
> nrow(df1[complete.cases(df1),]) == nrow(df1)
[1] TRUE
Summary of the data:
> summary(df1)
Keyword Impressions Clicks Conversions CostPerClick
Length:186145 Min. : 1.00 Min. : 1.000 Min. :0.00000 Min. :0.010
Class :character 1st Qu.: 7.00 1st Qu.: 1.000 1st Qu.:0.00000 1st Qu.:0.130
Mode :character Median : 16.00 Median : 1.000 Median :0.00000 Median :0.210
Mean : 32.93 Mean : 2.167 Mean :0.03368 Mean :0.246
3rd Qu.: 39.00 3rd Qu.: 2.000 3rd Qu.:0.00000 3rd Qu.:0.320
Max. :1521.00 Max. :91.000 Max. :4.00000 Max. :3.680
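The post does not say what caused this error, so the following is only a guess at one possible cause. `rep()` raises `invalid 'times' argument` when a count is negative, which would happen for any keyword where `Clicks` or `Conversions` exceeds `Impressions`. A minimal sketch of a check for such rows, using a hypothetical stand-in data frame `df_check` (not the real export):

```r
# Hypothetical stand-in for the real export. The second row has more clicks
# than impressions; that such rows exist in the real data is an assumption.
df_check <- data.frame(
  Keyword     = c("Sample1", "Sample2", "Sample3"),
  Impressions = c(10, 2, 6),
  Clicks      = c(5, 3, 3),
  Conversions = c(2, 1, 1)
)

# rep() rejects negative counts, so flag rows where
# Impressions - Clicks or Impressions - Conversions would be negative
bad <- with(df_check, Clicks > Impressions | Conversions > Impressions)

# Inspect the offending rows before deciding how to handle them
df_check[bad, ]
```

If rows like this turn up, whether to drop them or correct the counts depends on how the export was produced.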
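Try
library(data.table)
# Group by Keyword and emit one row per impression:
# 1s for the clicked / converted impressions, 0s for the remaining ones
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
                  Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))), Keyword]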
# Keyword Clicked Converted
# 1: SampleName 1 1
# 2: SampleName 1 1
# 3: SampleName 1 0
# 4: SampleName 1 0
# 5: SampleName 1 0
# 6: SampleName 0 0
# 7: SampleName 0 0
# 8: SampleName 0 0
# 9: SampleName 0 0
#10: SampleName 0 0
Update
Using the updated dataset from the OP's post
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)),
CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
# Keyword Clicked Converted CPC
# 1: Sample1 1 1 0.26
# 2: Sample1 1 1 0.26
# 3: Sample1 1 0 0.26
# 4: Sample1 1 0 0.26
# 5: Sample1 1 0 0.26
# 6: Sample1 0 0 0.00
# 7: Sample1 0 0 0.00
# 8: Sample1 0 0 0.00
# 9: Sample1 0 0 0.00
#10: Sample1 0 0 0.00
#11: Sample2 1 1 0.15
#12: Sample2 1 0 0.15
#13: Sample2 1 0 0.15
#14: Sample2 0 0 0.00
#15: Sample2 0 0 0.00
#16: Sample2 0 0 0.00
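As an alternative sketch, not part of the original answer, the same expansion can be written in base R without grouping, by repeating row indices and comparing each impression's position within its keyword against the click and conversion counts. The function name `expand_rows` and the data frame `df_in` are hypothetical, and the sketch assumes non-negative counts with `Conversions <= Clicks`:

```r
# Hypothetical helper: one output row per impression, built with vectorized base R
expand_rows <- function(df) {
  # Repeat each input row index once per impression
  idx <- rep(seq_len(nrow(df)), times = df$Impressions)
  # Position of each impression within its keyword: 1, 2, ..., Impressions
  pos <- sequence(df$Impressions)
  # The first `Clicks` impressions count as clicked, the first `Conversions` as converted
  clicked   <- as.integer(pos <= df$Clicks[idx])
  converted <- as.integer(pos <= df$Conversions[idx])
  data.frame(
    Keyword   = df$Keyword[idx],
    Clicked   = clicked,
    Converted = converted,
    # Cost applies only to clicked impressions
    CPC       = ifelse(clicked == 1L, df$CostPerClick[idx], 0)
  )
}

# Same values as the OP's example data
df_in <- data.frame(
  Keyword      = c("Sample1", "Sample2"),
  Impressions  = c(10, 6),
  Clicks       = c(5, 3),
  Conversions  = c(2, 1),
  CostPerClick = c(0.26, 0.15)
)
out <- expand_rows(df_in)
```

Unlike the `rep(c(1,0), ...)` form, this version does not stop when `Clicks` exceeds `Impressions`. It silently caps the number of 1s at the number of impressions, so it is worth validating the counts first.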
Data
df1 <- structure(list(Keyword = "SampleName", Impressions = 10L,
Clicks = 5L,
Conversions = 2L), .Names = c("Keyword", "Impressions", "Clicks",
"Conversions"), class = "data.frame", row.names = c(NA, -1L))