Combine a mass data export in R without cutting and pasting rows one by one
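I have more than 40,000 observations in a dataset with over 250 variables on companies and various counts relating to meetings, attendees, representatives, number of representatives, and so on.
Using R, I created a new dataset containing only the company name and the four variables I am interested in, whose descriptive statistics I want to export to Excel: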
Subset.MergedEx.SO <- mergedex1.SO[, c(10, 72, 73, 120, 121 )]
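The column numbers correspond to the following column names:
mergedex1.SO <- c("sn", "earntot", "earnctot", "meeting65",
"meeting55")
"sn" is the company name; the remaining variables are various meeting measures, durations, attendance counts, and so on.
I then split this 40,000-observation dataset, now with five variables instead of the original 250, into one subset per company.
The code is as follows: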
BroomeStreet <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Broome Street'),]
CompanyA <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Company A'),]
CompanyB <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Company B'),]
CompanyC <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Company C'),]
CompanyBC <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Company BC'),]
CompanyCC <- Subset.MergedEx.SO[ which(Subset.MergedEx.SO$sn=='Company CC'),]
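And so on, for more than 45 companies.
[Later I will create subsets by company name and date, from 1965 to 1987, which is why I am asking the whole question for this isolated instance, in which the date is irrelevant to the companies involved.]
My task is to extract descriptive statistics for every variable after the "sn" column. I am looking for the mean, standard deviation, minimum, maximum, and number of observations of "earntot"; the same five statistics for "earnctot"; and the same descriptive statistics for "meeting55" and "meeting65".
I was able to do this with the following code and a specific function: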
EarntotCompanyA <-CompanyA$earntot
EarnctotCompanyA <-CompanyA$earnctot
meet55CompanyA<-CompanyA$meet55
meet65CompanyA <-CompanyA$meet65
CompanyA_ALL_INFORMATION<-cbind(EarntotCompanyA,EarnctotCompanyA,
meet55CompanyA,meet65CompanyA)
library(psych)
info<-describe(CompanyA_ALL_INFORMATION)
n<-info[,2] # vector of total number
mean<-info[,3] # vector of mean
sd<-info[,4] # vector of sd
min<-info[,8] # vector of min
max<-info[,9] # vector of max
#this is ordered by the naming function below
value<-round(c(mean,sd,min,max,n),2)
col.names<-naming(CompanyA_ALL_INFORMATION)
descriptives<-t(as.data.frame(value))
colnames(descriptives)<-col.names
rownames(descriptives)<-"Company A"
library(xlsx)
write.xlsx(descriptives, "descriptives.CompanyA.xlsx")
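After doing this, I get a single row in Excel containing the information I need.
I then repeat exactly the same steps for a different company to get another separate file, e.g. "descriptive.CompanyB.xlsx", "descriptives.CompanyC.xlsx", ....
Finally I cut and paste the row from each of the 50+ open Excel windows and combine them in another, separate Excel window that contains all the information I want.
A single row looks like this: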
average.number.of.EarntotCompanyA average.number.of.EarnctotCompanyA
average.number.of.meet55CompanyA average.number.of.meet65CompanyA standard.deviation.of.EarntotCompanyA standard.deviation.of.EarnctotCompanyA standard.deviation.of.meet55CompanyA standard.deviation.of.meet65CompanyA min.number.of.EarntotCompanyA min.number.of.EarnctotCompanyA min.number.of.meet55CompanyA min.number.of.meet65CompanyA max.number.of.EarntotCompanyA max.number.of.EarnctotCompanyA max.number.of.meet55CompanyA max.number.of.meet65CompanyA total.number.of.EarntotCompanyA total.number.of.EarnctotCompanyA total.number.of.meet55CompanyA total.number.of.meet65CompanyA
Company A 16.58 22.91 1 1.85 15.68 16.81 1.75 2.34 0
0 0 0 84.11 92.11 5 9 176 176 69 229
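How can I get all the rows into a single file, without producing each row separately and then cutting and pasting it from its own Excel file into a combined file? I already have more than 50 Excel files open in the background, each containing exactly the information I need, but I can only work with one at a time.
Below is a reproducible data sample: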
> dput((head(Subset.MergedEx.SO, 120)))
structure(list(sn = structure(c(2L, 2L, 3L, 5L, 2L, 7L, 1L, 9L,
1L, 9L, NA, 9L, 1L, 26L, 11L, 9L, 7L, NA, NA, 7L, 17L, 9L, NA,
21L, 7L, 17L, 7L, 7L, 16L, 7L, 7L, 7L, 7L, 26L, 7L, 6L, 26L,
22L, NA, NA, 11L, 23L, 23L, 26L, NA, 7L, 23L, 1L, NA, 1L, 7L,
11L, 12L, 13L, 9L, NA, 15L, NA, 20L, 15L, NA, 17L, 5L, NA, 22L,
15L, NA, NA, 5L, 8L, 32L, 29L, 23L, 33L, 1L, 23L, 14L, 6L, 7L,
15L, 15L, 29L, NA, 21L, 6L, 35L, 32L, 32L, 7L, 31L, 23L, 23L,
1L, 29L, 34L, 34L, 34L, 17L, 24L, 24L, 24L, 24L, 7L, 16L, 7L,
23L, 23L, 34L, 29L, 15L, NA, 35L, 24L, 27L, 33L, 35L, 10L, 34L,
33L, 34L), .Label = c("Broome Street", "Company A", "Company B",
"Company BC", "Company C", "Company CC", "Company D Clinton",
"Company DD", "Company E", "Company ED BroadCompany", "Company G",
"Company H BroadCompany", "Company I BroadCompany", "Company I Studio",
"Company J", "Company K", "Company L", "Company M", "Company M BroadCompany",
"Company M HS BroadCompany", "Company MCC BroadCompany", "Company N",
"Company P", "Company Q", "Company Q Company N", "Company Q Company ZZ",
"Company R - Company ZZ", "Company SLab", "Company Z", "Company ZE",
"Company ZED", "Company ZEQ", "Company ZZ", "Company ZZQ", "Company ZZQ Company N"), class = "factor"), earntot = c(21.85, 20.8, NA, 8.16, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 43.32, NA, 30.48, NA, NA, 34.9, NA, NA, NA,
NA, NA, 25.82, 40.75, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0,
NA, NA, NA, 30, NA, NA, NA, NA, NA, NA, 39.1, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 52.29, 44.32, NA, 7, 38.32, 0, NA, NA,
8.25, NA, NA, NA, NA, NA, 51.12, 39.9, NA, 37.48, 32.74, NA,
NA, NA, 33.4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 30.82,
NA, NA, NA, NA, NA, 5.74, NA, NA, NA, NA, NA, NA, NA, NA, 44.48,
NA), earnctot = c(29.43, 20.8, NA, 8.16, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 7.16, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 49.9, NA, 37.56, NA, NA, 41.98, NA, NA, NA, NA, NA, 37.32,
49, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, 37, NA,
NA, NA, NA, NA, NA, 47.68, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 57.29, 48.48, NA, 7, 45.9, 0, NA, NA, 15.75, NA, NA, NA,
NA, NA, 54.12, 46.65, NA, 45.56, 39.9, NA, NA, NA, 39.98, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 38.4, NA, NA, NA, NA,
NA, 12.9, NA, NA, NA, NA, NA, NA, NA, NA, 52.06, NA), meet55 = c(0L,
0L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 5L, NA, 0L, NA, 5L, NA,
NA, NA, 0L, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 0L, NA, NA, NA, NA, 5L, NA, NA, NA, NA, 4L, 0L, NA,
NA, NA, 4L, 4L, NA, NA, NA, NA, NA, NA, 0L, NA, NA, NA, NA, 1L,
NA, NA, NA, NA, 1L, NA, NA, 0L, 4L, 0L, NA, NA, 0L, NA, NA, NA,
NA, NA, 4L, 3L, 5L, NA, NA, NA, 1L, NA, 0L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 5L, NA, NA, NA, NA, NA, 0L, NA, 0L, NA,
NA, NA, NA, NA, NA, NA, NA), meet65 = c(0L, 0L, 5L, 0L, 6L, NA,
0L, 5L, NA, 5L, NA, 6L, NA, 0L, 5L, 2L, NA, NA, NA, 0L, 5L, 5L,
NA, NA, NA, 0L, NA, 1L, 4L, 7L, 5L, 5L, 7L, 0L, 5L, NA, 0L, 1L,
NA, NA, NA, 2L, 0L, 6L, NA, 8L, 2L, 0L, NA, 4L, 0L, 1L, 3L, NA,
NA, NA, NA, NA, 4L, 0L, NA, 5L, 7L, NA, 0L, NA, NA, NA, 5L, 0L,
5L, 4L, 0L, 2L, 0L, 0L, 7L, 0L, NA, 5L, NA, 8L, NA, 0L, 1L, 7L,
0L, 4L, 7L, 0L, 3L, 0L, NA, NA, 7L, 5L, 8L, 5L, 5L, 6L, 5L, 6L,
5L, 2L, 0L, 8L, 7L, 7L, 5L, 0L, NA, 0L, 6L, NA, 8L, 8L, 5L, 7L,
7L, 6L)), .Names = c("sn", "earntot", "earnctot", "meet55", "meet65"
), row.names = c(NA, 120L), class = "data.frame")
I suggest:
# install.packages("dplyr") # uncomment and run if you have to
library(dplyr)
Subset.MergedEx.SO %>% group_by(sn) %>%
summarise_each(funs(n(), mean(., na.rm = TRUE), sd(., na.rm = TRUE), min(., na.rm = TRUE), max(., na.rm = TRUE))) %>%
write.csv2(tf <<- tempfile(fileext = ".csv"))
cat(tf) # open that file in excel
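In recent dplyr releases summarise_each() and funs() are deprecated, so the pipeline above may warn or fail depending on the installed version. A minimal sketch of the same idea with across() follows. The small demo data frame is a made-up stand-in for Subset.MergedEx.SO, and counting n as the number of non-missing values is an assumption based on the per-variable counts in the question's sample row.

```r
library(dplyr)

# Hypothetical stand-in for Subset.MergedEx.SO (same column names, invented rows)
demo <- data.frame(
  sn       = c("Company A", "Company A", "Company B", "Company B", NA),
  earntot  = c(21.85, 20.80, NA, 8.16, 7.16),
  earnctot = c(29.43, 20.80, NA, 8.16, 7.16),
  meet55   = c(0L, 0L, NA, 5L, 1L),
  meet65   = c(0L, 0L, 5L, 0L, 6L)
)

descriptives <- demo %>%
  filter(!is.na(sn)) %>%                 # drop rows without a company name
  group_by(sn) %>%
  summarise(
    across(
      c(earntot, earnctot, meet55, meet65),
      list(
        n    = ~ sum(!is.na(.x)),        # assumption: n = non-missing observations
        mean = ~ mean(.x, na.rm = TRUE),
        sd   = ~ sd(.x, na.rm = TRUE),
        min  = ~ min(.x, na.rm = TRUE),
        max  = ~ max(.x, na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )

# One row per company, columns named like earntot_n, earntot_mean, ...
out_file <- tempfile(fileext = ".csv")
write.csv2(descriptives, out_file, row.names = FALSE)
```

If a company has no non-missing values for a variable, min() and max() return Inf and -Inf with a warning, so such cells may need cleaning before the export.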
You may need to adjust write.csv2 depending on your Excel/OS configuration (i.e. use write.csv, or write.table with sep="\t").
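If installing a package is not an option, the same one-row-per-company table can be sketched in base R with split() and do.call(rbind, ...). The demo data frame and the helper five_stats() are hypothetical names introduced for illustration only; they are not part of the question's code.

```r
# Hypothetical stand-in for Subset.MergedEx.SO (same column names, invented rows)
demo <- data.frame(
  sn       = c("Company A", "Company A", "Company B", "Company B", NA),
  earntot  = c(21.85, 20.80, NA, 8.16, 7.16),
  earnctot = c(29.43, 20.80, NA, 8.16, 7.16),
  meet55   = c(0L, 0L, NA, 5L, 1L),
  meet65   = c(0L, 0L, 5L, 0L, 6L)
)

# Five statistics for one numeric vector, computed on non-missing values only
five_stats <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) {
    return(c(n = 0, mean = NA, sd = NA, min = NA, max = NA))
  }
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

vars <- c("earntot", "earnctot", "meet55", "meet65")

# One data frame per company; rows whose sn is NA are dropped by split()
by_company <- split(demo[vars], demo$sn)

# One named vector of 20 statistics per company (earntot.n, earntot.mean, ...)
rows <- lapply(by_company, function(d) unlist(lapply(d, five_stats)))

# Stack the vectors into a matrix with one row per company
descriptives <- round(do.call(rbind, rows), 2)

out_file <- tempfile(fileext = ".csv")
write.csv(descriptives, out_file)
```

Because every company ends up in the same matrix, a single write call replaces the 50+ separate Excel files and the manual cut and paste.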