Complex subsetting data set to data frame
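1) I want to subset the data set linked below in GNU R so that the resulting data frame contains only Brazil, Time, and all Series.Name entries related to income share (such as "Income share held by lowest 10%", "Income share held by lowest 20%", and so on). In total there are 7 series names about income share.
I tried the following command, but could not subset on more than one "Series.Name":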
test <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time"))
2) In a second step, I want to delete all rows that contain NA values.
The complete code I am using is as follows:
WDI <- read.csv("https://dl.dropboxusercontent.com/u/109495328/WDI_Data_final.csv", na.strings = "..")
library(reshape)
library(reshape2)
WDI <- rename(WDI, (c(Argentina..ARG.="Argentina", Brazil..BRA.="Brazil", Canada..CAN.="Canada", Chile..CHL.="Chile", Colombia..COL.="Colombia", Mexico..MEX.="Mexico", United.States..USA.="USA", Venezuela..RB..VEN.="Venezuela")))
income_brazil_long <- melt(subset(WDI, subset = Series.Name == "Income share held by lowest 10%", select = -c(Time.Code, Series.Code, Argentina, Canada, Chile, Colombia, Mexico, USA, Venezuela)), id.vars = c("Series.Name", "Time"))
Well, you can do what you want with base functions.
WDI <- read.csv("WDI_Data_final.csv", header = TRUE, na.strings = "..")
# The colnames are strange from the file so reset for clarity
colnames(WDI) <- c("Series.Name", "Series.Code", "Time", "Time.Code", "Argentina",
                   "Brazil", "Canada", "Chile", "Colombia", "Mexico",
                   "USA", "Venezuela")
# do the subsetting
test <- with(WDI,
             WDI[Series.Name == "Income share held by lowest 10%",
                 c("Brazil", "Time", "Series.Name")])
# if you want more, use %in% and specify the Series.Names you care about
test <- with(WDI,
             WDI[Series.Name %in% c("Income share held by lowest 10%",
                                    "Income share held by lowest 20%"),
                 c("Brazil", "Time", "Series.Name")])
# if you want all the 'income shares', the grepl solution by
# Ananda is the most concise.
# you can then use reshape2::melt
melted_test <- melt(test, id.vars = c("Series.Name", "Time"))
To remove the NAs, just use complete.cases:
test[complete.cases(test),]
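To see what complete.cases is doing here, a minimal sketch on a made-up data frame may help. The values in `toy` are hypothetical and are not taken from the WDI file; only the column names mirror the ones used above.

```r
# Toy data frame (hypothetical values, not from the WDI file)
toy <- data.frame(
  Series.Name = c("Income share held by lowest 10%",
                  "Income share held by lowest 10%",
                  "Income share held by lowest 20%"),
  Time   = c(1981, 1982, 1981),
  Brazil = c(0.9, NA, 2.5)
)

# complete.cases() returns one logical per row: TRUE if the row has no NA
keep <- complete.cases(toy)

# Index with that vector to keep only the complete rows
toy_complete <- toy[keep, ]
```

The row with the missing Brazil value is dropped, and the other two rows are kept unchanged.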
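Looking at your data, this may actually be easiest to subset with the help of grepl.
We use grepl to search the "Series.Name" column for any rows containing the string "Income share held". This creates a logical vector indicating the rows we want. The columns we want are the first, third, and sixth.
Wrap it all in na.omit to remove any rows with NA values.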
WDI_Brazil <- na.omit(WDI[grepl("Income share held", WDI$Series.Name),
c(1, 3, 6)])
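The same idea can be tried on a small made-up data frame. This is only a sketch: the values in `toy` are hypothetical, and selecting columns by name instead of by position is my own variation (it assumes the columns have already been renamed to "Brazil" and so on), not part of the line above.

```r
# Toy data frame (hypothetical values, not from the WDI file)
toy <- data.frame(
  Series.Name = c("Income share held by lowest 10%",
                  "GDP growth (annual %)",
                  "Income share held by highest 10%",
                  "Income share held by third 20%"),
  Time   = c(1981, 1981, 1981, 1982),
  Brazil = c(0.9, 4.4, 45.1, NA),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where Series.Name contains the literal string
rows <- grepl("Income share held", toy$Series.Name, fixed = TRUE)

# Keep the matching rows and the wanted columns, then drop rows with NA
toy_income <- na.omit(toy[rows, c("Series.Name", "Time", "Brazil")])
```

Selecting by name keeps working if the column order of the file changes, which positional indices such as c(1, 3, 6) do not.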
The data are already "long", so there is no need to melt. What does the data.frame look like?
summary(WDI_Brazil)
# Series.Name Time Brazil..BRA.
# Income share held by fourth 20% :28 Min. :1981 Min. : 0.600
# Income share held by highest 10%:28 1st Qu.:1988 1st Qu.: 2.895
# Income share held by highest 20%:28 Median :1996 Median :10.320
# Income share held by lowest 10% :28 Mean :1996 Mean :20.948
# Income share held by lowest 20% :28 3rd Qu.:2004 3rd Qu.:43.797
# Income share held by second 20% :28 Max. :2012 Max. :67.310
# (Other) :28
table(droplevels(WDI_Brazil$Series.Name))
#
# Income share held by fourth 20% Income share held by highest 10% Income share held by highest 20%
# 28 28 28
# Income share held by lowest 10% Income share held by lowest 20% Income share held by second 20%
# 28 28 28
# Income share held by third 20%
# 28
Note that, as expected, there are seven factor levels in "Series.Name".
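The droplevels call matters because subsetting a factor keeps the levels that no longer occur. A small sketch with a made-up factor `f` (not the real column) shows the difference:

```r
# Hypothetical factor with one level we will filter out
f <- factor(c("Income share held by lowest 10%",
              "GDP growth (annual %)",
              "Income share held by lowest 20%"))

# Subsetting keeps the unused level attached to the factor
sub <- f[grepl("Income share held", f, fixed = TRUE)]

nlevels(sub)              # 3: the filtered-out level is still counted
nlevels(droplevels(sub))  # 2: only levels that actually occur
```

One assumption worth flagging: this only applies when "Series.Name" is a factor. Since R 4.0.0, read.csv no longer converts strings to factors by default, so on a current R the column would be character unless stringsAsFactors = TRUE is passed.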