Help needed with data cleaning using R

   "id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,

I need to reformat it as follows.

id gender age category            rank
1 Male    22  movies               1
1 Male    22  music                2
1 Male    22  travel               3
1 Male    22  cloths               4
1 Male    22  grocery              5
1 Male    22  books                NA
1 Male    22  rent                 NA
1 Male    22  fuel                 NA
1 Male    22  utility              NA
1 Male    22  online-shopping      NA

My attempt so far is below.

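library(reshape2)  # melt()
library(sqldf)     # sqldf()

mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]  # drop the header row, which was read in as data
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')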

Now I would like to know the best way to fill in all the missing categories for each user. Any help in this regard is appreciated.

library(reshape2)
library(tidyr)

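mdf <- melt(df, c("id","gender","age"))
# Early tidyr releases grouped columns with c(id, gender, age);
# current releases use nesting(id, gender, age) for the same purpose.
complete(na.omit(mdf), nesting(id, gender, age), value)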
# Source: local data frame [50 x 5]
# 
# id gender   age           value  variable
# (int) (fctr) (int)           (chr)    (fctr)
# 1      1   Male    22           books        NA
# 2      1   Male    22          cloths category4
# 3      1   Male    22            fuel        NA
# 4      1   Male    22         grocery category5
# 5      1   Male    22          movies category1
# 6      1   Male    22           music category2
# 7      1   Male    22 online-shopping        NA
# 8      1   Male    22            rent        NA
# 9      1   Male    22          travel category3
# 10     1   Male    22          utiliy        NA
# ..   ...    ...   ...             ...       ...

Explanation

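We can first melt the data.frame, specifying the id columns. Next, newer versions of tidyr have a helper function, complete, that expands the data so that every user gets a row for every category, as in your desired output. Categories a user did not pick are filled with NA.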
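To get a numeric rank column like the one in the question, the slot number can be taken from the column name before completing. The following is a minimal sketch, not the answer's exact code: it swaps in tidyr's pivot_longer for reshape2's melt and assumes a recent tidyr release. The small df and the names long and result are made up for illustration.

```r
library(tidyr)

# Small stand-in for the question's data (two users, three category slots)
df <- data.frame(
  id = c(1, 2),
  gender = c("Male", "Male"),
  age = c(22, 28),
  category1 = c("movies", "travel"),
  category2 = c("music", "books"),
  category3 = c("travel", NA),
  stringsAsFactors = FALSE
)

# Wide to long: one row per filled slot; the slot number becomes the rank
long <- pivot_longer(df, cols = -c(id, gender, age),
                     names_to = "rank", names_prefix = "category",
                     values_to = "category", values_drop_na = TRUE)
long$rank <- as.integer(long$rank)

# Add a row for every category a user did not pick; its rank stays NA
result <- complete(long, nesting(id, gender, age), category)

# Ranked categories first, unpicked ones (NA rank) last within each user
result <- as.data.frame(result[order(result$id, result$rank), ])
print(result)
```

Sorting by rank puts the NA rows at the end of each user's block because order() places NA last by default.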

Data

df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""
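The last line above marks empty strings as missing after the data has been read. As a possible alternative, here is a minimal sketch that does the same at read time through read.csv's na.strings argument. The two-row csv_text and the name df_small are stand-ins, not part of the original data.

```r
# Two-row stand-in for the question's CSV
csv_text <- '"id","gender","age","category1","category2"
1,"Male",22,"movies",
2,"Male",28,,'

# na.strings = "" asks read.csv to treat empty fields as NA while reading
df_small <- read.csv(text = csv_text, na.strings = "")
print(is.na(df_small))
```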

Consider using the base function reshape, since this is a routine example of reshaping/pivoting a dataset from wide to long:

reshapedf <- reshape(df, varying = c(4:13), 
                     v.names = c("category"),
                     timevar=c("rank"), 
                     times = c(1:10),
                     idvar = c("id", "gender", "age"), 
                     new.row.names = 1:1000,
                     direction = "long")

# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)

Output

        id      gender      age     rank    category
1       1       Male        22      1       movies
2       1       Male        22      2       music
3       1       Male        22      3       travel
4       1       Male        22      4       cloths
5       1       Male        22      5       grocery
6       1       Male        22      6       NA
7       1       Male        22      7       NA
8       1       Male        22      8       NA
9       1       Male        22      9       NA
10      1       Male        22      10      NA
...
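
The output above keeps one row per rank slot, so unused slots show an NA category. The question asks for the reverse: one row per category, with an NA rank where the user did not pick it. Below is a minimal base R sketch of that last step, assuming a long data frame shaped like reshapedf. The names long_df, picked, users, grid and full are hypothetical, and the data is a small stand-in.

```r
# Stand-in for the long result: two users, three rank slots each
long_df <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2),
  gender   = "Male",
  age      = c(22, 22, 22, 28, 28, 28),
  rank     = c(1, 2, 3, 1, 2, 3),
  category = c("movies", "music", "travel", "travel", "books", NA),
  stringsAsFactors = FALSE
)

# Keep only the slots that were actually filled
picked <- long_df[!is.na(long_df$category), ]

# Every (user, category) pair: by = NULL makes merge return a Cartesian product
users <- unique(long_df[, c("id", "gender", "age")])
grid  <- merge(users, data.frame(category = unique(picked$category)), by = NULL)

# Left join: pairs with no match in `picked` get rank NA
full <- merge(grid, picked, all.x = TRUE)
full <- full[order(full$id, full$rank), ]
row.names(full) <- NULL
print(full)
```

The same steps should apply to reshapedf itself in place of long_df, since it has the same columns.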