How to associate more than one factor level of a variable with the same entry in a data frame?

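This is my first question to the Stack Overflow community. First, many thanks for all the answers I have managed to find here over the past 5 years. You have all been very helpful, but this time I can't find an answer.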

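So, here is my situation. In a larger data frame, one variable is giving me trouble: the weather. It is made up of factors that describe the weather, such as "Rainy", "Cloudy", "Sunny", etc. My problem is that some entries are defined by more than one factor (e.g. "rainy,foggy"). R therefore treats each combination as a new, separate factor level, which is not what I want.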

Here is a sample of the data frame:

df <- read.table(text =
'"Date.Time","Year","Month","Day","Weekday","Hour","Temperature","Rel.humidity","Wind.dir","Wind.dir2","Wind.speed","Atm.pressure","Weather"
2015-04-01 00:00:00,"2015","4","1","Wednesday","00:00",-3.4,44,30,"NW",10,100.83,"Clear"
2015-04-02 23:00:00,"2015","4","2","Thursday","23:00",3.4,94,36,"N",2,99.8,"Rain,Fog"
2015-05-11 12:00:00,"2015","5","11","Monday","12:00",9.5,93,3,"NE",27,101.5,"Mist,Shower,Fog"',
header = TRUE, stringsAsFactors = FALSE, sep = ",")

My ultimate goal is to be able to, for example, select only the entries tagged Fog, including those that have both Rain and Fog.

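My idea for a solution was to split the character strings and insert the result as a list into the Weather variable, but I haven't managed to do that, and perhaps there is a simpler, more advanced way. Here is my naive attempt: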

for (i in dim(df)[1]){
  df[i,] <- as.factor(list(strsplit(dda[i,], ",")))
}

tl;dr: I want to turn one factor such as "A,B,C" into multiple factors "A", "B", "C" within the same element (same column, same row of the data frame).

Thanks in advance for your time, and feel free to comment on the formatting of my question.

df <- read.table(text =
'"Date.Time","Year","Month","Day","Weekday","Hour","Temperature","Rel.humidity","Wind.dir","Wind.dir2","Wind.speed","Atm.pressure","Weather"
2015-04-01 00:00:00,"2015","4","1","Wednesday","00:00",-3.4,44,30,"NW",10,100.83,"Clear"
2015-04-02 23:00:00,"2015","4","2","Thursday","23:00",3.4,94,36,"N",2,99.8,"Rain,Fog"
2015-05-11 12:00:00,"2015","5","11","Monday","12:00",9.5,93,3,"NE",27,101.5,"Mist,Shower,Fog"',
header = TRUE, stringsAsFactors = FALSE, sep = ",")

Fixing your for loop:

df[["Weather_split"]] <- as.list(rep(NA, nrow(df)))
for (i in seq_len(nrow(df))) {
  df[["Weather_split"]][[i]] <- strsplit(df[["Weather"]][[i]], ",")[[1]]
}

The same thing, more simply:

df[["Weather_split"]] <- strsplit(df[["Weather"]], ",")
str(df$Weather)
# chr [1:3] "Clear" "Rain,Fog" "Mist,Shower,Fog"
str(df$Weather_split)
# List of 3
#  $ : chr "Clear"
#  $ : chr [1:2] "Rain" "Fog"
#  $ : chr [1:3] "Mist" "Shower" "Fog"
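
With the list column in place, the goal stated in the question (selecting every entry tagged Fog) can be reached directly. The following is a minimal sketch, assuming a cut-down data frame that holds only the Weather column; `has_fog` and `fog_rows` are names introduced here for illustration.

```r
# Cut-down stand-in for the question's data: only the Weather column
df <- data.frame(
  Weather = c("Clear", "Rain,Fog", "Mist,Shower,Fog"),
  stringsAsFactors = FALSE
)
df[["Weather_split"]] <- strsplit(df[["Weather"]], ",")

# One logical per row: TRUE when that row's split vector contains "Fog"
has_fog <- vapply(df[["Weather_split"]], function(w) "Fog" %in% w, logical(1))

# Keep only the matching rows (hypothetical helper name)
fog_rows <- df[has_fog, ]
fog_rows$Weather
```

`vapply` is used rather than `sapply` so that the result is guaranteed to be a logical vector with one element per row.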

Taking @Stephen Henderson's idea one step further:

Weather_levels <- unique(unlist(df[["Weather_split"]]))
for (lvl in Weather_levels) {
  df[[lvl]] <- unlist(lapply(df$Weather_split, "%in%", x = lvl))
}

df
#             Date.Time Year Month Day   Weekday  Hour Temperature Rel.humidity Wind.dir Wind.dir2 Wind.speed Atm.pressure         Weather     Weather_split Clear  Rain   Fog  Mist Shower
# 1 2015-04-01 00:00:00 2015     4   1 Wednesday 00:00        -3.4           44       30        NW         10       100.83           Clear             Clear  TRUE FALSE FALSE FALSE  FALSE
# 2 2015-04-02 23:00:00 2015     4   2  Thursday 23:00         3.4           94       36         N          2        99.80        Rain,Fog         Rain, Fog FALSE  TRUE  TRUE FALSE  FALSE
# 3 2015-05-11 12:00:00 2015     5  11    Monday 12:00         9.5           93        3        NE         27       101.50 Mist,Shower,Fog Mist, Shower, Fog FALSE FALSE  TRUE  TRUE   TRUE
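
Once each weather label has its own logical column, row selection becomes ordinary logical indexing. A short sketch under the same assumptions as the data above, again reduced to the Weather column only; `fog_only` and `rain_and_fog` are illustrative names, not part of the original answer.

```r
# Cut-down stand-in for the question's data: only the Weather column
df <- data.frame(
  Weather = c("Clear", "Rain,Fog", "Mist,Shower,Fog"),
  stringsAsFactors = FALSE
)
Weather_split <- strsplit(df$Weather, ",")
Weather_levels <- unique(unlist(Weather_split))

# One logical indicator column per weather label
for (lvl in Weather_levels) {
  df[[lvl]] <- vapply(Weather_split, function(w) lvl %in% w, logical(1))
}

# Every row tagged Fog, whatever else it is tagged with
fog_only <- df[df$Fog, ]

# Rows tagged with both Rain and Fog
rain_and_fog <- df[df$Rain & df$Fog, ]

# Number of rows carrying each label
label_counts <- colSums(df[Weather_levels])
```

The indicator columns cost some width but combine naturally with `&`, `|` and `!`, which the list column does not.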

Edit:

If, as your question suggests, you really need factors rather than character vectors, that is entirely feasible:

df$Weather_split <- lapply(df$Weather_split, factor, levels = Weather_levels)
df$Weather_split
# [[1]]
# [1] Clear
# Levels: Clear Rain Fog Mist Shower
# 
# [[2]]
# [1] Rain Fog 
# Levels: Clear Rain Fog Mist Shower
# 
# [[3]]
# [1] Mist   Shower Fog   
# Levels: Clear Rain Fog Mist Shower
str(df$Weather_split)
# List of 3
#  $ : Factor w/ 5 levels "Clear","Rain",..: 1
#  $ : Factor w/ 5 levels "Clear","Rain",..: 2 3
#  $ : Factor w/ 5 levels "Clear","Rain",..: 4 5 3
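
Because every element shares the same level set, membership tests and tabulation behave consistently across rows. The sketch below rebuilds the factor list from a bare character vector as an assumption, so that it runs on its own; `Weather_fct`, `has_fog` and `counts` are illustrative names.

```r
# Stand-in for df$Weather from the question
Weather <- c("Clear", "Rain,Fog", "Mist,Shower,Fog")
Weather_split <- strsplit(Weather, ",")
Weather_levels <- unique(unlist(Weather_split))

# List of factors sharing one common set of levels
Weather_fct <- lapply(Weather_split, factor, levels = Weather_levels)

# %in% matches factors by label, so the Fog test works as it does on characters
has_fog <- vapply(Weather_fct, function(w) "Fog" %in% w, logical(1))

# Flatten to character, then count occurrences against the shared levels
all_labels <- unlist(lapply(Weather_fct, as.character))
counts <- table(factor(all_labels, levels = Weather_levels))
counts
```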