Create a new categorical column based on conditions of values in other columns in a dataframe in R
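Although it looks like this has been solved before, my problem is more challenging.
Let's start with an example I copied and pasted from a Calc sheet:
Here is the required minimal reproducible example: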
Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
"Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
"Catalog codes:", "Themes:", "Size:", "Score:",
"Series:", "Themes:", "Size:", "Score:", "Related items:",
"Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)
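Part of the R data frame I have contains a column like this one (Label) and many rows.
The point here is that a set of rows belongs to one category (say, Group 1, and so on). You can identify the different groups by the pink and white backgrounds in the image above.
Although the labels within each group follow an internal order, not all groups contain the same labels.
However, the starting and ending labels of each group stay the same, depending on which labels are present. You can see that Catalog codes: and Series: start each group, while Buy Now:, Score:, and Related items: end each group.
I would like to create a second column in this data frame that identifies these patterns or combinations of ending/starting labels and then classifies them. The result could look something like this image: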
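If you use grepl() to search for the start labels and the end labels, you can shift the end-label matches down by one row, see where a start label coincides with a shifted end label, and use cumsum() on that to create your group IDs. This ensures that you always group together everything between the first start label of a group and the last end label of a group, since a group can contain more than one start or end label.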
Label <- c("Catalog codes:" , "Themes:", "Size:", "Score:", "Buy Now:",
"Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
"Catalog codes:", "Themes:", "Size:", "Score:",
"Series:", "Themes:", "Size:", "Score:", "Related items:",
"Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- as.data.frame(Label)
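# A row opens a new group when it is a start label AND the previous row
# was an end label (the first row is treated as following an end label).
example$Group <- paste(
  "Group",
  cumsum(
    grepl("Catalog codes:|Series:", example$Label) &
      c(TRUE, head(grepl("Buy Now:|Score:|Related items:", example$Label), -1))
  )
)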
# Result
Label Group
1 Catalog codes: Group 1
2 Themes: Group 1
3 Size: Group 1
4 Score: Group 1
5 Buy Now: Group 1
6 Series: Group 2
7 Catalog codes: Group 2
8 Themes: Group 2
9 Related items: Group 2
10 Buy Now: Group 2
11 Catalog codes: Group 3
12 Themes: Group 3
13 Size: Group 3
14 Score: Group 3
15 Series: Group 4
16 Themes: Group 4
17 Size: Group 4
18 Score: Group 4
19 Related items: Group 4
20 Catalog codes: Group 5
21 Themes Group 5
22 Size: Group 5
23 Score: Group 5
24 Related items: Group 5
25 Buy Now: Group 5
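To make the one-liner above easier to follow, here is a minimal sketch that splits it into named steps. The intermediate names (`is_start`, `is_end`, `prev_is_end`, `new_group`, `group_id`) are hypothetical and introduced only for illustration; the logic is intended to match the code above, not to replace it.

```r
Label <- c("Catalog codes:", "Themes:", "Size:", "Score:", "Buy Now:",
           "Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
           "Catalog codes:", "Themes:", "Size:", "Score:",
           "Series:", "Themes:", "Size:", "Score:", "Related items:",
           "Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")
example <- data.frame(Label)

# TRUE where a row carries one of the start labels
is_start <- grepl("Catalog codes:|Series:", example$Label)

# TRUE where a row carries one of the end labels
is_end <- grepl("Buy Now:|Score:|Related items:", example$Label)

# Shift the end-label flags down by one row; the first row has no
# predecessor, so it is treated as if it followed an end label
prev_is_end <- c(TRUE, head(is_end, -1))

# A new group opens only where a start label directly follows an end label
new_group <- is_start & prev_is_end

# Running count of group openings gives the group ID for every row
group_id <- cumsum(new_group)
example$Group <- paste("Group", group_id)
```

Calling `which(new_group)` lists the rows where each group begins, which may help when checking the rule against real data.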
This answer does not exactly match your desired output; see the output below.
Please explain why row 6 does not get its own group, as asked in the comments.
library( data.table )
setDT(example)[, Group := paste0( "Group ",
cumsum( grepl( "^Catalog codes|^Series", Label ) )
) ]
# Label Group
# 1: Catalog codes: Group 1
# 2: Themes: Group 1
# 3: Size: Group 1
# 4: Score: Group 1
# 5: Buy Now: Group 1
# 6: Series: Group 2 <-- !!
# 7: Catalog codes: Group 3 <-- !!
# 8: Themes: Group 3
# 9: Related items: Group 3
# 10: Buy Now: Group 3
# 11: Catalog codes: Group 4
# 12: Themes: Group 4
# 13: Size: Group 4
# 14: Score: Group 4
# 15: Series: Group 5
# 16: Themes: Group 5
# 17: Size: Group 5
# 18: Score: Group 5
# 19: Related items: Group 5
# 20: Catalog codes: Group 6
# 21: Themes Group 6
# 22: Size: Group 6
# 23: Score: Group 6
# 24: Related items: Group 6
# 25: Buy Now: Group 6
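As one possible way to see where the two approaches diverge, the sketch below computes both sets of group IDs in base R and lists the start labels that do not open a new group once the preceding row is required to be an end label. The variable names (`naive_id`, `shifted_id`, `absorbed`) are hypothetical and used only for illustration.

```r
Label <- c("Catalog codes:", "Themes:", "Size:", "Score:", "Buy Now:",
           "Series:", "Catalog codes:", "Themes:", "Related items:", "Buy Now:",
           "Catalog codes:", "Themes:", "Size:", "Score:",
           "Series:", "Themes:", "Size:", "Score:", "Related items:",
           "Catalog codes:", "Themes", "Size:", "Score:", "Related items:", "Buy Now:")

is_start <- grepl("^Catalog codes|^Series", Label)
is_end <- grepl("Buy Now:|Score:|Related items:", Label)
prev_is_end <- c(TRUE, head(is_end, -1))

# Every start label opens a new group (the data.table answer's rule)
naive_id <- cumsum(is_start)

# A start label opens a new group only if the previous row was an end label
shifted_id <- cumsum(is_start & prev_is_end)

# Start labels that are absorbed into the group already in progress
absorbed <- which(is_start & !prev_is_end)

max(naive_id)    # 6 groups
max(shifted_id)  # 5 groups
absorbed         # 7
```

In this data, row 7 (`Catalog codes:`) directly follows `Series:`, which is a start label rather than an end label, so the shifted rule keeps rows 6 and 7 in the same group while a plain `cumsum()` over start labels splits them.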