Split uneven strings into sorted columns in R
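A survey was conducted in which one question allowed multiple answers to be selected. When multiple answers were selected, they were all recorded in the same cell.
In addition, each surveyor recorded this information in the cell differently. Sometimes the delimiter is a hyphen (-), sometimes a forward slash (/). Some surveyors also listed the items with numbers.
One example is the list of items in a house (see below). I want to create a column for each item, filled in wherever the item is present (the new columns can hold 1/0 or item name/NA; see the outcome example below).
I can do this in Excel using Text to Columns and lookup arrays, but there are too many Excel sheets with this same column, so I have to do it in R. Sorry, I don't know how to build the example table with R code, but hopefully someone can help.
The data looks like this: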
House <- c("h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8", "h9", "h10", "h11")
Items <- c("Chair", "Chair- Window/Glass- ", " Door- Sofa-",
           "Chair- Window/Glass Frame- ", "1. Window/Glass Frame",
           "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table",
           "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach",
           "2. Door / Chair")
# Combine both vectors into one data frame (no data.table needed).
df <- data.frame(House, Items, stringsAsFactors = FALSE)
df
+-------+------------------------------------------+
| House | Items                                    |
+-------+------------------------------------------+
| 001   | Chair                                    |
| 002   | Chair- Window/Glass-                     |
| 003   | Door- Sofa-                              |
| 004   | Chair- Window/Glass Frame-               |
| 005   | 1. Window/Glass Frame                    |
| 006   | Chair- Door- Window-                     |
| 007   | Chair- Sofa - Door- Table-               |
| 008   | 4. Table                                 |
| 009   | Couch (2)                                |
| 010   | Window- Table- Chair- Sofa- Door- Couach |
| 011   | 2. Door / Chair                          |
+-------+------------------------------------------+
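My idea is to split on all the delimiters (strsplit), strip the whitespace (trimws), get a unique list (unique), then replace all the variants with the standard names I want (grepl), and finally put them into columns by category.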
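# Base R: split on the delimiters, then trim every piece.
# strsplit() returns a list, so trimws() has to be applied per element.
items <- strsplit(df$Items, "[/.-]")
items <- lapply(items, trimws)
# The same steps with pipes (stringr also provides %>%).
library(stringr)
items <- df$Items %>%
  strsplit("[/.-]") %>%
  lapply(str_trim, side = "both")
# Unique labels across all houses.
items_list <- unique(unlist(items))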
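The plan described above could be completed along the following lines. This is only a minimal base R sketch under stated assumptions: the `lookup` vector and the rule of keeping just the first word of each piece are hypothetical choices made for illustration, and any spelling variant that is missing from `lookup` is silently dropped.

```r
# Sample data from the question.
df <- data.frame(
  House = sprintf("h%d", 1:11),
  Items = c("Chair", "Chair- Window/Glass- ", " Door- Sofa-",
            "Chair- Window/Glass Frame- ", "1. Window/Glass Frame",
            "Chair- Door- Window-", "Chair- Sofa - Door- Table-",
            " 4. Table", "Couch (2)",
            "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup (an assumption): lower-case keyword -> standard category.
lookup <- c(chair = "Chair", sofa = "Chair", couch = "Chair", couach = "Chair",
            window = "Window", glass = "Window",
            table = "Table", door = "Door")

# Split on "-" or "/", then reduce every piece to a single lower-case keyword.
pieces <- strsplit(df$Items, "[-/]")
clean <- lapply(pieces, function(p) {
  p <- tolower(trimws(gsub("[^A-Za-z ]", "", p)))  # drop digits, dots, brackets
  p <- sub(" .*$", "", p)                          # "glass frame" -> "glass"
  p[nzchar(p)]                                     # drop empty pieces
})

# Map keywords to categories; keywords not in the lookup become NA and are dropped.
cats <- lapply(clean, function(p) unique(na.omit(unname(lookup[p]))))

# One column per category: the category name if present, otherwise NA.
categories <- unique(lookup)
wide <- sapply(categories, function(k) {
  present <- vapply(cats, function(x) k %in% x, logical(1))
  ifelse(present, k, NA_character_)
})
result <- cbind(df["House"], wide)
result
```

Replacing `ifelse(present, k, NA_character_)` with `as.integer(present)` would give 1/0 columns instead of item name/NA.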
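This is what I want to get:
(Window and Glass are the same, chair/sofa/couch are the same, and so on, so I only need to create the broader categories rather than several columns for essentially the same thing.)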
Outcome
+-------+-------+--------+-------+------+
| House | Chair | Window | Table | Door |
+-------+-------+--------+-------+------+
| 001   | Chair |        |       |      |
| 002   | Chair | Window |       |      |
| 003   | Chair |        |       | Door |
| 004   | Chair | Window |       |      |
| 005   |       | Window |       |      |
| 006   | Chair | Window |       | Door |
| 007   | Chair |        | Table | Door |
| 008   |       |        | Table |      |
| 009   | Chair |        |       |      |
| 010   | Chair | Window | Table | Door |
| 011   | Chair |        |       | Door |
+-------+-------+--------+-------+------+
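You can use str_detect (or grepl) inside map_df (or sapply) to generate a logical data frame, coerce it to integer 0/1, and then bind it to your original data frame. This approach bypasses the hassle of splitting and cleaning the data. It only requires that you first create the pattern groups for the regular expressions, i.e. chair|sofa|couach|couch, window|glass: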
library(stringr)
library(dplyr)
library(purrr)
# Create regex pattern groups.
patts <- c(chair = "chair|sofa|couach|couch", window = "window|glass",
           table = "table", door = "door")
# Detect pattern groups, coerce to 0/1, bind to original data frame.
map_df(patts, ~ str_detect(df$Items, regex(.x, ignore_case = TRUE))) %>%
  mutate_all(as.integer) %>%
  bind_cols(df, .)
This returns the following data frame:
# A tibble: 11 x 6
   House Items                                    chair window table  door
   <dbl> <chr>                                    <int>  <int> <int> <int>
 1     1 Chair                                        1      0     0     0
 2     2 "Chair- Window/Glass- "                      1      1     0     0
 3     3 " Door- Sofa-"                               1      0     0     1
 4     4 "Chair- Window/Glass Frame- "                1      1     0     0
 5     5 1. Window/Glass Frame                        0      1     0     0
 6     6 Chair- Door- Window-                         1      1     0     1
 7     7 Chair- Sofa - Door- Table-                   1      0     1     1
 8     8 " 4. Table"                                  0      0     1     0
 9     9 Couch (2)                                    1      0     0     0
10    10 Window- Table- Chair- Sofa- Door- Couach     1      1     1     1
11    11 2. Door / Chair                              1      0     0     1
Data:
df <- tibble(
  House = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  Items = c("Chair", "Chair- Window/Glass- ", " Door- Sofa-",
            "Chair- Window/Glass Frame- ", "1. Window/Glass Frame",
            "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table",
            "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach",
            "2. Door / Chair")
)
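For reference, the sapply/grepl variant mentioned above might look roughly like this in base R. It is a sketch rather than a tested drop-in: it rebuilds the sample data as a plain data.frame so that no packages are needed, and the names `hits`, `flags` and `labels` are only illustrative.

```r
# Sample data as a plain data.frame (same values as the tibble above).
df <- data.frame(
  House = 1:11,
  Items = c("Chair", "Chair- Window/Glass- ", " Door- Sofa-",
            "Chair- Window/Glass Frame- ", "1. Window/Glass Frame",
            "Chair- Door- Window-", "Chair- Sofa - Door- Table-",
            " 4. Table", "Couch (2)",
            "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair"),
  stringsAsFactors = FALSE
)

# Same regex pattern groups as in the answer.
patts <- c(chair = "chair|sofa|couach|couch", window = "window|glass",
           table = "table", door = "door")

# Logical matrix: one row per house, one column per pattern group.
hits <- sapply(patts, function(p) grepl(p, df$Items, ignore.case = TRUE))

# 0/1 version: multiplying by 1L turns TRUE/FALSE into integers.
flags <- cbind(df, hits * 1L)

# Item name / NA version: put the column name wherever there is a match.
labels <- ifelse(hits, colnames(hits)[col(hits)], NA_character_)
named <- cbind(df["House"], labels)

flags
named
```

Because the patterns are matched against the whole cell, no splitting or trimming is needed; the trade-off is that a short pattern can also match inside a longer, unrelated word, so the pattern groups should be checked against the real data.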