R - Yelp data: business category column has multiple categories per business. Want to separate into category-specific columns with values of 1 and 0
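Thanks in advance to anyone willing to help with this.
I'm using the Yelp dataset, and the question I want to answer is "which categories are positively correlated with higher stars for X category (Bars for example)".
The problem I'm running into is that, for each business, all of the categories are lumped together in a single column and a single row per business_id. So I need a way to separate each category, turn the categories into columns, and then check whether the original categories column contains the category each column was created for.
My current thinking is to group_by business_id, then unnest_tokens the column, then run model.matrix() on that column to get the split I want, and then join the result back to the df I'm working with. But I can't get model.matrix to run and keep business_id attached to each row.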
# an example of what I am using #
library(dplyr)
library(tidytext)

df <-
  tibble(business_id = c("bus_1",
                         "bus_2",
                         "bus_3"),
         categories = c("Pizza, Burgers, Caterers",
                        "Pizza, Restaurants, Bars",
                        "American, Barbeque, Restaurants"))
# what I want it to look like #
desired_df <-
  tibble(business_id = c("bus_1",
                         "bus_2",
                         "bus_3"),
         categories = c("Pizza, Burgers, Caterers",
                        "Pizza, Restaurants, Bars",
                        "American, Barbeque, Restaurants"),
         Pizza = c(1, 1, 0),
         Burgers = c(1, 0, 0),
         Caterers = c(1, 0, 0),
         Restaurants = c(0, 1, 1),
         Bars = c(0, 1, 0),
         American = c(0, 0, 1),
         Barbeque = c(0, 0, 1))
# where I am stuck #
df %>%
  select(business_id, categories) %>%
  group_by(business_id) %>%
  unnest_tokens(categories, categories, token = 'regex', pattern = ", ") %>%
  model.matrix(business_id ~ categories, data = .) %>%
  as_tibble()
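Edit: After this post and the answers below, I ran into a duplicate identifiers error when using spread(). That led me to this thread, https://github.com/tidyverse/tidyr/issues/426, where an answer to my problem was posted. I have re-pasted it below.
# Reproduce the error with a smaller data.frame #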
library(tidyverse)
df <- structure(list(age = c("21", "17", "32", "29", "15"),
gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")),
row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
df
#> # A tibble: 5 x 2
#> age gender
#> <chr> <fct>
#> 1 21 Male
#> 2 17 Female
#> 3 32 Female
#> 4 29 Male
#> 5 15 Male
df %>%
spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
# Fix the problem #
df %>%
group_by_at(vars(-age)) %>% # group by everything other than the value column.
mutate(row_id=1:n()) %>% ungroup() %>% # build group index
spread(key=gender, value=age) %>% # spread
select(-row_id) # drop the index
#> # A tibble: 3 x 2
#> Female Male
#> <chr> <chr>
#> 1 17 21
#> 2 32 29
#> 3 NA 15
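For the category data itself, one way the same error could arise is a business that lists the same category twice, since that produces two identical business/category rows. The sketch below assumes that is the cause and removes the repeats with distinct() before spreading. `dup_df` and `indicators` are made-up names, and the data is hypothetical rather than taken from the Yelp dataset.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: bus_1 lists "Pizza" twice
dup_df <- tibble(
  business_id = c("bus_1", "bus_2"),
  categories  = c("Pizza, Burgers, Pizza", "Pizza, Bars")
)

indicators <- dup_df %>%
  separate_rows(categories, sep = ", ") %>%  # one row per business/category pair
  distinct(business_id, categories) %>%      # drop repeated pairs so keys are unique
  mutate(value = 1) %>%
  spread(categories, value, fill = 0)        # categories absent for a business become 0

indicators
```

Without the distinct() step, spread() would see two rows for bus_1/Pizza and stop with a duplicate-key error.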
Here is a simple tidyverse solution:
library(tidyverse)
df %>%
mutate(
ind = 1,
tmp = strsplit(categories, ", ")
) %>%
unnest(tmp) %>%
spread(tmp, ind, fill = 0)
## A tibble: 3 x 9
#  business_id categories                      American Barbeque  Bars Burgers Caterers Pizza Restaurants
#  <chr>       <chr>                              <dbl>    <dbl> <dbl>   <dbl>    <dbl> <dbl>       <dbl>
#1 bus_1       Pizza, Burgers, Caterers               0        0     0       1        1     1           0
#2 bus_2       Pizza, Restaurants, Bars               0        0     1       0        0     1           1
#3 bus_3       American, Barbeque, Restaurants        1        1     0       0        0     0           1
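spread() has since been superseded in tidyr by pivot_wider(), so the same idea can also be sketched with the newer verbs. This assumes tidyr 1.1.0 or later (for a scalar `values_fill`). The helper column `category` and the result name `wide` are illustrative names, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Example data copied from the question
df <- tibble(
  business_id = c("bus_1", "bus_2", "bus_3"),
  categories  = c("Pizza, Burgers, Caterers",
                  "Pizza, Restaurants, Bars",
                  "American, Barbeque, Restaurants")
)

wide <- df %>%
  mutate(category = categories) %>%          # copy, so the original column survives
  separate_rows(category, sep = ", ") %>%    # one row per business/category pair
  mutate(value = 1) %>%
  pivot_wider(names_from = category,
              values_from = value,
              values_fill = 0)               # categories a business lacks become 0

wide
```

Unlike spread(), pivot_wider() orders the new columns by first appearance rather than alphabetically, so the columns here start with Pizza, Burgers, Caterers.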
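Building on your good use of tidytext::unnest_tokens(), you could also use this alternative solution. Note that unnest_tokens() lowercases tokens by default, which is why the column names below are in lower case; pass to_lower = FALSE to keep the original capitalisation.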
library(dplyr)
library(tidyr)
library(tidytext)
df %>%
select(business_id, categories) %>%
group_by(business_id) %>%
unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
mutate(value = 1) %>%
spread(categories, value, fill = 0)
# business_id american barbeque  bars burgers caterers pizza restaurants
# <chr>          <dbl>    <dbl> <dbl>   <dbl>    <dbl> <dbl>       <dbl>
# bus_1              0        0     0       1        1     1           0
# bus_2              0        0     1       0        0     1           1
# bus_3              1        1     0       0        0     0           1
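If you would rather keep model.matrix() from the question, one possible base R sketch is to build the indicator matrix on the long data (one row per business/category pair) and then collapse it back to one row per business with rowsum(), using business_id as the grouping vector. The names `parts`, `long`, `mm`, `indicators` and `result` are illustrative, and the example data is copied from the question. Treat this as a sketch under those assumptions rather than a tested recipe for the full Yelp data.

```r
# Example data copied from the question
df <- data.frame(
  business_id = c("bus_1", "bus_2", "bus_3"),
  categories  = c("Pizza, Burgers, Caterers",
                  "Pizza, Restaurants, Bars",
                  "American, Barbeque, Restaurants"),
  stringsAsFactors = FALSE
)

# Long format: repeat each business_id once per category it lists
parts <- strsplit(df$categories, ", ", fixed = TRUE)
long <- data.frame(
  business_id = rep(df$business_id, lengths(parts)),
  category    = unlist(parts)
)

# "- 1" drops the intercept, so every category gets its own 0/1 column
mm <- model.matrix(~ category - 1, data = long)
colnames(mm) <- sub("^category", "", colnames(mm))

# Sum the indicator rows within each business, then cap at 1
# in case a business lists the same category more than once
indicators <- rowsum(mm, group = long$business_id)
indicators[indicators > 1] <- 1

# Reattach to the original data, matching rows by business_id
result <- cbind(df, indicators[df$business_id, , drop = FALSE])
rownames(result) <- NULL
result
```

The formula in the question, `business_id ~ categories`, treats business_id as a response, and model.matrix() only returns the predictor columns, which is why the id goes missing. Keeping the id in a separate vector and grouping on it afterwards sidesteps that.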