Categorizing Data Based on Text Value in New Column
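I am trying to take an existing data frame that has a state column and add a new column called Region based on the state in each row. For example, any row with "CA" should be categorized as "West", and any row with "IL" should be categorized as "Midwest". There are 4 regions: West, South, Midwest, and Northeast.
I tried doing this separately in 4 code blocks, like this: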
south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south) #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())
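But this seems repetitive and is not the most efficient approach. I would also like to be able to group_by year and region so that I can compare across regions.
I am having trouble implementing this. My first thought was some kind of if/else loop using filter, but I know loops are not really the R way.
The original data looks like this: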
Field.1 ID title description streetaddress city state
1 74 DE074 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
2 75 DE075 Cork 'n' Bottle Route 14, 1 mile south of town Rehoboth Beach DE
3 23 DE023 Dog House 1200 DuPont Hwy. Wilmington DE
4 19 DE019 Dog House 1200 DuPont Hwy Wilmington DE
5 26 DE026 Dog House 1200 Dupont Wilmington DE
6 65 DE065 Henlopen Hotel Bar Boardwalk & Surf Rehoboth Beach DE
amenityfeatures type Year notes lon lat
1 (M),(R) Restaurant 1977 <NA> -75.07601 38.72095
2 (M),(R) Restaurant 1976 <NA> -75.07601 38.72095
3 (M),(R) Restaurant 1975 <NA> -75.58243 39.68839
4 (M),(R) Restaurant 1976 <NA> -75.58243 39.68839
5 (M),(R) Restaurant 1974 <NA> -75.58723 39.76705
6 (M) Bars/Clubs,Hotel 1972 <NA> -75.07712 38.72280
status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3 Google Verified Location
4 Google Verified Location
5 Google Verified Location
6 Verified Location
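I want to add a new column called "Region" by going through each row, looking at the state, and then adding a value to Region.
Any suggestions on the correct syntax would be much appreciated! Thank you very much!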
Here is a snippet of the solution suggested in Gregor's comment.
library(tidyverse)
orig_data <-
tribble(~ID, ~state,
1, "CA",
2, "FL",
3, "DE")
region_lookup <-
tribble(~state, ~region,
"CA", "west",
"FL", "south",
"DE", "south")
left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#> ID state region
#> <dbl> <chr> <chr>
#> 1 1 CA west
#> 2 2 FL south
#> 3 3 DE south
Created on 2020-11-02 by the reprex package (v0.3.0)
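With a lookup table like this, any state missing from the lookup ends up with an NA region after the left join. A minimal sketch of how one might check for such rows is below; the extra "IL" row is a made-up example added only to trigger the mismatch, and the tables are rebuilt with `tibble()` so the snippet only needs dplyr.

```r
library(dplyr)

# Same toy data as above, plus a hypothetical row ("IL")
# that has no entry in the lookup table
orig_data <- tibble(
  ID    = c(1, 2, 3, 4),
  state = c("CA", "FL", "DE", "IL")
)

region_lookup <- tibble(
  state  = c("CA", "FL", "DE"),
  region = c("west", "south", "south")
)

# Rows of orig_data whose state has no match in region_lookup
unmatched <- anti_join(orig_data, region_lookup, by = "state")
unmatched

# The same rows show up with region = NA after the left join
joined <- left_join(orig_data, region_lookup, by = "state")
filter(joined, is.na(region))
```

If `unmatched` is not empty, the lookup table probably needs more rows before grouping by region.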
The simplest solution is a join. For that you need a data.frame/tibble containing all states and their regions. Fortunately, this data already ships with base R:
library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>%
dplyr::left_join(state_region, by = c("state" = "state.abb"))
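You should now have a new column called "state.region" that you can group by. Note that the state abbreviations must be uppercase to match state.abb.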
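To get from the joined data to the per-year, per-region totals the question asks for, something like the following sketch might work. The `locations` tibble is made-up sample data standing in for the real data frame. Two further points are assumptions on my part: base R's `state.abb` only covers the 50 states, so a "DC" row is added as "South" to follow the question's `south` vector, and base R labels the Midwest as "North Central", so that label is renamed to match the question's wording.

```r
library(dplyr)

# Hypothetical sample data standing in for the real data frame
locations <- tibble(
  state = c("CA", "IL", "DE", "DE", "DC", "ca"),
  Year  = c(1975, 1975, 1976, 1976, 1976, 1977)
)

# Lookup table built from base R's state.abb / state.region.
# state.region is a factor, so convert it to character before editing labels.
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
) %>%
  # Assumption: rename base R's "North Central" to the question's "Midwest"
  mutate(Region = ifelse(Region == "North Central", "Midwest", Region)) %>%
  # Assumption: count DC as "South", as in the question's `south` vector
  bind_rows(tibble(state = "DC", Region = "South"))

region_totals <- locations %>%
  mutate(state = toupper(state)) %>%          # uppercase so the join keys match
  left_join(state_region, by = "state") %>%
  count(Year, Region, name = "yearly.total")  # one row per Year/Region pair

region_totals
```

Here `count(Year, Region)` is shorthand for `group_by(Year, Region) %>% summarize(n = n())`, so a single pipeline replaces the four separate per-region blocks.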