Categorizing Data Based on Text Value in New Column

Question

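I'm trying to take an existing data frame that has a state column and add a new column called Region based on the state in each row. For example, any row with "CA" should be categorized as "West", and any row with "IL" should be categorized as "Midwest". There are 4 regions: West, South, Midwest, and Northeast.

I've tried doing this separately in 4 code blocks, like this: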

south <- c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL", "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV") #16 states
south.mdata <- mdata %>% filter(state %in% south)       #1832 locations
south.byyear <- south.mdata %>% group_by(Year) %>% summarize(s.total = n())
south.total <- data %>% filter(state %in% south) %>% group_by(Year) %>% summarize(yearly.total = n())

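But this seems repetitive and not the most efficient approach. I'd also like to be able to group_by year and region so that I can compare across regions.

I'm having trouble implementing this. My first thought was to use filter inside some kind of if/else loop, but I know loops aren't really the R way.

The original data looks like this: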

 Field.1    ID              title description                  streetaddress           city state
1      74 DE074    Cork 'n' Bottle             Route 14, 1 mile south of town Rehoboth Beach    DE
2      75 DE075    Cork 'n' Bottle             Route 14, 1 mile south of town Rehoboth Beach    DE
3      23 DE023          Dog House                           1200 DuPont Hwy.     Wilmington    DE
4      19 DE019          Dog House                            1200 DuPont Hwy     Wilmington    DE
5      26 DE026          Dog House                                1200 Dupont     Wilmington    DE
6      65 DE065 Henlopen Hotel Bar                           Boardwalk & Surf Rehoboth Beach    DE
  amenityfeatures             type Year notes       lon      lat
1         (M),(R)       Restaurant 1977  <NA> -75.07601 38.72095
2         (M),(R)       Restaurant 1976  <NA> -75.07601 38.72095
3         (M),(R)       Restaurant 1975  <NA> -75.58243 39.68839
4         (M),(R)       Restaurant 1976  <NA> -75.58243 39.68839
5         (M),(R)       Restaurant 1974  <NA> -75.58723 39.76705
6             (M) Bars/Clubs,Hotel 1972  <NA> -75.07712 38.72280
                                                                      status
1 Location could not be verified. General city or location coordinates used.
2 Location could not be verified. General city or location coordinates used.
3                                                   Google Verified Location
4                                                   Google Verified Location
5                                                   Google Verified Location
6                                                          Verified Location

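I'd like to add a new column called "Region" that goes through each row, looks at the state, and then adds a value to Region.

Any advice on the right syntax would be much appreciated! Thanks so much!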

Answer 1

Here is a snippet of the solution suggested in Gregor's comment.

library(tidyverse)

orig_data <- 
  tribble(~ID, ~state,
          1,   "CA",
          2,   "FL",
          3,   "DE")

region_lookup <- 
  tribble(~state, ~region,
          "CA",   "west",
          "FL",   "south",
          "DE",   "south")

left_join(orig_data, region_lookup)
#> Joining, by = "state"
#> # A tibble: 3 x 3
#>      ID state region
#>   <dbl> <chr> <chr> 
#> 1     1 CA    west  
#> 2     2 FL    south 
#> 3     3 DE    south

Created on 2020-11-02 by the reprex package (v0.3.0)
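
To connect this lookup-table idea back to the question, here is a minimal sketch that builds the lookup from per-region vectors (like the `south` vector in the question), joins it on, and counts locations per Year and Region in one pass. The sample rows, the `regions` list, and the shortened West/Midwest vectors are assumptions made for illustration, and the full state lists would need to be filled in.

```r
library(dplyr)

# Hypothetical sample data; only `state` and `Year` matter for this sketch
mdata <- tibble(
  ID    = c("DE074", "DE075", "CA001", "IL001"),
  state = c("DE", "DE", "CA", "IL"),
  Year  = c(1977, 1976, 1976, 1976)
)

# One character vector per region. `South` is the vector from the question;
# West and Midwest are deliberately incomplete placeholders.
regions <- list(
  South   = c("FL", "KY", "GA", "TX", "MS", "SC", "NC", "AL",
              "LA", "AR", "TN", "VA", "DC", "MD", "DE", "WV"),
  West    = c("CA"),
  Midwest = c("IL")
)

# Flatten the list into a two-column lookup table: one row per state
region_lookup <- tibble(
  state  = unlist(regions, use.names = FALSE),
  Region = rep(names(regions), lengths(regions))
)

# Add the Region column, then count rows for each Year/Region pair
yearly_totals <- mdata %>%
  left_join(region_lookup, by = "state") %>%
  count(Year, Region, name = "yearly.total")

print(yearly_totals)
```

Because Region is an ordinary column after the join, one pipeline can replace the four separate filter blocks, and any state missing from the lookup shows up as NA in Region rather than being dropped silently.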

Answer 2

The simplest solution is a join. For that you need a data.frame/tibble containing all states and their regions. Luckily, the data is already in base R:

library(dplyr)
# build the tibble of state abbreviation and region from base R data
state_region <- dplyr::tibble(state.abb, state.region)
# join it on your data.frame/tibble
ORIGINAL_DATA %>% 
  dplyr::left_join(state_region, by = c("state" = "state.abb"))

Now you should have a new column called "state.region" that you can group by. Note that the state abbreviations must be uppercase.
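
Two details of the built-in data are worth checking against the question: `state.abb` covers only the 50 states, so "DC" has no match, and `state.region` labels the Midwest as "North Central". The sketch below is one way to handle both and then group by year and region. The sample rows, the `Region` column name, and the choice to put DC in "South" (as the question's `south` vector does) are assumptions.

```r
library(dplyr)

# Hypothetical sample data standing in for ORIGINAL_DATA
orig_data <- tibble(
  state = c("DE", "DE", "CA", "IL", "DC"),
  Year  = c(1977, 1976, 1976, 1976, 1975)
)

# state.abb and state.region ship with base R (datasets package);
# convert the factor to character so labels can be edited freely
state_region <- tibble(
  state  = state.abb,
  Region = as.character(state.region)
)

# DC is not in state.abb, so add it by hand (assumed to belong to "South")
state_region <- bind_rows(state_region, tibble(state = "DC", Region = "South"))

region_by_year <- orig_data %>%
  mutate(state = toupper(state)) %>%   # the lookup uses uppercase abbreviations
  left_join(state_region, by = "state") %>%
  # rename the built-in "North Central" label to the question's "Midwest"
  mutate(Region = if_else(Region == "North Central", "Midwest", Region)) %>%
  count(Year, Region, name = "yearly.total")

print(region_by_year)
```

Any row whose Region is still NA after the join points to a state value that is missing from the lookup, such as a territory or a misspelled abbreviation.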


r

tidy