Split a string and extract by pattern to form a data frame

I am trying to split the following strings into 3 separate columns (Country, City, Count) in R, like this:

Country    City     Count    
Japan      Tokyo    361

Data:

"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"

I tried this:

library(data.table)
library(stringr)

df <- read.table(file.choose(), header = FALSE, sep = ",", colClasses = c('character', 'character'), na.strings = 'null')

df.1 <- data.table(str = as.character(df$V1))

df.2 <- df.1[grepl("country=.+&city=\\w+", str),
             `:=`(country = str_extract(str, "(?<=country=)(.+)"),
                  city = str_extract(str, "(?<=city=)(.+)"))]

With this, the city column is formatted the way I want, but the country column comes back like this:

Japan&city=Tokyo

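I would like to remove the &city=Tokyo bit so that the column is clean.

After that, I merge df and df.2 together so that the counts line up. However, I think there must be a smarter way to do this.

Please share your knowledge with me. Thanks for your help.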

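We can use base R strsplit to split the 'V1' column on = or & into a list, loop through the list, extract the alternate elements (x[c(FALSE, TRUE)]) while naming them with the remaining elements, rbind the list elements, and then cbind the result with the original second column renamed to 'Count'.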

res <- do.call(rbind, lapply(strsplit(as.character(df$V1), "[=&]"), 
             function(x) setNames(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)])))
res1 <- cbind(res, setNames(df[-1], 'Count'))
res1
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223
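
To make the alternating-index step concrete, here is a small sketch on a single string. The variable names (parts, keys, values, named_row) are illustrative only and are not part of the code above.

```r
# One input string, in the same shape as the values in column V1
s <- "country=United Kingdom&city=London"

# Split on either '=' or '&'; the pieces alternate key, value, key, value
parts <- strsplit(s, "[=&]")[[1]]

# Logical indices are recycled along the vector:
#   c(TRUE, FALSE) keeps positions 1, 3, ... (the keys)
#   c(FALSE, TRUE) keeps positions 2, 4, ... (the values)
keys   <- parts[c(TRUE, FALSE)]
values <- parts[c(FALSE, TRUE)]

# A named character vector; rbind-ing one of these per row yields the matrix 'res'
named_row <- setNames(values, keys)
named_row
```

This only works as long as the values themselves never contain = or &.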

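We can also do this with tidyverse. Create a row index column (rownames_to_column from tibble), split 'V1' on the delimiter '&' (separate_rows) to reshape it to 'long' format, separate 'V1' into new columns ('new1' and 'new2') by specifying 'sep' as =, reshape the dataset to 'wide' (spread), and reorder the columns (select).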

library(tidyverse)
rownames_to_column(df, 'rn') %>%
      separate_rows(V1, sep='[&]') %>% 
      separate(V1, into= c("new1", "new2"), sep="=")  %>% 
      spread(new1, new2) %>% 
      select(country, city, Count=V2) 
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223
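
In current tidyr releases spread is superseded by pivot_wider. The following is a sketch of the same pipeline with that one step swapped, assuming recent versions of dplyr, tidyr and tibble are installed. The column names key and value are arbitrary choices, and the input is rebuilt here as a plain character column so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same data as in the question, with V1 as character rather than factor
df <- data.frame(
  V1 = c("country=Japan&city=Tokyo",
         "country=Spain&city=Barcelona",
         "country=United Kingdom&city=London"),
  V2 = c(361L, 359L, 333L),
  stringsAsFactors = FALSE
)

res <- df %>%
  rownames_to_column("rn") %>%                          # row id so pairs can be regrouped
  separate_rows(V1, sep = "&") %>%                      # one key=value pair per row
  separate(V1, into = c("key", "value"), sep = "=") %>% # split each pair
  pivot_wider(names_from = key, values_from = value) %>%# back to one row per input
  select(country, city, Count = V2)
res
```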

Data

df <- structure(list(V1 = structure(c(2L, 3L, 4L, 1L, 5L), 
.Label = c("country=Japan&city=Fukuoka", 
 "country=Japan&city=Tokyo", "country=Spain&city=Barcelona", 
  "country=United Kingdom&city=London", 
"country=United States of America&city=New York City"), class = "factor"), 
V2 = c(361L, 359L, 333L, 259L, 223L)), .Names = c("V1", "V2"
), row.names = c(NA, -5L), class = "data.frame")
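
For comparison, the lookbehind attempt from the question can also be repaired directly. The country column picks up &city=... because .+ is greedy and happily matches the & separator. One possible fix, assuming the values never contain a literal &, is to match [^&]+ instead. The sketch below uses base R regexpr and regmatches with perl = TRUE for the lookbehind; extract_field is a hypothetical helper name, not an existing function.

```r
# Example input in the same shape as the question's data
df <- data.frame(
  V1 = c("country=Japan&city=Tokyo",
         "country=United Kingdom&city=London",
         "country=United States of America&city=New York City"),
  V2 = c(361L, 333L, 223L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the text after "<key>=" up to the next '&'
extract_field <- function(x, key) {
  pattern <- paste0("(?<=", key, "=)[^&]+")
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- m != -1L                          # regexpr returns -1 where nothing matched
  out <- rep(NA_character_, length(x))     # NA where the key is absent
  out[hit] <- regmatches(x, m)
  out
}

res <- data.frame(
  country = extract_field(df$V1, "country"),
  city    = extract_field(df$V1, "city"),
  Count   = df$V2,
  stringsAsFactors = FALSE
)
res
```

With stringr loaded, the same pattern can be passed to str_extract, for example str_extract(str, "(?<=country=)[^&]+").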

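What you have is actually a URL-encoded query, so you can decode it with httr::parse_url. Two complications:

  1. parse_url looks for a ? in front of the query to identify it, so you have to paste0 one on, and
  2. parse_url is not vectorized, so you have to apply it to each query via lapply or purrr::map.

Mostly, though, it works pretty well: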

library(tidyverse)

df <- read_csv('"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"', 
               col_names = c('query', 'count'))

df %>% transmute(count, 
                 query = map(paste0('?', query), 
                             ~as_tibble(httr::parse_url(.x)$query))) %>% 
    unnest(query)

#> # A tibble: 5 × 3
#>   count                  country          city
#>   <int>                    <chr>         <chr>
#> 1   361                    Japan         Tokyo
#> 2   359                    Spain     Barcelona
#> 3   333           United Kingdom        London
#> 4   259                    Japan       Fukuoka
#> 5   223 United States of America New York City

or even just

df %>% do(data.frame(count = .$count, 
                     query = map_df(paste0('?', .$query), 
                                    ~httr::parse_url(.x)$query)))

or use curlconverter::parse_query or shiny::parseQueryString, which don't need the extra ?

df %>% bind_cols(map_df(.$query, curlconverter::parse_query)) %>% select(-query)

All of these return the same thing.
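
If an extra package dependency is unwanted, the query-string idea can be approximated in base R. This sketch is not how httr works internally. parse_query_base is a hypothetical helper that splits on & and =, then percent-decodes each value with utils::URLdecode. It assumes every pair has the form key=value and that all rows share the same keys.

```r
# Hypothetical helper: parse one "k1=v1&k2=v2" string into a named character vector
parse_query_base <- function(q) {
  pairs  <- strsplit(strsplit(q, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys   <- vapply(pairs, function(p) p[1], character(1))
  values <- vapply(pairs, function(p) utils::URLdecode(p[2]), character(1))
  setNames(values, keys)
}

queries <- c("country=Japan&city=Tokyo",
             "country=United%20Kingdom&city=London")  # second value is percent-encoded
counts  <- c(361L, 333L)

# The helper handles one string at a time, so apply it per query and bind the rows
parsed <- do.call(rbind, lapply(queries, parse_query_base))
res <- data.frame(parsed, count = counts, stringsAsFactors = FALSE)
res
```

Note that utils::URLdecode only handles percent-escapes; a + used to encode a space is left unchanged.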