Split a string and extract by pattern to form a data frame
I am trying to split the following strings into 3 separate columns (country, city, count) in R:
Country City Count
Japan Tokyo 361
Data:
"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"
I tried this:
library(data.table)
library(stringr)
df <- read.table(file.choose(), header = FALSE, sep = ",", colClasses = c('character', 'character'), na.strings = 'null')
df.1 <- data.table(str = as.character(df$V1))
df.2 <- df.1[grepl("country=.+&city=\\w+", str),
             country := str_extract(str,"(?<=country=)(.+)"),
             city := str_extract(str, "(?<=city=)(.+)")]
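However, while the city comes out the way I want, the country column is returned like this:
Japan&city=Tokyo
I would like to remove the &city=Tokyo part so the value is clean.
I would then merge df and df.2 to line up the counts. However, I think there must be a smarter way to do this.
Please share your knowledge with me. Thanks for your help.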
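We can do this in base R. Split the 'V1' column on = and & with strsplit, which returns a list; loop over the list and extract the alternate elements (x[c(FALSE, TRUE)]), naming them with the remaining elements; rbind the list elements; then cbind the result with the original second column, renamed to 'Count'.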
res <- do.call(rbind, lapply(strsplit(as.character(df$V1), "[=&]"),
          function(x) setNames(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)])))
res1 <- cbind(res, setNames(df[-1], 'Count'))
res1
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223
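To make the alternating-index step above easier to follow, here is a minimal sketch that walks through a single string from the question. The variable names (parts, keys, values, row) are illustrative only and do not appear in the answer.

```r
# One input string, taken from the question's data
x <- "country=United Kingdom&city=London"

# Split on either "=" or "&"; keys and values alternate in the result
parts <- strsplit(x, "[=&]")[[1]]

# A logical index is recycled along the vector, so c(TRUE, FALSE) keeps
# positions 1, 3, ... (the keys) and c(FALSE, TRUE) keeps 2, 4, ... (the values)
keys   <- parts[c(TRUE, FALSE)]
values <- parts[c(FALSE, TRUE)]

# Name the values by their keys; each such named vector becomes one row
# once the list is combined with rbind
row <- setNames(values, keys)
print(row)
```

Because every string has the same keys in the same order, rbind can stack the named vectors into a matrix whose column names come from the keys.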
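We can also do this with tidyverse. Create a row index column (rownames_to_column from tibble), split 'V1' on the delimiter '&' (separate_rows) to reshape to 'long' format, split 'V1' into new columns ('new1' and 'new2') by specifying 'sep' as =, reshape the dataset back to 'wide' (spread), and reorder the columns (select).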
library(tidyverse)
rownames_to_column(df, 'rn') %>%
    separate_rows(V1, sep='[&]') %>%
    separate(V1, into= c("new1", "new2"), sep="=") %>%
    spread(new1, new2) %>%
    select(country, city, Count=V2)
#                   country          city Count
#1                    Japan         Tokyo   361
#2                    Spain     Barcelona   359
#3           United Kingdom        London   333
#4                    Japan       Fukuoka   259
#5 United States of America New York City   223
Data
df <- structure(list(V1 = structure(c(2L, 3L, 4L, 1L, 5L),
.Label = c("country=Japan&city=Fukuoka",
"country=Japan&city=Tokyo", "country=Spain&city=Barcelona",
"country=United Kingdom&city=London",
"country=United States of America&city=New York City"), class = "factor"),
V2 = c(361L, 359L, 333L, 259L, 223L)), .Names = c("V1", "V2"
), row.names = c(NA, -5L), class = "data.frame")
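For comparison, the regex attempt in the question can also be repaired directly. The pattern (.+) is greedy and runs to the end of the string, which is why the country came back as Japan&city=Tokyo. The sketch below is not taken from either answer: it uses base R only, and extract_value is a hypothetical helper name. It stops each match at the next & by using [^&]+ instead.

```r
x <- c("country=Japan&city=Tokyo",
       "country=United States of America&city=New York City")

# Hypothetical helper: return the text that follows "<key>=" up to the next "&".
# perl = TRUE is needed for the lookbehind (?<=...).
# Note: regmatches() drops elements with no match, so this assumes
# every string contains the key.
extract_value <- function(x, key) {
  pattern <- paste0("(?<=", key, "=)[^&]+")
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

country <- extract_value(x, "country")
city    <- extract_value(x, "city")

data.frame(country, city, stringsAsFactors = FALSE)
```

The same pattern should also work with stringr::str_extract, which has the advantage of returning NA for non-matching strings and so keeps the result aligned with the input.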
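What you have is actually a URL-encoded query, so you can use httr::parse_url to decode it. Two complications:
parse_url looks for a ? before the query to identify it, so you have to paste0 one on, and
parse_url is not vectorized, so you have to apply it to each query via lapply or purrr::map.
For the most part, though, it works pretty well: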
library(tidyverse)
df <- read_csv('"country=Japan&city=Tokyo","361"
"country=Spain&city=Barcelona","359"
"country=United Kingdom&city=London","333"
"country=Japan&city=Fukuoka","259"
"country=United States of America&city=New York City","223"',
col_names = c('query', 'count'))
df %>% transmute(count,
                 query = map(paste0('?', query),
                             # parse_url() returns a list; its $query element holds the key/value pairs
                             ~as_tibble(httr::parse_url(.x)$query))) %>%
    unnest(query)
#> # A tibble: 5 × 3
#>   count                  country          city
#>   <int>                    <chr>         <chr>
#> 1   361                    Japan         Tokyo
#> 2   359                    Spain     Barcelona
#> 3   333           United Kingdom        London
#> 4   259                    Japan       Fukuoka
#> 5   223 United States of America New York City
or even just
df %>% do(data.frame(count = .$count,
query = map_df(paste0('?', .$query),
~httr::parse_url(.x)$query)))
or use curlconverter::parse_query or shiny::parseQueryString, which don't need the extra ?:
df %>% bind_cols(map_df(.$query, curlconverter::parse_query)) %>% select(-query)
All of these return the same thing.
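If installing httr, curlconverter, or shiny is not an option, the same idea can be sketched in base R. This is only a rough approximation of what those parsers do, under stated assumptions: parse_query_base is a hypothetical name, values are assumed not to contain = themselves, and a + is not converted to a space. Percent-escapes such as %20 are decoded with utils::URLdecode. The second query string is a made-up percent-encoded variant of the question's data, used to show the decoding.

```r
# Hypothetical helper: parse one "k1=v1&k2=v2" query string into a one-row data frame
parse_query_base <- function(q) {
  pairs <- strsplit(q, "&", fixed = TRUE)[[1]]
  kv    <- strsplit(pairs, "=", fixed = TRUE)
  keys  <- vapply(kv, function(p) p[1], character(1))
  # A key with no value ("flag" or "flag=") gets an empty string
  vals  <- vapply(kv, function(p) if (length(p) > 1) p[2] else "", character(1))
  # Decode percent-escapes such as %20
  vals  <- vapply(vals, utils::URLdecode, character(1), USE.NAMES = FALSE)
  as.data.frame(as.list(setNames(vals, keys)), stringsAsFactors = FALSE)
}

queries <- c("country=Japan&city=Tokyo",
             "country=United%20Kingdom&city=London")

# The helper is not vectorized, so apply it per query and stack the rows
decoded <- do.call(rbind, lapply(queries, parse_query_base))
print(decoded)
```

The dedicated parsers remain the safer choice for real URLs, since they handle edge cases this sketch ignores.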