Merge rows by pattern in R
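I am trying to merge rows by pattern.
The data frame has only one column (character). Normally it follows the pattern date, company_name, salary. In some cases, however, the salary is missing.
Is there a way to merge rows by the date pattern? That would let me split them into columns later. The reason I don't want to call pivot_wider straight away is that company name and salary are likely to be mismatched (unbalanced rows). So I think it is best to merge rows by the date pattern, because the date is never missing and always follows the pattern.
Dataset: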
# A tibble: 10 x 1
detail
<chr>
1 26 January 2021
2 NatWest Group - Bristol, BS2 0PT
3 26 January 2021
4 NatWest Group - Manchester, M3 3AQ
5 15 February 2021
6 Brook Street - Liverpool, Merseyside, L21AB
7 £13.84 per hour
8 16 February 2021
9 Anglo Technical Recruitment - London, WC2N 5DU
10 £400.00 per day
dput of the dataset:
structure(list(detail = c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
"26 January 2021", "NatWest Group - Manchester, M3 3AQ", "15 February 2021",
"Brook Street - Liverpool, Merseyside, L21AB", "£13.84 per hour",
"16 February 2021", "Anglo Technical Recruitment - London, WC2N 5DU",
"£400.00 per day")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Expected result:
detail
<chr>
1 26 January 2021 NatWest Group - Bristol, BS2 0PT
2 26 January 2021 NatWest Group - Manchester, M3 3AQ
3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
dput of the expected result:
df <- structure(list(detail = c("26 January 2021 NatWest Group - Bristol, BS2 0PT",
"26 January 2021 NatWest Group - Manchester, M3 3AQ", "15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour",
"16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
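Prefix each line with a tag and then use read.dcf to create a 3-column character matrix, mat. Finally, we convert it to a character vector with one element per logical record, but you may just want to use mat, since that seems like a more useful format.
We assume that dates are in %d %B %Y format (see ?strptime for the percent codes), that salary lines start with £, and that all other lines are address lines.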
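library(dplyr)
# dat is the one-column tibble from the question
mat <- dat %>%
  mutate(detail = case_when(
    # a line that parses as a date starts a new record (blank line before it)
    !is.na(as.Date(detail, "%d %B %Y")) ~ paste("\nDate:", detail),
    grepl("^£", detail) ~ paste("Salary:", detail),
    TRUE ~ paste("Address:", detail))) %>%
  { read.dcf(textConnection(.$detail)) }
# one string per record; drop the trailing NA left by a missing salary
mat %>%
  apply(1, toString) %>%
  sub(", NA$", "", .)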
Update
Simplified the assumptions and the code.
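To make the intermediate step visible, here is a minimal base R sketch of the same tagging idea. It is not the answerer's code: it detects date lines with a regular expression instead of as.Date (an assumption, made so that the example does not depend on locale-specific month names), it builds the blank separator lines explicitly, and the names x, is_date, is_salary, tagged and dcf_lines are only illustrative.

```r
x <- c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
       "26 January 2021", "NatWest Group - Manchester, M3 3AQ",
       "15 February 2021", "Brook Street - Liverpool, Merseyside, L21AB",
       "£13.84 per hour", "16 February 2021",
       "Anglo Technical Recruitment - London, WC2N 5DU", "£400.00 per day")

# Assumption: a date line looks like "<1-2 digits> <month name> <4 digits>".
is_date   <- grepl("^[0-9]{1,2} [A-Za-z]+ [0-9]{4}$", x)
is_salary <- startsWith(x, "£")

# Tag every line with the field it belongs to.
tagged <- ifelse(is_date, paste("Date:", x),
          ifelse(is_salary, paste("Salary:", x), paste("Address:", x)))

# Blank lines separate DCF records, so put one before each date line.
dcf_lines <- unlist(Map(function(line, new_record) {
  if (new_record) c("", line) else line
}, tagged, is_date), use.names = FALSE)

# read.dcf returns one row per record and one column per tag;
# a record without a salary line gets NA in the Salary column.
con <- textConnection(dcf_lines)
mat <- read.dcf(con)
close(con)

mat
```

Printing dcf_lines before the read.dcf call shows the tagged text that is actually parsed, which is a convenient place to check that every line received the tag you expected.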
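Another solution, which assumes that only the first line of each record contains a date. It works regardless of the number of lines between two dates.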
library(tidyverse)
# df is the input tibble; every line that is a date starts a new group
df %>%
  group_by(d = cumsum(str_detect(detail, "^\\d\\d? \\w+ \\d{4}$"))) %>%
  mutate(c = paste0("Col", row_number())) %>%
  pivot_wider(id_cols = d, values_from = detail, names_from = c)
# A tibble: 4 x 4
# Groups: d [4]
d Col1 Col2 Col3
<int> <chr> <chr> <chr>
1 1 26 January 2021 NatWest Group - Bristol, BS2 0PT NA
2 2 26 January 2021 NatWest Group - Manchester, M3 3AQ NA
3 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
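If the goal is the single-column result shown in the question rather than a wide table, the same cumsum() grouping can feed one paste() per group. A minimal base R sketch, assuming the lines are held in a character vector called x and that every record starts with a date line; the regular expression and the names starts_record, record_id and merged are illustrative:

```r
x <- c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
       "26 January 2021", "NatWest Group - Manchester, M3 3AQ",
       "15 February 2021", "Brook Street - Liverpool, Merseyside, L21AB",
       "£13.84 per hour", "16 February 2021",
       "Anglo Technical Recruitment - London, WC2N 5DU", "£400.00 per day")

# TRUE on every line that opens a new record (a date line).
starts_record <- grepl("^[0-9]{1,2} [A-Za-z]+ [0-9]{4}$", x)

# The running count of date lines gives each line its record number.
record_id <- cumsum(starts_record)

# Collapse the lines of each record into one space-separated string.
merged <- vapply(split(x, record_id), paste, character(1),
                 collapse = " ", USE.NAMES = FALSE)

data.frame(detail = merged, stringsAsFactors = FALSE)
```

Because the grouping only depends on where the date lines are, a record with a missing salary simply yields a shorter string.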
Here is a pure data.table approach.
library( data.table )
# make it a data.table (by reference)
setDT( df )
# first, summarise by block (each date line starts a new block) and collapse the text, using @@ as separator
# the pattern is anchored and accepts one- or two-digit days
ans <- df[, .( paste0( detail, collapse = "@@") ),
          by = .(d = cumsum( grepl( "^[0-9]{1,2} [a-zA-Z]+ [0-9]{4}$", detail) ) ) ]
# split the text into columns again on the @@ introduced by the collapse; the number of columns is dynamic
ans[, paste0( "Col", seq_along( tstrsplit(ans$V1, "@@" ))) := tstrsplit( V1, "@@" )][, V1 := NULL ][]
# d Col1 Col2 Col3
# 1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
# 2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
# 3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
# 4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
Here is a data.table approach that uses dcast() and rowid() to reshape to wide format. It returns a data.table with four columns: record number, date, company_name, and salary.
library(data.table)
setDT(df1)[, rn := cumsum(!is.na(lubridate::dmy(detail)))]
dcast(df1, rn ~ rowid(rn, prefix = "Col"), value.var = "detail")
rn Col1 Col2 Col3
1: 1 26 January 2021 NatWest Group - Bristol, BS2 0PT <NA>
2: 2 26 January 2021 NatWest Group - Manchester, M3 3AQ <NA>
3: 3 15 February 2021 Brook Street - Liverpool, Merseyside, L21AB £13.84 per hour
4: 4 16 February 2021 Anglo Technical Recruitment - London, WC2N 5DU £400.00 per day
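To detect the rows that start a new record, i.e. the rows that contain a date, this approach borrows from the answers above.
dcast() allows everything to be packed into a "one-liner" (not counting the library() calls):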
library(data.table)
library(lubridate)
dcast(setDT(df1), cumsum(!is.na(dmy(detail))) ~ rowid(cumsum(!is.na(dmy(detail))), prefix = "Col"),
value.var = "detail")
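To show what the rowid() and dcast() combination is doing, here is a dependency-free base R sketch of the same reshaping. It is an illustration rather than the answerer's method: it uses a regular expression in place of lubridate::dmy() to find date lines (an assumption), it returns a character matrix instead of a data.table, and the names x, rn, pos and wide are illustrative.

```r
x <- c("26 January 2021", "NatWest Group - Bristol, BS2 0PT",
       "26 January 2021", "NatWest Group - Manchester, M3 3AQ",
       "15 February 2021", "Brook Street - Liverpool, Merseyside, L21AB",
       "£13.84 per hour", "16 February 2021",
       "Anglo Technical Recruitment - London, WC2N 5DU", "£400.00 per day")

# Record number: incremented at every date line.
rn <- cumsum(grepl("^[0-9]{1,2} [A-Za-z]+ [0-9]{4}$", x))

# Position of each line inside its record (1, 2, 3, ...),
# which is the role rowid(rn) plays in the data.table code.
pos <- ave(seq_along(x), rn, FUN = seq_along)

# One row per record, one column per position; unfilled cells stay NA.
wide <- matrix(NA_character_, nrow = max(rn), ncol = max(pos),
               dimnames = list(NULL, paste0("Col", seq_len(max(pos)))))
wide[cbind(rn, pos)] <- x

wide
```

Note that the columns are positional: Col3 holds whatever the third line of a record is, so this layout is only reliable when a missing field is always the last one in its record.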