Spreading character string into multiple columns with strsplit pattern matching
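This is my first time scraping text from a PDF document. I am showing the data in the format I think is most useful for what I am doing, but I could be wrong. After cleaning the PDF text, I formatted it as a tibble (shown below).
I tried using strsplit(dmt, "\\s+") to split the strings into three separate columns, but that just split everything apart at every space. I had also used str_squish() to get rid of the extra whitespace in the middle text portion of the string, but that didn't help with the pattern matching.
The first, numeric part of the string sometimes ends in a ) and sometimes in a number. Here is what I am working with: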
dmt
# A tibble: 612 x 1
datamatrixtest[,1]
<chr>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
# ... with 602 more rows
Ideally, I could use strsplit and accurate pattern matching to put the data into three separate columns.
dmt
# A tibble: 612 x 3
statute offense cases
<chr> <chr> <num>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
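I am assuming that your data is basically as presented, with multiple spaces between the columns. In other words, check that your dmt is comparable to the one I create below. In that case, we can split each line wherever there is a run of two or more whitespace characters, using \s{2,} like this. If your data is not like this, or if any single field happens to contain multiple consecutive spaces, then provide a sample using dput and head so that we can find a more precise pattern that will work.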
library(tidyverse)

# Recreate the sample; fields are separated by two or more spaces.
dmt <- read_lines(
"110.05  Human Service Vehicle Inspection Reqd  6
23.33(12)(b)  ATV-Fail/Stop for Law Enforce. Official  1
23.33(6)(a)  ATV-Fail/Display Lighted Headlamp  1
341.03  Oper Veh After Sus/Rev or Can of Reg  8,862
341.04(1)  Non-Registration of Vehicle  10,125
341.04(2)  Improper Registration of Vehicle  4
341.15(1)  Fail/Display Vehicle License Plates  2,010
341.15(1m)(a)  Fail/Attach Rear Regis. Decal/Tag  3
341.15(1m)(b)  Fail/Attach Front Regis. Decal/Tag  2
1341.15(2)  Improperly Attached License Plates  7"
) %>%
  enframe(name = NULL, value = "line")

# Split on runs of 2+ whitespace characters (note the doubled backslash
# in the R string), then strip thousands separators before converting.
dmt %>%
  separate(line, c("statute", "offense", "cases"), sep = "\\s{2,}") %>%
  mutate(cases = cases %>% str_remove_all(",") %>% as.integer())
#> # A tibble: 10 x 3
#> statute offense cases
#> <chr> <chr> <int>
#> 1 110.05 Human Service Vehicle Inspection Reqd 6
#> 2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
#> 3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
#> 4 341.03 Oper Veh After Sus/Rev or Can of Reg 8862
#> 5 341.04(1) Non-Registration of Vehicle 10125
#> 6 341.04(2) Improper Registration of Vehicle 4
#> 7 341.15(1) Fail/Display Vehicle License Plates 2010
#> 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
#> 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
#> 10 1341.15(2) Improperly Attached License Plates 7
Created on 2019-09-23 by the reprex package (v0.3.0)
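Since the question asks about strsplit specifically, here is a minimal base R sketch of the same idea. It assumes, as above, that fields are separated by two or more spaces; the three sample lines and the names lines, parts and res are only for illustration.

```r
# Hypothetical sample lines; fields are assumed to be separated by 2+ spaces.
lines <- c(
  "110.05  Human Service Vehicle Inspection Reqd  6",
  "341.03  Oper Veh After Sus/Rev or Can of Reg  8,862",
  "341.15(1m)(a)  Fail/Attach Rear Regis. Decal/Tag  3"
)

# Split on runs of two or more whitespace characters, so the single
# spaces inside the offense text are left alone.
parts <- strsplit(lines, "\\s{2,}")

# Every line should yield exactly three pieces; stop early otherwise.
stopifnot(all(lengths(parts) == 3))

# Bind the pieces row-wise into a data frame and name the columns.
res <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(res) <- c("statute", "offense", "cases")

# Drop thousands separators before converting to integer.
res$cases <- as.integer(gsub(",", "", res$cases, fixed = TRUE))
```

The length check is there because strsplit silently returns fewer or more pieces when a line does not follow the assumed spacing, and rbind would then recycle values instead of failing.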
Based on your data, we can also use tidyr::extract by defining different capture groups.
library(dplyr)
library(tidyr)

# Backslashes must be doubled inside an R string literal.
df %>%
  extract(datamatrixtest, into = c("statute", "offense", "cases"),
          regex = "(.*?)\\s(.*?)(\\d.*)") %>%
  mutate_all(trimws)
# statute offense cases
# <chr> <chr> <chr>
# 1 110.05 Human Service Vehicle Inspection Reqd 6
# 2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
# 3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
# 4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
# 5 341.04(1) Non-Registration of Vehicle 10,125
# 6 341.04(2) Improper Registration of Vehicle 4
# 7 341.15(1) Fail/Display Vehicle License Plates 2,010
# 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
# 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
#10 1341.15(2) Improperly Attached License Plates 7
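Here we define three groups: the first runs from the start of the text until the first whitespace character, the second starts where the first ends and runs until a digit is encountered, and the third starts at that digit and runs to the end of the string.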
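One caveat with the groups described above is that the second group stops at the first digit, so an offense description that itself contains a digit would be cut short. A possible variant (not part of the answer above, just a sketch in base R) anchors the count at the end of the string instead. The pattern and the names x, pattern, m and out are my own assumptions for illustration.

```r
# Two sample strings taken from the data, single-spaced.
x <- c(
  "23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1",
  "341.04(1) Non-Registration of Vehicle 10,125"
)

# Group 1: leading run of non-whitespace (the statute).
# Group 2: everything in between, matched lazily (the offense).
# Group 3: trailing digits and commas anchored at the end (the cases).
pattern <- "^(\\S+)\\s+(.*?)\\s+([0-9,]+)$"

# regexec() returns the full match plus one entry per capture group.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match and stack the three groups into a character matrix.
out <- do.call(rbind, lapply(m, function(g) g[-1]))
colnames(out) <- c("statute", "offense", "cases")
```

Because the pattern relies on the start and end anchors rather than on the width of the gaps, it should also work after str_squish() has collapsed the spacing, as long as every line ends in a count.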
data
df <- structure(list(datamatrixtest = c(
"110.05 Human Service Vehicle Inspection Reqd 6",
"23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1",
"23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1",
"341.03 Oper Veh After Sus/Rev or Can of Reg 8,862",
"341.04(1) Non-Registration of Vehicle 10,125",
"341.04(2) Improper Registration of Vehicle 4",
"341.15(1) Fail/Display Vehicle License Plates 2,010",
"341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3",
"341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2",
"1341.15(2) Improperly Attached License Plates 7"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))