How can I get regular expressions to select appropriate street names using stringr in R?

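I've just started working with regular expressions (using the stringr package), and some code I wrote isn't doing exactly what I want. I'm working with a dataset that contains some very messy string data, and I'm trying to clean it up for use with the Google Maps API.

I've attached a sample of the data below.

Basically, I want to select every row where loc_01 is a simple street name. By this I mean I want it to be in one of the following formats:

Numbered streets, e.g. 10th Ave; named streets, e.g. MAIN ST; and any directional modification of such street names (e.g. 10TH AVE NW, W MAIN ST, or W 10TH AVE).

I have tried the following expression: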

df %>% filter(str_detect(loc_01, "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$"))

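However, this gives me output such as 10 AVE 1300 BLK E, which is not an observation I want to select. I interpreted the regular expression as:

Clearly my interpretation is wrong, since I get things like 10 AVE 1300 BLK E. What would be the correct regular expression to get the result I want in this case?

Thanks very much for your help!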

df <- structure(list(ID = c("387", "404", "422", "425", "432", "443", 
"526", "536", "580", "658", "665", "666", "735", "880", "910", 
"911", "912", "913", "916", "917", "972", "1098", "1194", "1231", 
"1298", "1309", "1310", "1311", "1312", "1316", "1328", "1354", 
"1371", "1373", "1374", "1376", "1381", "1388", "1389", "1390", 
"1391", "1392", "1393", "1406", "1407", "1408", "1409", "1410", 
"1411", "1412", "1413", "1414", "1418", "1420", "1422", "1429", 
"1430", "1433", "1434", "1437", "1441", "1442", "1443", "1444", 
"1445", "1448", "1451", "1452", "1453", "1454", "1455", "1457", 
"1461", "1462", "1463", "1464", "1466", "1468", "1470", "1471", 
"1473", "1479", "1480", "1481", "1486", "1489", "1490", "1493", 
"1495", "1496", "1498", "1502", "1503", "1509", "1511", "1512", 
"1513", "1517", "1", "2"), city = c("DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER", 
"DENVER", "DENVER", "DENVER", "DENVER", "DENVER", "DENVER"), 
    loc_01 = c("#50 S KALAMATH ST", "00 BLKS BRYANT CANOSA", 
    "000 BLK ALLEY", "000 BLK BROADWAY", "000 BLK E 11TH AV", 
    "000 BLK E 17TH", "000 BLK S BROADWAY", "000 BLK S IRVING JULIAN", 
    "000 BLK W ALAMEDA AV", "10 AVE 1300 BLK E", "100 BLK ALLEY N BROADWAY/N ACOMA", 
    "100 BLK ALLEY S", "100 BLK N WASHINGTON ST", "1000 ALLEY LINCOLN/BROADWAY", 
    "1000 BLK ALLEY CHEROKEE/DELAWARE", "1000 BLK ALLEY GRANT", 
    "1000 BLK ALLEY MARTIN/LAFAYETT", "1000 BLK ALLEY MONROE/GARFIELD", 
    "1000 BLK ALLEY OGDEN", "1000 BLK ALLEY S GAYLORD ST", "1000 BLK E GAY", 
    "1000 BLK S VINE/GAYLORD ALLEY", "1010 CURTIS ST", "1050 ODELL ST", 
    "109TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", 
    "10TH AVE", "10TH AVE", "10TH AVE", "10TH AVE", "E 10TH AVE", 
    "MAIN ST NW"), link = c("", "", "", "00125FN", "00025FW", 
    "AT", "", "00050FS", "00005FW", "00100FE", "00043FN", "", 
    "", "", "", "AT", "", "00120FS", "", "00070FN", "", "00200FS", 
    "", "", "00020FS", "09999FN", "AT", "AT", "AT", "AT", "AT", 
    "AT", "AT", "00080FW", "00175FW", "AT", "00101FW", "AT", 
    "AT", "AT", "AT", "AT", "AT", "00060FE", "00120FS", "AT", 
    "AT", "AT", "AT", "00015FW", "00035FW", "00075FW", "00022FE", 
    "00144FW", "00250FE", "AT", "AT", "00037FW", "00100FE", "00200FW", 
    "AT", "AT", "00084FW", "00100FW", "AT", "00100FN", "AT", 
    "AT", "AT", "AT", "AT", "00100FW", "00068FE", "00136FE", 
    "00200FE", "00150FW", "AT", "00020FE", "00020FW", "00030FE", 
    "00045FW", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "00163FE", 
    "AT", "AT", "AT", "AT", "AT", "AT", "AT", "00100FE", "00020FW", 
    "", ""), loc_02 = c("N SIDE OF BLDG", "ALLEY", "KNOX CT/KING ST", 
    "IRVINGTON PL", "N LINDA ST", "POLE 94 79", "PKG METER BS-46", 
    "ALLEY", "POLE 844/005", "MARION ST N", "W 1ST AV", "MEADE/NEWTON", 
    "E 1ST AVE", "10 AV", "W 11TH AVE", "LOGAN ST", "E 11TH AV", 
    "E 11TH AVE", "CORONA ST", "E MISSISSIPPI AVE", "CORONA ST", 
    "E TENNESSEE AVE", "31ST STREET", "E 11TH AVE", "QUEENSBURG ST", 
    "10TH AVE 1407 E", "10TH AVE 2900 BLK W", "10TH AVE 300 BLK E", 
    "10TH AVE 3200 BLK W", "1295 W", "2900 W", "500 E", "ACOMA / BANNOCK ALLEY", 
    "ACOMA ST", "ACOMA ST", "ADAMS ST", "BANNOCK ST", "BANNOCK ST", 
    "BANNOCK ST", "BANNOCK ST", "BANNOCK ST", "BANNOCK ST", "BANNOCK ST N", 
    "BROADWAY ST", "BROADWAY ST", "BROADWAY ST", "BROADWAY ST", 
    "BROADWAY ST", "BROADWAY ST", "BROADWAY ST N", "BRYANT ST", 
    "BRYANT ST", "CLARKSON ST", "CLARKSON ST", "CLARKSON ST", 
    "CLARKSON ST", "CLARKSON ST", "CORONA ST", "CORONA ST", "CORONA ST", 
    "CORONA ST", "CORONA ST", "CORONA ST N", "DECATUR ST", "DECATUR ST", 
    "DOWNING ST", "DOWNING ST", "DOWNING ST", "DOWNING ST", "DOWNING ST", 
    "DOWNING ST", "DOWNING ST N", "FEDERAL BLVD", "FEDERAL BLVD", 
    "FEDERAL BLVD", "GALAPAGO ST", "GARFIELD ST", "GRANT ST", 
    "GRANT ST", "GRANT ST", "GRANT ST", "GRANT ST", "GRANT ST", 
    "GRANT ST", "GROVE ST", "GROVE ST", "GROVE ST", "HOOKER ST", 
    "HUMBOLDT ST", "HUMBOLDT ST", "INCA ST", "KALAMATH ST", "KALAMATH ST", 
    "KNOX CT", "KNOX CT", "KNOX CT", "LAFAYETTE ST", "LINCOLN ST", 
    "MAIN ST", "100TH AVE")), row.names = c(NA, -100L), class = "data.frame")

One approach is to add an additional filter statement (though I'm sure there are better ways).

library(tidyverse)

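df %>%
    # keep rows matched by the original pattern (backslashes doubled for R strings)
    filter(str_detect(loc_01, "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$")) %>%
    # then drop block-style addresses
    filter(!str_detect(loc_01, 'BLK'))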

Output

     ID   city            loc_01    link                loc_02
1   387 DENVER #50 S KALAMATH ST                N SIDE OF BLDG
2  1194 DENVER    1010 CURTIS ST                   31ST STREET
3  1231 DENVER     1050 ODELL ST                    E 11TH AVE
4  1298 DENVER         109TH AVE 00020FS         QUEENSBURG ST
5  1309 DENVER          10TH AVE 09999FN       10TH AVE 1407 E
6  1310 DENVER          10TH AVE      AT   10TH AVE 2900 BLK W
7  1311 DENVER          10TH AVE      AT    10TH AVE 300 BLK E
8  1312 DENVER          10TH AVE      AT   10TH AVE 3200 BLK W
9  1316 DENVER          10TH AVE      AT                1295 W
10 1328 DENVER          10TH AVE      AT                2900 W
11 1354 DENVER          10TH AVE      AT                 500 E
12 1371 DENVER          10TH AVE      AT ACOMA / BANNOCK ALLEY
13 1373 DENVER          10TH AVE 00080FW              ACOMA ST
14 1374 DENVER          10TH AVE 00175FW              ACOMA ST
15 1376 DENVER          10TH AVE      AT              ADAMS ST
16 1381 DENVER          10TH AVE 00101FW            BANNOCK ST
17 1388 DENVER          10TH AVE      AT            BANNOCK ST
18 1389 DENVER          10TH AVE      AT            BANNOCK ST
19 1390 DENVER          10TH AVE      AT            BANNOCK ST
20 1391 DENVER          10TH AVE      AT            BANNOCK ST
21 1392 DENVER          10TH AVE      AT            BANNOCK ST
22 1393 DENVER          10TH AVE      AT          BANNOCK ST N
23 1406 DENVER          10TH AVE 00060FE           BROADWAY ST
24 1407 DENVER          10TH AVE 00120FS           BROADWAY ST
25 1408 DENVER          10TH AVE      AT           BROADWAY ST
26 1409 DENVER          10TH AVE      AT           BROADWAY ST
27 1410 DENVER          10TH AVE      AT           BROADWAY ST
28 1411 DENVER          10TH AVE      AT           BROADWAY ST
29 1412 DENVER          10TH AVE 00015FW         BROADWAY ST N
30 1413 DENVER          10TH AVE 00035FW             BRYANT ST
31 1414 DENVER          10TH AVE 00075FW             BRYANT ST
32 1418 DENVER          10TH AVE 00022FE           CLARKSON ST
33 1420 DENVER          10TH AVE 00144FW           CLARKSON ST
34 1422 DENVER          10TH AVE 00250FE           CLARKSON ST
35 1429 DENVER          10TH AVE      AT           CLARKSON ST
36 1430 DENVER          10TH AVE      AT           CLARKSON ST
37 1433 DENVER          10TH AVE 00037FW             CORONA ST
38 1434 DENVER          10TH AVE 00100FE             CORONA ST
39 1437 DENVER          10TH AVE 00200FW             CORONA ST
40 1441 DENVER          10TH AVE      AT             CORONA ST
41 1442 DENVER          10TH AVE      AT             CORONA ST
42 1443 DENVER          10TH AVE 00084FW           CORONA ST N
43 1444 DENVER          10TH AVE 00100FW            DECATUR ST
44 1445 DENVER          10TH AVE      AT            DECATUR ST
45 1448 DENVER          10TH AVE 00100FN            DOWNING ST
46 1451 DENVER          10TH AVE      AT            DOWNING ST
47 1452 DENVER          10TH AVE      AT            DOWNING ST
48 1453 DENVER          10TH AVE      AT            DOWNING ST
49 1454 DENVER          10TH AVE      AT            DOWNING ST
50 1455 DENVER          10TH AVE      AT            DOWNING ST
51 1457 DENVER          10TH AVE 00100FW          DOWNING ST N
52 1461 DENVER          10TH AVE 00068FE          FEDERAL BLVD
53 1462 DENVER          10TH AVE 00136FE          FEDERAL BLVD
54 1463 DENVER          10TH AVE 00200FE          FEDERAL BLVD
55 1464 DENVER          10TH AVE 00150FW           GALAPAGO ST
56 1466 DENVER          10TH AVE      AT           GARFIELD ST
57 1468 DENVER          10TH AVE 00020FE              GRANT ST
58 1470 DENVER          10TH AVE 00020FW              GRANT ST
59 1471 DENVER          10TH AVE 00030FE              GRANT ST
60 1473 DENVER          10TH AVE 00045FW              GRANT ST
61 1479 DENVER          10TH AVE      AT              GRANT ST
62 1480 DENVER          10TH AVE      AT              GRANT ST
63 1481 DENVER          10TH AVE      AT              GRANT ST
64 1486 DENVER          10TH AVE      AT              GROVE ST
65 1489 DENVER          10TH AVE      AT              GROVE ST
66 1490 DENVER          10TH AVE      AT              GROVE ST
67 1493 DENVER          10TH AVE      AT             HOOKER ST
68 1495 DENVER          10TH AVE 00163FE           HUMBOLDT ST
69 1496 DENVER          10TH AVE      AT           HUMBOLDT ST
70 1498 DENVER          10TH AVE      AT               INCA ST
71 1502 DENVER          10TH AVE      AT           KALAMATH ST
72 1503 DENVER          10TH AVE      AT           KALAMATH ST
73 1509 DENVER          10TH AVE      AT               KNOX CT
74 1511 DENVER          10TH AVE      AT               KNOX CT
75 1512 DENVER          10TH AVE      AT               KNOX CT
76 1513 DENVER          10TH AVE 00100FE          LAFAYETTE ST
77 1517 DENVER          10TH AVE 00020FW            LINCOLN ST
78    1 DENVER        E 10TH AVE                       MAIN ST
79    2 DENVER        MAIN ST NW                     100TH AVE
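
One likely reason the original pattern lets through rows such as 10 AVE 1300 BLK E is that `|` has the lowest precedence in a regular expression, so the pattern is read as three separate alternatives: `^(\w+)?(\s)?.*(\s)AVE`, `ST`, and `BLVD(\w+)?$`. Only the first is anchored at the start and only the last at the end. The sketch below illustrates this on a handful of values taken from the sample data and compares it with a grouped, fully anchored pattern. The names `dirs` and `simple_street`, and the definition of "simple" (optional direction, one name token, a street type, optional direction), are assumptions made for illustration, not something stated in the question.

```r
library(stringr)

x <- c("10 AVE 1300 BLK E", "10TH AVE", "E 10TH AVE", "MAIN ST NW",
       "W MAIN ST", "1010 CURTIS ST", "000 BLK BROADWAY", "#50 S KALAMATH ST")

# Original pattern: `|` splits it into three independent alternatives,
# so any string containing " AVE" or "ST" anywhere is accepted.
original <- "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$"
original_hits <- str_detect(x, original)
original_hits
# TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE

# Hypothetical stricter pattern (assumed definition of a "simple" street name):
#   optional leading direction, one name token, a street type,
#   optional trailing direction, anchored at both ends.
dirs <- "(?:N|S|E|W|NE|NW|SE|SW)"
simple_street <- paste0(
  "^(?:", dirs, "\\s)?",        # optional leading direction, e.g. "E "
  "\\w+\\s(?:AVE|ST|BLVD)",     # name token followed by a grouped street type
  "(?:\\s", dirs, ")?$"         # optional trailing direction, e.g. " NW"
)
strict_hits <- str_detect(x, simple_street)
strict_hits
# FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
```

Under this assumed definition, addresses with a house number (1010 CURTIS ST) and multi-word street names are rejected as well, so the name-token part would need loosening if those should count as simple street names.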

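If multiple strings are causing problems, you can collect them in a list and use it in the second filter statement. So, if you also want to remove the row containing #50:
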
remove.list <- paste(c("#", "BLK"), collapse = '|')

df %>%
    filter(str_detect(loc_01, "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$")) %>%
    filter(!str_detect(loc_01, remove.list))

Output (first 10 rows)

     ID   city         loc_01    link                loc_02
1  1194 DENVER 1010 CURTIS ST                   31ST STREET
2  1231 DENVER  1050 ODELL ST                    E 11TH AVE
3  1298 DENVER      109TH AVE 00020FS         QUEENSBURG ST
4  1309 DENVER       10TH AVE 09999FN       10TH AVE 1407 E
5  1310 DENVER       10TH AVE      AT   10TH AVE 2900 BLK W
6  1311 DENVER       10TH AVE      AT    10TH AVE 300 BLK E
7  1312 DENVER       10TH AVE      AT   10TH AVE 3200 BLK W
8  1316 DENVER       10TH AVE      AT                1295 W
9  1328 DENVER       10TH AVE      AT                2900 W
10 1354 DENVER       10TH AVE      AT                 500 E
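
The exclusion pattern built with paste() is simply an alternation of the listed fragments. The following sketch shows what it evaluates to and how it behaves on a few values from the sample data; the vector `x` and the name `kept` are made up for illustration. Because the fragments are inserted into the pattern as regex, any fragment containing a metacharacter (for example `.` or `+`) would need to be escaped first; the two fragments used here do not.

```r
library(stringr)

# Collapse the unwanted fragments into one alternation
remove.list <- paste(c("#", "BLK"), collapse = "|")
remove.list
# "#|BLK"

x <- c("#50 S KALAMATH ST", "10 AVE 1300 BLK E", "10TH AVE", "MAIN ST NW")

# Keep only the values that contain none of the fragments
kept <- x[!str_detect(x, remove.list)]
kept
# "10TH AVE" "MAIN ST NW"
```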

To filter loc_02, we can add another filter statement that keeps only the rows that start with digits and end with a direction.

df %>%
  filter(str_detect(loc_01, "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$")) %>%
  filter(!str_detect(loc_01, 'BLK')) %>%
  filter(str_detect(loc_02, "^[[:digit:]]+( N| S| E| W| NE| NW| SE| SW)$"))

# Or, with the direction alternatives stored in a variable
# (direction_abbrev is not defined in this snippet):
# df %>%
#   filter(str_detect(loc_01, "^(\\w+)?(\\s)?.*(\\s)AVE|ST|BLVD(\\w+)?$")) %>%
#   filter(!str_detect(loc_01, 'BLK')) %>%
#   filter(str_detect(loc_02, paste("^\\d+(\\s)", "(", direction_abbrev, ")","$", sep = "")))

Output

    ID   city   loc_01 link loc_02
1 1316 DENVER 10TH AVE   AT 1295 W
2 1328 DENVER 10TH AVE   AT 2900 W
3 1354 DENVER 10TH AVE   AT  500 E
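
The commented-out variant relies on direction_abbrev, which the snippet never defines. A minimal sketch, assuming it is meant to be a `|`-separated string of compass abbreviations (the definition below and the name `loc_02_pattern` are assumptions, and the test values are taken from the loc_02 column):

```r
library(stringr)

# Assumed definition; the original snippet does not show it
direction_abbrev <- paste(c("N", "S", "E", "W", "NE", "NW", "SE", "SW"),
                          collapse = "|")

# Digits, one whitespace character, then exactly one direction,
# anchored at both ends of the string
loc_02_pattern <- paste0("^\\d+\\s(", direction_abbrev, ")$")

x <- c("1295 W", "2900 W", "500 E", "10TH AVE 1407 E", "BANNOCK ST N", "10 AV")
loc_02_hits <- str_detect(x, loc_02_pattern)
loc_02_hits
# TRUE TRUE TRUE FALSE FALSE FALSE
```

Keeping the alternatives inside one group means the `^` and `$` anchors apply to every direction, which avoids the precedence problem that affects the loc_01 pattern.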