Unexpected behavior by str_remove_* in stringr package

I am working on a set of fairly simple tasks that make heavy use of stringr. One of them is removing a specific pattern from a string.

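Below is a toy example with the columns temp and current_house. I want to remove the pattern given in current_house from temp and store the result in a new column, say temp2. For some observations str_remove() does not seem to work. I have also tried str_remove_all(), without success.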

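What am I missing? It should not be related to the number of tokens in the search pattern, because multi-word patterns are removed successfully.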

library(data.table)
library(stringr)


head(df)
#>            temp current_house
#> 1: Lazard 528 2        Lazard
#> 2:   KPMG 525 1          KPMG
#> 3:   KPMG 525 1          KPMG
#> 4:   KPMG 524 4          KPMG
#> 5:   KPMG 524 4          KPMG
#> 6:   KPMG 524 4          KPMG

# adding the new column temp2 by removing the pattern current_house
df[ , temp2 := str_remove(temp, current_house)]

df
#>                                                       temp
#>  1:                                           Lazard 528 2
#>  2:                                             KPMG 525 1
#>  3:                                             KPMG 525 1
#>  4:                                             KPMG 524 4
#>  5:                                             KPMG 524 4
#>  6:                                             KPMG 524 4
#>  7:                                             KPMG 524 4
#>  8: Development and Investment Bank of Turkey (TKYB) 524 4
#>  9: Development and Investment Bank of Turkey (TKYB) 524 4
#> 10: Development and Investment Bank of Turkey (TKYB) 524 4
#> 11: Development and Investment Bank of Turkey (TKYB) 524 4
#> 12: Development and Investment Bank of Turkey (TKYB) 524 4
#> 13: Development and Investment Bank of Turkey (TKYB) 524 4
#> 14: Development and Investment Bank of Turkey (TKYB) 524 4
#> 15: Development and Investment Bank of Turkey (TKYB) 524 4
#> 16:                                         Investec 520 4
#> 17:                         Numis Securities Limited 520 2
#> 18:                         Numis Securities Limited 520 1
#> 19:                                JPMorgan Cazenove 520 1
#> 20:                                JPMorgan Cazenove 520 1
#> 21:                  Fenchurch Advisory Partners LLP 520 1
#> 22:                  Fenchurch Advisory Partners LLP 520 1
#> 23:                  Fenchurch Advisory Partners LLP 520 1
#> 24:                                              EY 518 16
#> 25:                                             KPMG 508 1
#> 26:                                             KPMG 508 1
#> 27:           Capitalmind Corporate Finance Advisory 502 2
#> 28:           Capitalmind Corporate Finance Advisory 502 1
#> 29:             Daiwa Securities Group / DC Advisory 500 3
#> 30:                           LionTree Advisors, LLC 500 1
#> 31:                           LionTree Advisors, LLC 500 1
#> 32:                      Ping'an Securities Co.,Ltd. 496 1
#> 33:                      Ping'an Securities Co.,Ltd. 496 1
#> 34:                      Ping'an Securities Co.,Ltd. 496 1
#> 35:                      Ping'an Securities Co.,Ltd. 496 1
#> 36:                      Ping'an Securities Co.,Ltd. 496 1
#> 37:                      Ping'an Securities Co.,Ltd. 496 1
#> 38:                   Guotai Junan Securities Co Ltd 496 1
#> 39:                   Guotai Junan Securities Co Ltd 496 1
#> 40:                                              EY 493 16
#>                                                       temp
#>                                        current_house
#>  1:                                           Lazard
#>  2:                                             KPMG
#>  3:                                             KPMG
#>  4:                                             KPMG
#>  5:                                             KPMG
#>  6:                                             KPMG
#>  7:                                             KPMG
#>  8: Development and Investment Bank of Turkey (TKYB)
#>  9: Development and Investment Bank of Turkey (TKYB)
#> 10: Development and Investment Bank of Turkey (TKYB)
#> 11: Development and Investment Bank of Turkey (TKYB)
#> 12: Development and Investment Bank of Turkey (TKYB)
#> 13: Development and Investment Bank of Turkey (TKYB)
#> 14: Development and Investment Bank of Turkey (TKYB)
#> 15: Development and Investment Bank of Turkey (TKYB)
#> 16:                                         Investec
#> 17:                         Numis Securities Limited
#> 18:                         Numis Securities Limited
#> 19:                                JPMorgan Cazenove
#> 20:                                JPMorgan Cazenove
#> 21:                  Fenchurch Advisory Partners LLP
#> 22:                  Fenchurch Advisory Partners LLP
#> 23:                  Fenchurch Advisory Partners LLP
#> 24:                                               EY
#> 25:                                             KPMG
#> 26:                                             KPMG
#> 27:           Capitalmind Corporate Finance Advisory
#> 28:           Capitalmind Corporate Finance Advisory
#> 29:             Daiwa Securities Group / DC Advisory
#> 30:                           LionTree Advisors, LLC
#> 31:                           LionTree Advisors, LLC
#> 32:                      Ping'an Securities Co.,Ltd.
#> 33:                      Ping'an Securities Co.,Ltd.
#> 34:                      Ping'an Securities Co.,Ltd.
#> 35:                      Ping'an Securities Co.,Ltd.
#> 36:                      Ping'an Securities Co.,Ltd.
#> 37:                      Ping'an Securities Co.,Ltd.
#> 38:                   Guotai Junan Securities Co Ltd
#> 39:                   Guotai Junan Securities Co Ltd
#> 40:                                               EY
#>                                        current_house
#>                                                      temp2
#>  1:                                                  528 2
#>  2:                                                  525 1
#>  3:                                                  525 1
#>  4:                                                  524 4
#>  5:                                                  524 4
#>  6:                                                  524 4
#>  7:                                                  524 4
#>  8: Development and Investment Bank of Turkey (TKYB) 524 4
#>  9: Development and Investment Bank of Turkey (TKYB) 524 4
#> 10: Development and Investment Bank of Turkey (TKYB) 524 4
#> 11: Development and Investment Bank of Turkey (TKYB) 524 4
#> 12: Development and Investment Bank of Turkey (TKYB) 524 4
#> 13: Development and Investment Bank of Turkey (TKYB) 524 4
#> 14: Development and Investment Bank of Turkey (TKYB) 524 4
#> 15: Development and Investment Bank of Turkey (TKYB) 524 4
#> 16:                                                  520 4
#> 17:                                                  520 2
#> 18:                                                  520 1
#> 19:                                                  520 1
#> 20:                                                  520 1
#> 21:                                                  520 1
#> 22:                                                  520 1
#> 23:                                                  520 1
#> 24:                                                 518 16
#> 25:                                                  508 1
#> 26:                                                  508 1
#> 27:                                                  502 2
#> 28:                                                  502 1
#> 29:                                                  500 3
#> 30:                                                  500 1
#> 31:                                                  500 1
#> 32:                                                  496 1
#> 33:                                                  496 1
#> 34:                                                  496 1
#> 35:                                                  496 1
#> 36:                                                  496 1
#> 37:                                                  496 1
#> 38:                                                  496 1
#> 39:                                                  496 1
#> 40:                                                 493 16
#>                                                      temp2

Created on 2022-01-20 by the reprex package (v2.0.1)

The toy data is given below.

df = structure(list(temp = c("Lazard 528 2", "KPMG 525 1", "KPMG 525 1", 
                             "KPMG 524 4", "KPMG 524 4", "KPMG 524 4", "KPMG 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Development and Investment Bank of Turkey (TKYB) 524 4", 
                             "Development and Investment Bank of Turkey (TKYB) 524 4", "Investec 520 4", 
                             "Numis Securities Limited 520 2", "Numis Securities Limited 520 1", 
                             "JPMorgan Cazenove 520 1", "JPMorgan Cazenove 520 1", "Fenchurch Advisory Partners LLP 520 1", 
                             "Fenchurch Advisory Partners LLP 520 1", "Fenchurch Advisory Partners LLP 520 1", 
                             "EY 518 16", "KPMG 508 1", "KPMG 508 1", "Capitalmind Corporate Finance Advisory 502 2", 
                             "Capitalmind Corporate Finance Advisory 502 1", "Daiwa Securities Group / DC Advisory 500 3", 
                             "LionTree Advisors, LLC 500 1", "LionTree Advisors, LLC 500 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Ping'an Securities Co.,Ltd. 496 1", "Ping'an Securities Co.,Ltd. 496 1", 
                             "Guotai Junan Securities Co Ltd 496 1", "Guotai Junan Securities Co Ltd 496 1", 
                             "EY 493 16"), 
                    current_house = c("Lazard", "KPMG", "KPMG", "KPMG", 
                                      "KPMG", "KPMG", "KPMG", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Development and Investment Bank of Turkey (TKYB)", 
                                      "Development and Investment Bank of Turkey (TKYB)", "Investec", 
                                      "Numis Securities Limited", "Numis Securities Limited", "JPMorgan Cazenove", 
                                      "JPMorgan Cazenove", "Fenchurch Advisory Partners LLP", "Fenchurch Advisory Partners LLP", 
                                      "Fenchurch Advisory Partners LLP", "EY", "KPMG", "KPMG", "Capitalmind Corporate Finance Advisory", 
                                      "Capitalmind Corporate Finance Advisory", "Daiwa Securities Group / DC Advisory", 
                                      "LionTree Advisors, LLC", "LionTree Advisors, LLC", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Ping'an Securities Co.,Ltd.", 
                                      "Ping'an Securities Co.,Ltd.", "Guotai Junan Securities Co Ltd", 
                                      "Guotai Junan Securities Co Ltd", "EY")), row.names = c(NA, -40L
                                      ), 
               class = c("data.table", "data.frame"))

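The parentheses in current_house are interpreted as a regular-expression group, so a pattern ending in "(TKYB)" looks for the text TKYB without parentheses and never matches the literal "(TKYB)" in temp. Wrap the pattern in stringr::fixed() to match it literally: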

setDT(df)
df[, temp2 := str_remove(temp, current_house)           # initial, not working
  ][, temp3 := str_remove(temp, fixed(current_house))   # working
  ][]
#                                        temp                           current_house                                   temp2   temp3
#                                      <char>                                  <char>                                  <char>  <char>
#  1:                            Lazard 528 2                                  Lazard                                   528 2   528 2
#  2:                              KPMG 525 1                                    KPMG                                   525 1   525 1
#  3:                              KPMG 525 1                                    KPMG                                   525 1   525 1
#  4:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  5:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  6:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  7:                              KPMG 524 4                                    KPMG                                   524 4   524 4
#  8: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
#  9: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
# 10: Development and Investment Bank of T... Development and Investment Bank of T... Development and Investment Bank of T...   524 4
# ---                                                                                                                                
# 31:            LionTree Advisors, LLC 500 1                  LionTree Advisors, LLC                                   500 1   500 1
# 32:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 33:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 34:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 35:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 36:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 37:       Ping'an Securities Co.,Ltd. 496 1             Ping'an Securities Co.,Ltd.                                   496 1   496 1
# 38:    Guotai Junan Securities Co Ltd 496 1          Guotai Junan Securities Co Ltd                                   496 1   496 1
# 39:    Guotai Junan Securities Co Ltd 496 1          Guotai Junan Securities Co Ltd                                   496 1   496 1
# 40:                               EY 493 16                                      EY                                  493 16  493 16
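
To see the mechanism in isolation, here is a minimal sketch on a single string. The shortened bank name and the variable names x, pat, regex_result and fixed_result are made up for illustration; only str_remove() and fixed() come from stringr.

```r
library(stringr)

# Hypothetical shortened example; the real data uses the full bank name
x   <- "Bank of Turkey (TKYB) 524 4"
pat <- "Bank of Turkey (TKYB)"

# Default: pat is a regex, so "(TKYB)" is a group matching the bare text
# "TKYB". The regex therefore looks for "Bank of Turkey TKYB", which does
# not occur in x, and the string comes back unchanged.
regex_result <- str_remove(x, pat)

# fixed(): pat is compared character by character, so the parentheses
# match themselves and the name is removed.
fixed_result <- str_remove(x, fixed(pat))

print(regex_result)
print(fixed_result)
```

The other names in the data happen to work without fixed() because their only regex metacharacter is ".", which matches any character, including a literal dot.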

You may want to wrap str_remove in trimws(.), because temp3 here has a leading space:

head(df$temp3)
# [1] " 528 2" " 525 1" " 525 1" " 524 4" " 524 4" " 524 4"

df[, temp3 := trimws(str_remove(temp, fixed(current_house)))]
head(df$temp3)
# [1] "528 2" "525 1" "525 1" "524 4" "524 4" "524 4"
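
As a quick sanity check, one could confirm that nothing but the two numbers is left after the removal. This is only a sketch under assumptions: dt is a made-up three-row subset of the data, leftover is a hypothetical name, and the check assumes every remainder has the form "digits, space, digits", as in the sample shown above.

```r
library(data.table)
library(stringr)

# Hypothetical three-row subset of the question's data
dt <- data.table(
  temp = c("KPMG 525 1",
           "Development and Investment Bank of Turkey (TKYB) 524 4",
           "Ping'an Securities Co.,Ltd. 496 1"),
  current_house = c("KPMG",
                    "Development and Investment Bank of Turkey (TKYB)",
                    "Ping'an Securities Co.,Ltd.")
)

# Literal removal followed by trimming, as in the answer
dt[, temp3 := trimws(str_remove(temp, fixed(current_house)))]

# Rows whose remainder is not "<digits> <digits>"; zero rows means
# every house name was removed
leftover <- dt[!str_detect(temp3, "^\\d+ \\d+$")]
print(nrow(leftover))
```

If leftover is not empty on the full data, the rows it contains are the ones where the pattern still failed to match.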