How to remove these elements from my string in R

One of my columns contains the following strings:

    259333154-Carat Programmatic»FCO»O»EV3»D&B - FCO Prospects ABM DDD 2020 D&B ABM ENDING 1/31/2020»NA+728 x 90»IN»UNV»TP»PM»dCPM»NAT»BTP»RON»NA»DB»N/A»ENG»M»P159WXZ
    238114259-Carat Programmatic»CPO XT5»O»EV2»END DATE 2/28/19 Google Custom Intent - XT5 CPO Google Custom Intent Audience/In Market Audience»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»BTS»RON»NA»DB»N/A»ENG»M»PX5LV6
    251368220-Carat Programmatic»XT6»O»EV1»END DATE 9/30/19 2019   Cadillac SMRT Always On - CRM - SMRT Segment XT6 Desktop Display»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»PRS»RON»NA»DB»N/A»ENG»M»P12LQG3
    235105211-Ebay»Silverado 1500»M»CON»ended   3/20 - ROS - eBay Run of Motors Desktop»NA+300 x 250»IN»DSK»TP»PM»CPM»NAT»SCR»ROS»NA»DB»N/A»ENG»M»PW79JH
    234990143-Carat Programmatic»XT4»O»EV2»Endemic - Loyalist|Edmunds&Oracle|N/A|Vehicle|N/A 2»NA+300 x 250»IN»MOB»TP»PM»CPM»NAT»BTO»RON»IMW»DB»N/A»ENG»M»PW7NSN

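My task is to remove the following from the strings:

  1. Dates (e.g. 1/31/2020, 3/20)
  2. Strings like "Ending", "ENDING", "END", "end", "End", "END DATE", but not strings that merely contain "end", such as "Endemic" in the last row
  3. Double spaces, e.g. "  "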

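I am really struggling with this. I am using lines of code like the following to do some of it, but I don't know how to do it more concisely and more completely: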

    dat = gsub("2017|2018|2019|2020", "", dat)
    dat = gsub("»»", "»", dat)
    dat = gsub("END ", "", dat)
    dat = gsub("end ", "", dat)
    dat = gsub("Ended ", "", dat)
    dat = gsub("ENDED ", "", dat)
    dat = gsub("DATE|date|Date", "", dat)

Thank you very much for your help!

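Consider the following patterns:

  • \b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\d\d\b should handle the "dates (e.g. 1/31/2020, 3/20)" case.
  • (?i)\bEnd(?: DATE|(?:ing)?)\b should handle the "strings like Ending, ENDING, END, end, End, END DATE" case, without matching words that merely contain "end", such as "Endemic" in the last row.
  • ([\s»]){2,} should handle the "double spaces" case.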
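As a quick check, each of the three patterns can be tried on its own. This is only a sketch: the input strings are made up for illustration, and the variable names (`date_re`, `end_re`, `space_re`) are not from the question. Note that inside an R string literal every regex backslash has to be doubled.

```r
# Regex patterns from the bullets above, written as R string literals
# (each backslash doubled; \u00BB is the » separator).
date_re  <- "\\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d\\b"
end_re   <- "(?i)\\bEnd(?: DATE|(?:ing)?)\\b"
space_re <- "([\\s\u00BB]){2,}"

# Made-up inputs, one per pattern.
no_dates  <- gsub(date_re, "", "a 1/31/2020 b 3/20 c", perl = TRUE)
no_end    <- gsub(end_re, "", "x END DATE Endemic Ending y", perl = TRUE)
one_space <- gsub(space_re, "\\1", "2019   Cadillac", perl = TRUE)

print(no_dates)   # [1] "a  b  c"
print(no_end)     # [1] "x  Endemic  y"
print(one_space)  # [1] "2019 Cadillac"
```

The first two results still contain double spaces, because each pattern only deletes its own match and leaves the surrounding spaces where they were.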

Combining everything:

    gsub("\\b(?:End(?:\\s+DATE|(?:ing)?)|(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b|([\\s»]){2,}", "\\1", x, perl=TRUE, ignore.case=TRUE)

See proof.
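Here is a minimal sketch of the combined call on shortened versions of three rows from the question. One thing to be aware of: `gsub()` works in a single pass, so deleting a word or a date that sits between two spaces leaves those two spaces side by side. The `tidy` step below is my own assumption about what a reasonable cleanup looks like (it drops whitespace around » and collapses leftover runs of whitespace); it is not part of the pattern above.

```r
# Combined pattern, split over lines for readability; \u00BB is the » separator.
pattern <- paste0(
  "\\b(?:End(?:\\s+DATE|(?:ing)?)",
  "|(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b",
  "|([\\s\u00BB]){2,}"
)

# Shortened rows from the question (not the full strings).
x <- c(
  "D&B ABM ENDING 1/31/2020\u00BBNA+728 x 90",
  "EV2\u00BBEND DATE 2/28/19 Google Custom Intent",
  "EV2\u00BBEndemic - Loyalist|Edmunds&Oracle"
)

# Single pass, as in the answer.
clean_once <- gsub(pattern, "\\1", x, perl = TRUE, ignore.case = TRUE)

# Hypothetical second pass: remove whitespace around » and collapse
# whitespace runs that the first pass left behind.
tidy <- function(s) {
  s <- gsub("\\s*\u00BB\\s*", "\u00BB", s, perl = TRUE)
  gsub("\\s{2,}", " ", s, perl = TRUE)
}
clean <- tidy(clean_once)

print(clean_once)
print(clean)
```

After the first pass the first row reads `D&B ABM  »NA+728 x 90` (two spaces before »); after `tidy` it reads `D&B ABM»NA+728 x 90`. The "Endemic" row is left unchanged by both steps.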

Explanation

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    End                      'End'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
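      \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                               or more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
      DATE                     'DATE'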
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
--------------------------------------------------------------------------------
        ing                      'ing'
--------------------------------------------------------------------------------
      )?                       end of grouping
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      0?                       '0' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      [1-9]                    any character of: '1' to '9'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      1                        '1'
--------------------------------------------------------------------------------
      [012]                    any character of: '0', '1', '2'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      [-/.]                    any character of: '-', '/', '.'
--------------------------------------------------------------------------------
      (?:                      group, but do not capture:
--------------------------------------------------------------------------------
        0?                       '0' (optional (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        [1-9]                    any character of: '1' to '9'
--------------------------------------------------------------------------------
       |                        OR
--------------------------------------------------------------------------------
        [12]                     any character of: '1', '2'
--------------------------------------------------------------------------------
        [0-9]                    any character of: '0' to '9'
--------------------------------------------------------------------------------
       |                        OR
--------------------------------------------------------------------------------
        3                        '3'
--------------------------------------------------------------------------------
        [01]                     any character of: '0', '1'
--------------------------------------------------------------------------------
      )                        end of grouping
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    [-/.]                    any character of: '-', '/', '.'
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      19                       '19'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      20                       '20'
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (                        group and capture to \1 (at least 2 times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [\s»]                    any character of: whitespace (\n, \r,
                             \t, \f, and " "), '»'
--------------------------------------------------------------------------------
  ){2,}                    end of \1 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
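The note about the quantified capture has a practical consequence that a small sketch makes visible: the replacement is whatever character ended the run, so a mixed run of spaces and » keeps only its last character. The inputs below are made up, and `run_re` is just a name chosen for this example.

```r
# Only the last repetition of the quantified group is kept as \1.
run_re <- "([\\s\u00BB]){2,}"   # \u00BB is the » separator

doubled   <- gsub(run_re, "\\1", "a\u00BB\u00BBb", perl = TRUE)  # run is »»
ends_sep  <- gsub(run_re, "\\1", "a \u00BBb", perl = TRUE)       # run is space, »
ends_spc  <- gsub(run_re, "\\1", "a \u00BB b", perl = TRUE)      # run is space, », space

print(doubled)   # [1] "a»b"
print(ends_sep)  # [1] "a»b"
print(ends_spc)  # [1] "a b"
```

In the last case the » separator itself is dropped because the run ends in a space, which may or may not be what you want for delimited data like this.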