How to remove these elements from my string in R
One of my columns contains strings like the following:
259333154-Carat Programmatic»FCO»O»EV3»D&B - FCO Prospects ABM DDD 2020 D&B ABM ENDING 1/31/2020»NA+728 x 90»IN»UNV»TP»PM»dCPM»NAT»BTP»RON»NA»DB»N/A»ENG»M»P159WXZ
238114259-Carat Programmatic»CPO XT5»O»EV2»END DATE 2/28/19 Google Custom Intent - XT5 CPO Google Custom Intent Audience/In Market Audience»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»BTS»RON»NA»DB»N/A»ENG»M»PX5LV6
251368220-Carat Programmatic»XT6»O»EV1»END DATE 9/30/19 2019 Cadillac SMRT Always On - CRM - SMRT Segment XT6 Desktop Display»NA+160 x 600»IN»DSK»TP»PM»CPM»NAT»PRS»RON»NA»DB»N/A»ENG»M»P12LQG3
235105211-Ebay»Silverado 1500»M»CON»ended 3/20 - ROS - eBay Run of Motors Desktop»NA+300 x 250»IN»DSK»TP»PM»CPM»NAT»SCR»ROS»NA»DB»N/A»ENG»M»PW79JH
234990143-Carat Programmatic»XT4»O»EV2»Endemic - Loyalist|Edmunds&Oracle|N/A|Vehicle|N/A 2»NA+300 x 250»IN»MOB»TP»PM»CPM»NAT»BTO»RON»IMW»DB»N/A»ENG»M»PW7NSN
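My task is to remove the following from the strings:
- dates (e.g. 1/31/2020, 3/20)
- strings like Ending, ENDING, END, end, End, END DATE, but not an "end" that is part of a longer word, such as Endemic in the last string
- double spaces, e.g. "  "
I'm really struggling with this. I'm using lines of code like the following to do part of the job, but I don't know how to do it more concisely and completely: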
dat = gsub("2017|2018|2019|2020", "" ,dat)
dat = gsub("»»", "»", dat)
dat = gsub("END ", "", dat)
dat = gsub("end ", "", dat)
dat = gsub("Ended ", "", dat)
dat = gsub("ENDED ", "", dat)
dat = gsub("DATE|date|Date", "", dat)
Thanks a lot for your help!
Check the following:
\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\d\d\b
should handle the "dates (e.g. 1/31/2020, 3/20)" case.
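As a quick check, the date pattern can be tried on its own in R. This is only a minimal sketch: the sample vector `x` is made up from fragments of the strings in the question, and the backslashes are doubled because the pattern sits inside an R string literal.

```r
# Date pattern from above, written as an R string (note the doubled backslashes).
date_re <- "\\b(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d\\b"

# Hypothetical sample fragments taken from the strings in the question.
x <- c("ABM ENDING 1/31/2020", "END DATE 2/28/19 Google",
       "ended 3/20 - ROS", "NA+728 x 90")

# perl = TRUE is needed for the (?: ... ) groups.
no_dates <- gsub(date_re, "", x, perl = TRUE)
print(no_dates)
# "ABM ENDING "  "END DATE  Google"  "ended  - ROS"  "NA+728 x 90"
```

Two things to keep in mind: a bare year such as the 2020 in "DDD 2020" is not matched by this pattern, and each removal leaves its surrounding spaces behind, so the whitespace still has to be cleaned up afterwards.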
(?i)\bEnd(?: DATE|(?:ing)?)\b
should handle the "strings like Ending, ENDING, END, end, End, END DATE, but not an 'end' inside a longer word such as Endemic in the last string" case.
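The word-boundary behaviour is easiest to see on a few inputs. The sketch below uses made-up fragments from the question's strings; the inline `(?i)` flag makes the match case-insensitive, so `ignore.case` is not needed here.

```r
# "End" pattern from above as an R string; (?i) turns on case-insensitive matching.
end_re <- "(?i)\\bEnd(?: DATE|(?:ing)?)\\b"

# Hypothetical sample fragments taken from the strings in the question.
x <- c("ABM ENDING 1/31", "END DATE 2/28/19", "ended 3/20",
       "Endemic - Loyalist", "End of flight")

no_end <- gsub(end_re, "", x, perl = TRUE)
print(no_end)
# "ABM  1/31"  " 2/28/19"  "ended 3/20"  "Endemic - Loyalist"  " of flight"
```

"Endemic" survives because `\b` requires a word boundary right after "End" or "Ending". For the same reason "ended" is left alone. If that form should go too (the code in the question strips "Ended "), one option would be to widen the optional suffix to `(?:ing|ed)?`, but that is an assumption about the requirement, not something the pattern above does.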
([\s»]){2,}
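should handle the "double spaces, e.g. "  "" case.
Combining everything (inside an R string every backslash must be doubled):
gsub("\\b(?:End(?:\\s+DATE|(?:ing)?)|(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b|([\\s»]){2,}", "\\1", x, perl=TRUE, ignore.case=TRUE)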
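A small sketch of the collapsing step, with made-up inputs. The separator is written as the escape `\u00bb` (the » character) so the snippet does not depend on the file encoding; the replacement `\\1` puts back the last character of each matched run.

```r
sep <- "\u00bb"  # the » field separator

# Runs of two or more whitespace/separator characters, captured one character at a time.
collapse_re <- paste0("([\\s", sep, "]){2,}")

# Hypothetical inputs: a double space, a doubled separator, and a mixed run.
x <- c("ABM  DDD", paste0("EV3", sep, sep, "NA"), paste0("EV2", sep, "  Google"))

collapsed <- gsub(collapse_re, "\\1", x, perl = TRUE)
print(collapsed)
```

The first two inputs come back as "ABM DDD" and "EV3»NA". The third shows a caveat: because only the last repetition is kept, the mixed run "»  " collapses to a single space and the result is "EV2 Google", with the separator gone. If separators must be preserved, spaces and » are probably better collapsed in separate steps.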
See proof.
Explanation
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
End 'End'
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
DATE ' DATE'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
ing 'ing'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
0? '0' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
[1-9] any character of: '1' to '9'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
1 '1'
--------------------------------------------------------------------------------
[012] any character of: '0', '1', '2'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
[-/.] any character of: '-', '/', '.'
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
0? '0' (optional (matching the most
amount possible))
--------------------------------------------------------------------------------
[1-9] any character of: '1' to '9'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
[12] any character of: '1', '2'
--------------------------------------------------------------------------------
[0-9] any character of: '0' to '9'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
3 '3'
--------------------------------------------------------------------------------
[01] any character of: '0', '1'
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
[-/.] any character of: '-', '/', '.'
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
19 '19'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
20 '20'
--------------------------------------------------------------------------------
)? end of grouping
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
\d digits (0-9)
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1 (at least 2 times
(matching the most amount possible)):
--------------------------------------------------------------------------------
[\s»] any character of: whitespace (\n, \r,
\t, \f, and " "), '»'
--------------------------------------------------------------------------------
){2,} end of \1 (NOTE: because you are using a
quantifier on this capture, only the LAST
repetition of the captured pattern will be
stored in \1)
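Putting it together on one of the sample strings: `gsub` makes a single pass, so spaces that only become adjacent after "END DATE" and the date are removed are not collapsed by the same call. The sketch below adds a second clean-up pass for that. `clean_labels()` is a hypothetical helper name, and trimming the spaces next to each » assumes the fields carry no meaningful leading or trailing spaces.

```r
sep <- "\u00bb"  # the » field separator, written as an escape

# Combined pattern from the answer, split over several lines for readability.
combined_re <- paste0(
  "\\b(?:End(?:\\s+DATE|(?:ing)?)|",
  "(?:0?[1-9]|1[012])(?:[-/.](?:0?[1-9]|[12][0-9]|3[01]))?[-/.](?:19|20)?\\d\\d)\\b",
  "|([\\s", sep, "]){2,}"
)

# Hypothetical helper, not part of the original answer.
clean_labels <- function(x) {
  out <- gsub(combined_re, "\\1", x, perl = TRUE, ignore.case = TRUE)
  out <- gsub(" {2,}", " ", out)                  # spaces left behind by the removals
  out <- gsub(paste0(" *", sep, " *"), sep, out)  # assumption: no meaningful spaces at field edges
  trimws(out)
}

# Shortened version of the second sample string from the question.
x <- paste0("CPO XT5", sep, "O", sep, "EV2", sep,
            "END DATE 2/28/19 Google Custom Intent", sep, "NA+160 x 600")

cleaned <- clean_labels(x)
print(cleaned)
```

For this input the result should be "CPO XT5»O»EV2»Google Custom Intent»NA+160 x 600": the "END DATE" marker and the date are gone, and sizes such as "160 x 600" are untouched.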