Insert special character in R based on existing words

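I am looking for an intuitive solution to my problem. I have a huge list of words, and I have to insert a special character into it based on certain criteria. If a two- or three-letter word appears in a cell, I want to add "+" to its left and right.

Example

global b2b banking would be converted to global +b2b+ banking

how to finance commercial ale estate would be converted to how +to+ finance commercial +ale+ estate

Here is a sample dataset: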

sample <- c("commercial funding",
"global b2b banking",
"how to finance commercial ale estate",
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)

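Also, is it possible to remove the rows that contain a word of character length 1? Example:

W Hotels

I tried gsub, which removes all single-letter words,

gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)

but this row should be removed from the dataset.

Any help is much appreciated.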

Edit 1

Thanks for your help. I added a few more lines:

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data




                                          sample
1                             commercial++funding
2                          global+++b2b+++banking
3  how++++to+++finance++commercial+++ale+++estate
4                international++currency++account
5                         miami++imports++banking
6                  hsbc++supply++chain++financing
7              international++business++expansion
8             grow++business+++in++++us+++banking
9                commercial++trade++asia++pacific
10            business++line+++of+++credits++hsbc
11                   britain++commercial++banking
12                          fx+++settlement++hsbc

Somehow I am unable to remove "+," and ",+" with gsub. What am I doing wrong? "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead the "," is being replaced with extra "++".

You need to do this in two steps: first remove the items that contain a 1-character whole word, then add + around the 2-3 character words.

Use

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data


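sample[!grepl("\\b[[:alnum:]]\\b",sample)] removes the items that match a word boundary (\b), a single alphanumeric character ([[:alnum:]]), and another word boundary, that is, the items that contain a one-character whole word.

The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces every 2-3 character whole word with that same word enclosed in +.

Result: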

                                       sample
1                          commercial funding
2                        global +b2b+ banking
3  +how+ +to+ finance commercial +ale+ estate
4              international currency account
5                       miami imports banking
6                 hsbc supply chain financing
7            international business expansion
8             grow business +in+ +Us+ banking
9               commercial trade Asia Pacific
10            business line +of+ credits hsbc
11                 Britain commercial banking
12                       +fx+ settlement hsbc

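Note that W Hotels and opening a commercial account have been filtered out.

Answer to the edit

You added some replacement operations to your code, but they replace literal strings, so you only need to pass the fixed=TRUE argument: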

sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)

Otherwise, + is treated as a regex quantifier and must be escaped to be matched as a literal plus sign.
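As a small illustration of that point, the sketch below compares fixed=TRUE with two regex spellings of a literal plus sign. The input string is the example from the question; the variable names are only illustrative.

```r
# Example string taken from the question
x <- "fx+,settlement,hsbc"

# Literal string replacement: no regex interpretation at all
literal_fix <- gsub("+,", "+", x, fixed = TRUE)

# Regex replacement with the plus sign escaped
escaped <- gsub("\\+,", "+", x)

# Regex replacement with the plus sign inside a bracket expression
bracketed <- gsub("[+],", "+", x)

print(literal_fix)
# [1] "fx+settlement,hsbc"

# All three spellings give the same result
stopifnot(identical(literal_fix, escaped), identical(literal_fix, bracketed))
```

When the pattern is a plain string with no regex features, fixed=TRUE is usually the simplest of the three to read.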

Also, if you need to remove all + characters from the start of a string, use

sample <- sub("^\\++", "", sample)
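
To see what that pattern does, here is a quick check on a few made-up strings (the vector y is hypothetical, not part of the original data). Only the run of + at the very start of each string is removed; any + elsewhere is left alone.

```r
# Hypothetical inputs: one leading plus, two leading pluses, no leading plus
y <- c("+fx+ settlement hsbc", "++how+ to", "global +b2b+ banking")

# "^" anchors at the start; "\\+" is a literal plus; the final "+" means one or more
stripped <- sub("^\\++", "", y)

print(stripped)
# [1] "fx+ settlement hsbc"  "how+ to"              "global +b2b+ banking"
```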