Insert special character in R based on existing words
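I am looking for an intuitive solution to my problem.
I have a huge list of words, and I need to insert a special character into it based on certain criteria.
If a two- or three-letter word appears in a cell, I want to add "+" to its left and right.
Example:
global b2b banking
would be converted to global +b2b+ banking
how to finance commercial ale estate
would be converted to how +to+ finance commercial +ale+ estate
Here is a sample data set: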
sample <- c("commercial funding",
"global b2b banking",
"how to finance commercial ale estate",
"opening a commercial account",
"international currency account",
"miami imports banking",
"hsbc supply chain financing",
"international business expansion",
"grow business in Us banking",
"commercial trade Asia Pacific",
"business line of credits hsbc",
"Britain commercial banking",
"fx settlement hsbc",
"W Hotels")
data <- data.frame(sample)
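Also, is it possible to remove the rows that contain a word of character length 1?
Example:
W Hotels
I tried removing all single-letter words with gsub,
gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample)
but a row like this should be removed from the data set entirely.
Any help is much appreciated.
Edit 1
Thanks for your help, I added a few lines: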
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)]
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample)
sample <- gsub(" ",",",sample)
sample <- gsub("+,","+",sample)
sample <- gsub(",+","+",sample)
sample <- tolower(sample)
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample)
data <- data.frame(sample)
data
sample
1 commercial++funding
2 global+++b2b+++banking
3 how++++to+++finance++commercial+++ale+++estate
4 international++currency++account
5 miami++imports++banking
6 hsbc++supply++chain++financing
7 international++business++expansion
8 grow++business+++in++++us+++banking
9 commercial++trade++asia++pacific
10 business++line+++of+++credits++hsbc
11 britain++commercial++banking
12 fx+++settlement++hsbc
Somehow I am not able to remove "+" and "," with gsub. What am I doing wrong?
So "fx+,settlement,hsbc"
should be "fx+settlement,hsbc"
but it is replacing , with an extra ++.
You need to do it in two steps: first remove the items that contain a one-character whole word, then add + around the 2-3 character words.
Use
sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels")
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)]
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample)
data <- data.frame(sample)
data
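sample[!grepl("\\b[[:alnum:]]\\b",sample)] removes the items that match a word boundary (\b), a single alphanumeric character ([[:alnum:]]), and another word boundary, that is, items containing a one-character whole word.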
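As a small illustration of the filtering step, here is a sketch on a toy vector (the vector x and the names has_single and kept are made up for this example and are not part of the original data):

```r
# Hypothetical toy input, chosen to show which items the filter drops
x <- c("W Hotels", "opening a commercial account",
       "global b2b banking", "R2 unit")

# TRUE where one alphanumeric character stands alone between word boundaries
has_single <- grepl("\\b[[:alnum:]]\\b", x)

# Keep only the items without a one-character word
kept <- x[!has_single]
print(kept)
```

"b2b" and "R2" survive because their characters are adjacent word characters, so there is no word boundary on both sides of any single one of them.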
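The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces every 2-3 character whole word with that same word enclosed in +. The parentheses capture the word, and \\1 in the replacement inserts it back.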
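To see the backreference at work in isolation, here is a minimal sketch on three of the sample strings (the names x and wrapped are only for this example):

```r
# Three strings taken from the sample data
x <- c("global b2b banking",
       "grow business in Us banking",
       "fx settlement hsbc")

# ( ... ) captures the 2-3 character word; "\\1" re-inserts it between the pluses
wrapped <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", x)
print(wrapped)
```

Longer words are left alone because the closing \\b cannot match in the middle of a word, so only complete 2-3 character words are wrapped.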
Result:
sample
1 commercial funding
2 global +b2b+ banking
3 +how+ +to+ finance commercial +ale+ estate
4 international currency account
5 miami imports banking
6 hsbc supply chain financing
7 international business expansion
8 grow business +in+ +Us+ banking
9 commercial trade Asia Pacific
10 business line +of+ credits hsbc
11 Britain commercial banking
12 +fx+ settlement hsbc
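Note that W Hotels and opening a commercial account have been filtered out.
Answer to the edit
You added some replacement operations to your code, but these are literal string replacements, so you just need to pass the fixed=TRUE argument: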
sample <- gsub(" ",",",sample, fixed=TRUE)
sample <- gsub("+,","+",sample, fixed=TRUE)
sample <- gsub(",+","+",sample, fixed=TRUE)
Otherwise, + is treated as a regex quantifier and must be escaped to be treated as a literal plus sign.
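For comparison, here is a sketch of three ways to match the literal "+," sequence, using the string from the question (the variable names are made up for this example). Which one to use is mostly a matter of taste; fixed = TRUE is the simplest when no regex features are needed.

```r
x <- "fx+,settlement,hsbc"

# 1. Treat the pattern as a plain string, not a regex
with_fixed  <- gsub("+,", "+", x, fixed = TRUE)

# 2. Escape the plus so the regex engine reads it literally
with_escape <- gsub("\\+,", "+", x)

# 3. Put the plus in a bracket expression, where it has no special meaning
with_class  <- gsub("[+],", "+", x)

print(with_fixed)
```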
Also, if you need to remove all + characters from the start of the string, use
sample <- sub("^\\++", "", sample)
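A quick sketch of what that pattern does, on made-up strings (x and stripped are hypothetical names): ^ anchors at the start, \\+ is a literal plus, and the trailing + repeats it, so any run of leading pluses is removed while pluses elsewhere stay.

```r
# Hypothetical inputs with zero, one, and two leading pluses
x <- c("plain", "+fx+,settlement", "++how+to")

stripped <- sub("^\\++", "", x)
print(stripped)
```

Unlike the ifelse(substr(...)) line in the question, which drops a single leading character, this handles any number of leading pluses in one call.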