How to remove words from a string in a specific pattern and keep the first number

Question

I have a bunch of strings of the form d.d% word1 word2 word3. Words 2 and 3 may or may not be present in the string. For example, I have these strings:

1.0% SELLING
3.2% AND 1.0% AND 1.2%
1.0% SOLD PRICE
1.2% PURCHASE PRICE FINAL
2.5% AND 1.0%
1.0% SELLING 2.0% people

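What I would like to do is, for strings 1, 3 and 4 only, keep just 1.0%, 1.0% and 1.2% respectively. What I tried is:

gsub("(\\d\\.\\d%) \\w+ ((?:\\w+)?)+", "\\1", x)

The reason I used the pattern above is: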

(\d\.\d%) ....> capturing the number part
\w+ .....> first word
((?:\w+)?)+  .....> second and other words (optional and in a non-capturing group)

(For some reason, \s does not work for me in some cases and is captured as a literal s! So I use a space between words.)

The expected result should look like this:

1.0%
3.2% AND 1.0% AND 1.2%
1.0%
1.2%
2.5% AND 1.0%
1.0% SELLING 2.0% people

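The code should only change strings that follow this pattern: d.d% (rest of the string are only words and not a number), which is why 1.0% SELLING 2.0% people is not changed.

However, this code only works with 2 words and fails with 1 or 3 words. How can I fix this?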

Answer 1

You can extract all the numbers from each string and replace only those strings that contain exactly one number.

tmp <- stringr::str_extract_all(x, '\\d+\\.\\d+%')
x[lengths(tmp) == 1] <- unlist(tmp[lengths(tmp) == 1])
x

#[1] "1.0%"      "3.2% AND 1.0% AND 1.2%"   "1.0%"                    
#[4] "1.2%"      "2.5% AND 1.0%"            "1.0% SELLING 2.0% people"
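For anyone who prefers not to load stringr, the same idea can be sketched in base R with regmatches() and gregexpr(). This is only an illustration of the approach above, not part of the original answer; the names only_one and result are mine, and x is assumed to be the vector from the question.

```r
# Sample data from the question
x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# Extract every percentage per string; the result is a list with one
# character vector per element of x
tmp <- regmatches(x, gregexpr("\\d+\\.\\d+%", x))

# Strings that contain exactly one percentage
only_one <- lengths(tmp) == 1

# Replace those strings with their single percentage, leave the rest alone
result <- x
result[only_one] <- unlist(tmp[only_one])
result
```

Note that this approach only counts percentages, so a string such as "1.0% SELLING 2 items" would also be reduced to "1.0%". If other numbers can appear in the data, an anchored pattern is the safer choice.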

Answer 2

You can use

sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
stringr::str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')

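See the regex demo. Details:

^ - start of string
(\d+\.\d+%) - Group 1 (\1): one or more digits, a ., one or more digits and a % sign
(?:\s+\w+)* - zero or more repetitions of one or more whitespace characters followed by one or more word characters
$ - end of string.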
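To check which strings the anchored pattern matches as a whole (and would therefore be shortened), grepl() can be used. A small sketch, assuming the sample vector from the question; the names pattern and matches_whole are just for illustration.

```r
# Sample data from the question
x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# Same pattern as above; backslashes are doubled inside an R string literal
pattern <- '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$'

# TRUE only where the whole string is a percentage followed by plain words
matches_whole <- grepl(pattern, x, perl = TRUE)
matches_whole
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
```

Strings 2, 5 and 6 fail because \w+ cannot consume the . or % of a second number, so the $ anchor is never reached and sub() leaves them untouched.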

See the R demo:

x <- c("1.0% SELLING","3.2% AND 1.0% AND 1.2%","1.0% SOLD PRICE","1.2% PURCHASE PRICE FINAL","2.5% AND 1.0%","1.0% SELLING 2.0% people")
sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
library(stringr)
stringr::str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')

Both produce the same output:

[1] "1.0%"                     "3.2% AND 1.0% AND 1.2%"  
[3] "1.0%"                     "1.2%"                    
[5] "2.5% AND 1.0%"            "1.0% SELLING 2.0% people"

regex

r

gsub