How to remove words from strings that match a specific pattern and keep the first number

I have a bunch of strings of the form d.d% word1 word2 word3. Word 2 and word 3 may or may not be present. For example, I have these strings:

1.0% SELLING
3.2% AND 1.0% AND 1.2%
1.0% SOLD PRICE
1.2% PURCHASE PRICE FINAL
2.5% AND 1.0%
1.0% SELLING 2.0% people

What I want is to change only strings 1, 3, and 4, keeping just 1.0%, 1.0%, and 1.2% respectively. This is what I tried:

gsub("(\\d\\.\\d%) \\w+ ((?:\\w+)?)+", "\\1", x)

My reasoning behind the pattern above:

(\d\.\d%) ....> capturing the number part
\w+ .....> first word
((?:\w+)?)+  .....> second and later words (optional, in a non-capturing group)

(For some reason, \s does not work for me in some cases: it gets captured as a literal s! So I used a plain space between the words.)

The expected result should look like this:

1.0%
3.2% AND 1.0% AND 1.2%
1.0%
1.2%
2.5% AND 1.0%
1.0% SELLING 2.0% people

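The code should change only the strings that follow this pattern: d.d% followed by words only, with no further numbers in the rest of the string. (That is why 1.0% SELLING 2.0% people is left unchanged.)

However, this code works only when the number is followed by 2 words, not by 1 or 3 words. Could you tell me how to fix this?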

You can extract all the numbers from each string and replace only those strings that contain exactly one number.

tmp <- stringr::str_extract_all(x, '\\d+\\.\\d+%')
x[lengths(tmp) == 1] <- unlist(tmp[lengths(tmp) == 1])
x

#[1] "1.0%"      "3.2% AND 1.0% AND 1.2%"   "1.0%"                    
#[4] "1.2%"      "2.5% AND 1.0%"            "1.0% SELLING 2.0% people"
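
If stringr is not available, the same extract-and-count idea can be sketched in base R. This is only an illustrative variant, assuming the sample vector from the question; the names `single` and `result` are introduced here and are not part of the original answer.

x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# Extract every percentage from each string (a list with one character vector per string)
tmp <- regmatches(x, gregexpr("\\d+\\.\\d+%", x))

# Flag the strings that contain exactly one percentage
single <- lengths(tmp) == 1

# Replace only the flagged strings with their single percentage; leave the rest untouched
result <- x
result[single] <- unlist(tmp[single])
result

Note that this approach, like the stringr version, counts percentages anywhere in the string, so it does not require the number to be at the start.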

You can use

sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
stringr::str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')

See the regex demo. Details:

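  • ^ - start of the string
  • (\d+\.\d+%) - Group 1 (\1): one or more digits, a ., one or more digits, and a % sign
  • (?:\s+\w+)* - zero or more repetitions of one or more whitespace characters followed by one or more word characters
  • $ - end of the string.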

See the R demo:

x <- c("1.0% SELLING","3.2% AND 1.0% AND 1.2%","1.0% SOLD PRICE","1.2% PURCHASE PRICE FINAL","2.5% AND 1.0%","1.0% SELLING 2.0% people")
# Base R, using the PCRE engine
sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
# Equivalent call with stringr
library(stringr)
str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')

Both calls output

[1] "1.0%"                     "3.2% AND 1.0% AND 1.2%"  
[3] "1.0%"                     "1.2%"                    
[5] "2.5% AND 1.0%"            "1.0% SELLING 2.0% people"
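
To see which strings the anchored pattern described above accepts in full, a quick check with grepl can help. This is a minimal sketch, assuming the same sample vector; the names `pattern` and `matches` are introduced here for illustration only.

x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# Same pattern as in the sub() call: the whole string must be a number plus words
pattern <- "^(\\d+\\.\\d+%)(?:\\s+\\w+)*$"

# TRUE where the entire string matches, i.e. where sub() will replace it with group 1
matches <- grepl(pattern, x, perl = TRUE)
matches
#[1]  TRUE FALSE  TRUE  TRUE FALSE FALSE

Strings with a second percentage fail because the . and % characters are neither whitespace nor word characters, so the match cannot reach the end of the string. One caveat: \w also matches digits and underscores, so a string such as "1.0% SELLING 20" would still match; if that matters, [[:alpha:]]+ in place of \w+ is one possible stricter choice.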