How to remove words from a string in a specific pattern and keep the first number
I have a bunch of strings of the form d.d% word1 word2 word3. Words 2 and 3 may or may not be present. For example, I have these strings:
1.0% SELLING
3.2% AND 1.0% AND 1.2%
1.0% SOLD PRICE
1.2% PURCHASE PRICE FINAL
2.5% AND 1.0%
1.0% SELLING 2.0% people
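What I want is, for strings 1, 3, and 4 only, to keep just the number: 1.0%, 1.0%, 1.2%.
What I tried is:
gsub("(\\d\\.\\d%) \\w+ ((?:\\w+)?)+", "\\1", x)
My reasoning behind this pattern:
(\d\.\d%) ....> captures the number part
\w+ .....> first word
((?:\w+)?)+ .....> second and further words (optional, in a non-capturing group)
(For some reason, \s does not work for me in some cases; it gets matched as a literal s! So I use a space between the words.)
The expected result should look like this: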
1.0%
3.2% AND 1.0% AND 1.2%
1.0%
1.2%
2.5% AND 1.0%
1.0% SELLING 2.0% people
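The code should only change strings that follow this pattern: d.d% (rest of the string are only words and not a number)
(which is why 1.0% SELLING 2.0% people is left unchanged)
However, this code only works for 2 words and fails for 1 or 3 words. Could you tell me how to fix this?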
You can extract all the numbers from each string and replace only those strings that contain exactly one number.
tmp <- stringr::str_extract_all(x, '\\d+\\.\\d+%')
x[lengths(tmp) == 1] <- unlist(tmp[lengths(tmp) == 1])
x
#[1] "1.0%" "3.2% AND 1.0% AND 1.2%" "1.0%"
#[4] "1.2%" "2.5% AND 1.0%" "1.0% SELLING 2.0% people"
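The same extract-and-count idea can also be sketched without stringr. The block below is a minimal base R version, assuming the sample vector from the question; the names hits and only_one are illustrative and not part of the original answer.

```r
# Sample input from the question
x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# Find every percentage in each string; the result is a list holding
# one character vector of matches per input string
hits <- regmatches(x, gregexpr("\\d+\\.\\d+%", x, perl = TRUE))

# Replace a string with its percentage only when exactly one was found
only_one <- lengths(hits) == 1
x[only_one] <- unlist(hits[only_one])
x
```

Note that this approach counts percentages anywhere in the string, so a string with a single percentage in the middle, or one followed by a plain number, would also be reduced to that percentage.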
You can use
sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
stringr::str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')
See the regex demo. Details:
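^ - start of string
(\d+\.\d+%) - Group 1 (\1): one or more digits, a ., one or more digits, and a % sign
(?:\s+\w+)* - zero or more repetitions of one or more whitespace characters followed by one or more word characters
$ - end of string.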
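To see which strings the anchored pattern actually accepts, it may help to test it with grepl before substituting. This is a small sketch using the sample vector from the question; the variable names pattern and matches are illustrative.

```r
# Sample input from the question
x <- c("1.0% SELLING", "3.2% AND 1.0% AND 1.2%", "1.0% SOLD PRICE",
       "1.2% PURCHASE PRICE FINAL", "2.5% AND 1.0%", "1.0% SELLING 2.0% people")

# In an R string literal every regex backslash has to be doubled
pattern <- "^(\\d+\\.\\d+%)(?:\\s+\\w+)*$"

# TRUE only where the whole string is a percentage followed by words
matches <- grepl(pattern, x, perl = TRUE)
matches
# values: TRUE FALSE TRUE TRUE FALSE FALSE
```

Strings 2, 5, and 6 fail because \w+ can consume the leading digit of a later percentage, but the . and % that follow are neither whitespace nor word characters, so $ can never be reached.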
See the R demo:
x <- c("1.0% SELLING","3.2% AND 1.0% AND 1.2%","1.0% SOLD PRICE","1.2% PURCHASE PRICE FINAL","2.5% AND 1.0%","1.0% SELLING 2.0% people")
sub('^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1', x, perl=TRUE)
library(stringr)
stringr::str_replace(x, '^(\\d+\\.\\d+%)(?:\\s+\\w+)*$', '\\1')
Both return:
[1] "1.0%" "3.2% AND 1.0% AND 1.2%"
[3] "1.0%" "1.2%"
[5] "2.5% AND 1.0%" "1.0% SELLING 2.0% people"
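One caveat worth checking against your data: \w also matches digits, so a string such as "1.0% SELLING 5" would be reduced to "1.0%" by the pattern above. If the trailing words must contain letters only, a stricter variant could look like the sketch below. The helper name keep_leading_percent and the ASCII-only class [A-Za-z] are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: keep only the leading percentage when the rest of
# the string consists purely of alphabetic words
keep_leading_percent <- function(x) {
  sub("^(\\d+\\.\\d+%)(?:\\s+[A-Za-z]+)*$", "\\1", x, perl = TRUE)
}

result <- keep_leading_percent(c("1.0% SOLD PRICE",
                                 "1.0% SELLING 5",
                                 "1.0% SELLING 2.0% people"))
result
# values: "1.0%", "1.0% SELLING 5", "1.0% SELLING 2.0% people"
```

Only the first string is shortened; the other two contain a number after the leading percentage and are returned unchanged.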