Choosing words not containing specific strings in R
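I would like to select the words that do not contain a specific string, subject to a few conditions, and I cannot figure out how to proceed. Let me explain what I want to do:
I do not want to delete any columns in my data frame, but I would like to remove some strings in specific cells. First of all, "a a", "b b", "c c", etc. are groups, and some groups have "(s)" next to them.
- If a cell contains more than one group and one of them has "(s)", delete that group and use the first group without "(s)".
- If a cell contains only one group and it has "(s)", keep it as it is.
- If a cell contains only one group and it does not have "(s)", keep it as it is.
- If a cell starts with a group that does not have "(s)", keep that group.
I have attached my data at the end of this post.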
My data looks like this:
ID | COL1 |
---|---|
1 | a a (s), b b (s), c c |
2 | d d, e e (s), f f |
3 | a a (s), b b, f f |
4 | k k (s), b b (s), c c |
5 | y y, a a (s), e e (s), g g |
6 | a a (s), c c, f f |
7 | k k (s), b b (s), c c |
8 | e e (s), k k (s), b b (s), f f |
9 | d d, e e (s), f f |
10 | k k (s), b b (s), c c |
11 | d d, a a (s), f f |
12 | q q (s), t t, h h |
13 | m m, h h |
14 | r r, d d |
15 | q q (s) |
16 | r r |
17 | q q (s) |
18 | c c |
19 | k k (s) |
20 | d d |
21 | m m, k k (s) |
22 | r r |
23 | k k (s), b b (s), q q (s), c c |
24 | a a (s), k k (s), b b (s), f f, d d |
25 | h h |
26 | q q (s), a a (s), c c |
27 | k k (s) |
28 | e e (s) |
29 | m m |
30 | r r |
And I would like to make this data like this:
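ID | COL1 |
---|---|
1 | c c |
2 | d d |
3 | b b |
4 | c c |
5 | y y |
6 | c c |
7 | c c |
8 | f f |
9 | d d |
10 | c c |
11 | d d |
12 | t t |
13 | m m |
14 | r r |
15 | q q (s) |
16 | r r |
17 | q q (s) |
18 | c c |
19 | k k (s) |
20 | d d |
21 | m m |
22 | r r |
23 | c c |
24 | f f |
25 | h h |
26 | c c |
27 | k k (s) |
28 | e e (s) |
29 | m m |
30 | r r |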
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30), COL1 = c("a a (s), b b (s), c c", "d d, e e (s), f f",
"a a (s), b b, f f", "k k (s), b b (s), c c", "y y, a a (s), e e (s), g g",
"a a (s), c c, f f", "k k (s), b b (s), c c", "e e (s), k k (s), b b (s), f f",
"d d, e e (s), f f", "k k (s), b b (s), c c", "d d, a a (s), f f",
"q q (s), t t, h h", "m m, h h", "r r, d d", "q q (s)", "r r",
"q q (s)", "c c", "k k (s)", "d d", "m m, k k (s)", "r r", "k k (s), b b (s), q q (s), c c",
"a a (s), k k (s), b b (s), f f, d d", "h h", "q q (s), a a (s), c c",
"k k (s)", "e e (s)", "m m", "r r")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -30L))
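Here is one way. It uses a series of gregexpr/regmatches calls to detect and extract the regular-expression matches, and *apply loops to keep the wanted elements of the resulting vectors.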
df1 <-
structure(list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30),
COL1 = c("a a (s), b b (s), c c", "d d, e e (s), f f",
"a a (s), b b, f f", "k k (s), b b (s), c c", "y y, a a (s), e e (s), g g",
"a a (s), c c, f f", "k k (s), b b (s), c c", "e e (s), k k (s), b b (s), f f",
"d d, e e (s), f f", "k k (s), b b (s), c c", "d d, a a (s), f f",
"q q (s), t t, h h", "m m, h h", "r r, d d", "q q (s)", "r r",
"q q (s)", "c c", "k k (s)", "d d", "m m, k k (s)", "r r", "k k (s), b b (s), q q (s), c c",
"a a (s), k k (s), b b (s), f f, d d", "h h", "q q (s), a a (s), c c",
"k k (s)", "e e (s)", "m m", "r r")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))
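# Locate every group marked with "(s)", e.g. "a a (s)"
m <- gregexpr("[[:alpha:]] [[:alpha:]] \\(s\\)", df1$COL1)
# Blank out the marked groups unless the first match spans the whole cell
tmp <- sapply(seq_along(df1$COL1), \(i) {
  len <- attr(m[[i]], "match.length")[1]
  if(len != nchar(df1$COL1[i])) {
    regmatches(df1$COL1[i], m[i]) <- ""
  }
  df1$COL1[i]
})
# Split on the comma separators and drop the empty pieces left behind
m <- gregexpr(", *", tmp)
tmp <- regmatches(tmp, m, invert = TRUE)
tmp <- sapply(tmp, \(x) x[nchar(x) != 0L])
# Keep the first remaining group of each cell
df1$COL1 <- sapply(tmp, `[`, 1)
df1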
#> # A tibble: 30 x 2
#> ID COL1
#> <dbl> <chr>
#> 1 1 c c
#> 2 2 d d
#> 3 3 b b
#> 4 4 c c
#> 5 5 y y
#> 6 6 c c
#> 7 7 c c
#> 8 8 f f
#> 9 9 d d
#> 10 10 c c
#> # ... with 20 more rows
Created on 2022-04-22 by the reprex package (v2.0.1)
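As an alternative to the gregexpr/regmatches approach, the same rules can be sketched by splitting each cell on its comma separators and keeping the first group that does not contain "(s)". This is only a minimal sketch, not the method used in the answer: `first_unmarked` is a hypothetical helper name, and returning the cell unchanged when every group carries "(s)" is an assumption, since the question only specifies that behaviour for single-group cells.

```r
# Hypothetical helper: for each cell, return the first group without "(s)".
# Assumption: if every group in the cell has "(s)", the cell is kept as is.
first_unmarked <- function(x) {
  groups <- strsplit(x, ", *")
  vapply(seq_along(x), function(i) {
    # fixed = TRUE matches the literal text "(s)" rather than a regex group
    unmarked <- groups[[i]][!grepl("(s)", groups[[i]], fixed = TRUE)]
    if (length(unmarked) > 0L) unmarked[1] else x[i]
  }, character(1))
}

first_unmarked(c("a a (s), b b (s), c c", "q q (s)", "m m, k k (s)", "r r"))
# expected values: "c c", "q q (s)", "m m", "r r"
```

Applied to the data above, this would be used as `df1$COL1 <- first_unmarked(df1$COL1)`. Using `vapply` with `character(1)` guarantees a plain character vector of the same length as the input, whatever the number of groups per cell.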