Regex to match sentences with adjacent and non-adjacent word repetition in R
I have a dataframe with sentences; in some sentences, words are used more than once:
df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
"it 's like being in a play-group , in n it ?",
"oh is that that steak i got the other night ?",
"well where have the middle sized soda stream bottle gone ?",
"this is a half day , right ? needs a full day",
"yourself , everybody 'd be changing your hair in n it ?",
"cos he finishes at four o'clock on that day anyway .",
"no no no i 'm dave and you 're alan .",
"yeah , i mean the the film was quite long though",
"it had steve martin in it , it 's a comedy",
"oh it is a dreary old day in n it ?",
"no it 's not mother theresa , it 's saint theresa .",
"oh have you seen that face lift job he wants ?",
"yeah bolshoi 's right so which one is it then ?"))
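I want to match those sentences in which a word (any word) is repeated one or more times.
EDIT 1:
The repeated words **can** be adjacent, but they need not be. That is why Regular Expression For Consecutive Duplicate Words does not answer my question.
I have had modest success with this code:
df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]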
[1] well this is what the grumble about do n't they ?
[2] it 's like being in a play-group , in n it ?
[3] oh is that that steak i got the other night ?
[4] this is a half day , right ? needs a full day
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .
[7] yeah , i mean the the film was quite long though
[8] it had steve martin in it , it 's a comedy
[9] oh it is a dreary old day in n it ?
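The success is modest because some sentences are matched that should not be, e.g. "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g. "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?
Expected result: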
df
Turn
2 it 's like being in a play-group , in n it ?
3 oh is that that steak i got the other night ?
5 this is a half day , right ? needs a full day
8 no no no i 'm dave and you 're alan .
9 yeah , i mean the the film was quite long though
10 it had steve martin in it , it 's a comedy
11 oh it is a dreary old day in n it ?
12 no it 's not mother theresa , it 's saint theresa .
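To see why the pattern from EDIT 1 misfires, it can help to print the substring it actually matches. The following is only a diagnostic sketch: the variable names are made up for illustration, and perl = TRUE is an assumption added here to make the matching order predictable (the original call does not use it).

```r
# The asker's pattern, with backslashes doubled as R string literals require
pattern <- "(\\w+\\b\\s)\\1{1,}"

false_positive <- "yourself , everybody 'd be changing your hair in n it ?"
false_negative <- "no it 's not mother theresa , it 's saint theresa ."

# Extract the exact substring that triggers the unwanted match.
# There is no \\b before \\w+, so the group may start inside a word:
# the "n " at the end of "in " is followed by the word "n ".
fp_hit <- regmatches(false_positive,
                     regexpr(pattern, false_positive, perl = TRUE))
print(fp_hit)    # "n n "

# \\1 must follow the group immediately, so non-adjacent repeats
# such as "it ... it" are never tested.
fn_found <- grepl(pattern, false_negative, perl = TRUE)
print(fn_found)  # FALSE
```

So the two problems are independent: the missing leading word boundary causes the false positives, and the lack of anything between the group and the backreference causes the false negatives.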
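EDIT 2:
Another question is how to define the exact number of repeated words. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for three occurrences of a word, I get this code and this result:
df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]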
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
But again the match is imperfect, because the expected result would be:
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy # "it" occurs 3 times
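Any help is much appreciated!
Options for defining the exact number of repeated words.
Extract sentences in which the same word occurs 3 times
Change the regex.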
(\s?\b\w+\b\s)(.*\1){2}
(\s?\b\w+\b\s) captured by Group 1
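- \s? : a whitespace character, zero or one time.
- \b\w+\b : a whole word (one or more word characters between word boundaries).
- \s : a whitespace character, exactly once.
(.*\1) captured by Group 2
- (.*\1) : any characters, zero or more times, followed by another match of Group 1.
- (.*\1){2} : Group 2 matched twice.
Code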
df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
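A possible variant, offered as a sketch rather than as the answer's own method, puts word boundaries on both sides of the captured word and of the backreference, so that the whitespace is no longer part of the group and a word at the very end of a sentence can also count. The vector name turns and the two result names are made up for illustration. Note the assumptions: \w+ splits tokens at apostrophes and hyphens (so "'s" is treated as the word "s"), matching is case-sensitive, and {2} means "at least three occurrences", not "exactly three".

```r
turns <- c("well this is what the grumble about do n't they ?",
           "it 's like being in a play-group , in n it ?",
           "oh is that that steak i got the other night ?",
           "well where have the middle sized soda stream bottle gone ?",
           "this is a half day , right ? needs a full day",
           "yourself , everybody 'd be changing your hair in n it ?",
           "cos he finishes at four o'clock on that day anyway .",
           "no no no i 'm dave and you 're alan .",
           "yeah , i mean the the film was quite long though",
           "it had steve martin in it , it 's a comedy",
           "oh it is a dreary old day in n it ?",
           "no it 's not mother theresa , it 's saint theresa .",
           "oh have you seen that face lift job he wants ?",
           "yeah bolshoi 's right so which one is it then ?")

# \\b(\\w+)\\b captures one whole word; .*\\b\\1\\b requires the same
# whole word to appear again anywhere later in the sentence.
at_least_twice <- grepl("\\b(\\w+)\\b.*\\b\\1\\b", turns, perl = TRUE)

# Repeating the "anything, then the same word" part twice asks for
# at least three occurrences in total.
at_least_three <- grepl("\\b(\\w+)\\b(?:.*\\b\\1\\b){2}", turns, perl = TRUE)

print(which(at_least_twice))  # 2 3 5 8 9 10 11 12
print(which(at_least_three))  # 8 10
```

On the sample data the first pattern selects the rows listed in the question's expected result, and the second selects the two sentences with a threefold word.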
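- Use strsplit(split = "\\s") to split the sentences into words.
- Use sapply and table to count how often each word occurs in each list element, then select the sentences that meet the requirement.
Code
library(magrittr)
# Make sure Turn is character (it is a factor under stringsAsFactors = TRUE)
df$Turn %<>% as.character()
# For each sentence, keep only the words whose count is exactly 3
s <- strsplit(df$Turn, "\\s") %>% sapply(., function(i) table(i) %>% .[. == 3])
df$Turn[which(s != 0)]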
# [1] "no no no i 'm dave and you 're alan ."
# [2] "it had steve martin in it , it 's a comedy"
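The same counting idea can be sketched in base R with vapply, which returns one number per sentence and so avoids comparing a list against 0. This is an illustrative alternative, not the answer's code: turns and max_count are made-up names, only four of the sample sentences are used, and dropping punctuation-only tokens is an added assumption.

```r
turns <- c("no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "this is a half day , right ? needs a full day",
           "cos he finishes at four o'clock on that day anyway .")

# Highest count of any single token in each sentence; tokens without a
# letter or digit (",", "?", ".") are dropped before counting.
max_count <- vapply(strsplit(turns, " ", fixed = TRUE), function(w) {
  w <- w[grepl("[[:alnum:]]", w)]
  if (length(w) == 0L) 0L else max(table(w))
}, integer(1))

print(max_count)          # 3 3 2 1

turns[max_count == 3]     # the most frequent word occurs exactly three times
turns[max_count >= 2]     # some word occurs at least twice
```

Because the result is a plain integer vector, the threshold can be changed without touching the counting step.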
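Hope this helps :)
I would rather approach this task in a different way. First, I added a group variable (id) to the original data frame. Then I counted how many times each word appears in each sentence and stored the result in a data frame, mytemp.
library(tidyverse)
# Add a sentence id so that counts can be traced back to the original rows
df <- mutate(df, id = 1:n())
# One row per (sentence, word) pair together with its frequency
mytemp <- df %>%
  mutate(word = strsplit(x = Turn, split = " ")) %>%
  unnest(word) %>%
  count(id, word, name = "frequency", sort = TRUE)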
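With this data frame, identifying the sentences is straightforward. I subsetted the data and obtained the ids of the sentences in which a word appears three times. In the same way, I identified the words that appear more than once and obtained their ids. Finally, I subsetted the original data using the id numbers in three and twice.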
# Search words that appear 3 times
three <- filter(mytemp, frequency == 3) %>%
pull(id) %>%
unique()
# Search words that appear more than once.
twice <- filter(mytemp, frequency > 1) %>%
pull(id) %>%
unique()
# Go back to the original data and handle subsetting
filter(df, id %in% three)
Turn id
<chr> <int>
1 no no no i 'm dave and you 're alan . 8
2 it had steve martin in it , it 's a comedy 10
filter(df, id %in% twice)
Turn id
<chr> <int>
1 it 's like being in a play-group , in n it ? 2
2 oh is that that steak i got the other night ? 3
3 this is a half day , right ? needs a full day 5
4 no no no i 'm dave and you 're alan . 8
5 yeah , i mean the the film was quite long though 9
6 it had steve martin in it , it 's a comedy 10
7 oh it is a dreary old day in n it ? 11
8 no it 's not mother theresa , it 's saint theresa . 12
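For readers without the tidyverse, the long-format counting above can be approximated in base R. This is a rough sketch under stated assumptions, not a translation of the answer: turns, long and freq are hypothetical names, only three of the sample sentences are used, and aggregate stands in for count.

```r
turns <- c("no no no i 'm dave and you 're alan .",
           "oh is that that steak i got the other night ?",
           "well where have the middle sized soda stream bottle gone ?")

words <- strsplit(turns, " ", fixed = TRUE)

# Long format: one row per token, tagged with the sentence it came from
long <- data.frame(id = rep(seq_along(words), lengths(words)),
                   word = unlist(words),
                   n = 1L)

# Sum the ones within each (id, word) pair to get per-sentence frequencies
freq <- aggregate(n ~ id + word, data = long, FUN = sum)

# Sentence ids with a word occurring exactly three times / more than once
three <- sort(unique(freq$id[freq$n == 3]))
twice <- sort(unique(freq$id[freq$n > 1]))

print(turns[three])
print(turns[twice])
```

As in the answer, the ids are then used to subset the original sentences, so the word-level table stays available for inspecting which word caused a sentence to be selected.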