How to find phrases that are the same between strings in R
Suppose I have the following strings:
c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
I only want to extract the words or phrases that are shared between the strings, so the result should be: "Date of Procedure", "Patient name", "Type of procedure", "Label". I tried using tidytext, but it forces me to specify the n-gram size I want, while the shared phrase could be one, two, or three words long.
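For context, this is roughly the kind of tidytext call the question refers to (a minimal sketch; it assumes the strings above are stored in a character vector called text, and the tibble name df is illustrative):

library(tibble)
library(tidytext)

df <- tibble(line = seq_along(text), txt = text)
# unnest_tokens() tokenizes into n-grams of one fixed size per call:
unnest_tokens(df, ngram, txt, token = "ngrams", n = 2)  # bigrams only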
When using unnest_tokens from tidytext with n-grams, you cannot specify that numbers or other unwanted characters be removed. In this case, switching to the quanteda package helps. The comments in the code explain the steps.
library(quanteda)
library(quanteda.textstats) # textstat_frequency() lives here in quanteda >= 3.0
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
# tokenize text and remove punctuation and numbers
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)
# create 1-, 2- and 3-grams
toks_grams <- tokens_ngrams(toks, n = 1:3)
# transform into a document-feature matrix (this step can be combined with the next one)
my_dfm <- dfm(toks_grams)
# turn the terms into a frequency table and drop the ones that occur only once
# depending on your needs, you can filter out single-word n-grams or filter on a higher occurring frequency.
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]
                    feature frequency rank docfreq group
 1                       of         9    1       9   all
 2                procedure         9    1       9   all
 3             of_procedure         9    1       9   all
 4                  patient         6    4       6   all
 5                     name         6    4       6   all
 6             patient_name         6    4       6   all
 7                    label         6    4       6   all
 8                     type         5    8       5   all
 9                  type_of         5    8       5   all
10        type_of_procedure         5    8       5   all
11                     date         4   11       4   all
12                  date_of         4   11       4   all
13        date_of_procedure         4   11       4   all
14             colonoscopy         3   14       3   all
15    procedure_colonoscopy         3   14       3   all
16 of_procedure_colonoscopy         3   14       3   all
17                      ogd         2   17       2   all
18            procedure_ogd         2   17       2   all
19         of_procedure_ogd         2   17       2   all
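If you want just the shared phrases themselves rather than the whole table, one option is to keep the n-grams shared across enough documents and then drop every n-gram that is contained in a longer surviving one. A rough sketch continuing from the freqs object above (the docfreq > 3 threshold is tuned to this toy data, and the keep-longest logic is my own addition, not part of the original answer):

# keep n-grams shared by more than 3 documents (threshold tuned to this toy data)
shared <- freqs[freqs$docfreq > 3, ]
# drop an n-gram when a longer surviving n-gram contains it,
# e.g. "of_procedure" is absorbed by "type_of_procedure"
is_sub <- sapply(shared$feature, function(f) {
  any(grepl(f, setdiff(shared$feature, f), fixed = TRUE))
})
gsub("_", " ", shared$feature[!is_sub])
# should give something like (lower-cased by dfm()):
# [1] "patient name"      "label"             "type of procedure" "date of procedure"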