How to find phrases that are the same between strings in R
Suppose I have the following strings:
c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
I only want to extract the words or phrases that are shared between the strings, so the result should be: "Date of Procedure", "Patient name", "Type of procedure", "Label". I tried using tidytext, but it forces me to specify the n-gram size I want, while the shared phrase could be one, two, or three words long.
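For context, this is roughly the kind of tidytext call the question refers to (a minimal sketch; it assumes the strings above are stored in a character vector called text, and the tibble name df is illustrative):

library(tibble)
library(tidytext)

df <- tibble(line = seq_along(text), txt = text)
# unnest_tokens() tokenizes into n-grams of one fixed size per call:
unnest_tokens(df, ngram, txt, token = "ngrams", n = 2)  # bigrams only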
When using unnest_tokens from tidytext with n-grams, you cannot specify that numbers or other unwanted characters be removed. In this case, switching to the quanteda package helps. The comments in the code explain the steps.
library(quanteda)
library(quanteda.textstats) # textstat_frequency() lives here in quanteda >= 3.0
text <- c(">Date of Procedure 01/09/2018<", ">Date of Procedure 01/10/2018<",
">Date of Procedure 03/09/2018<", ">Date of Procedure 04/09/2018<",
"Patient name Bilbo baggins", "Patient name: Jonny Begood",
"Patient name Elma Fudd", "Patient name Miss Puddleduck", "Patient name: Itsy Bitsy",
"Patient name: Lala", "Type of procedure: OGD", "Type of procedure: OGD",
"Type of procedure: Colonoscopy", "Type of procedure Colonoscopy",
"Type of procedure: Colonoscopy", "Label 35252", "Label 543 ",
"Label 5254 ", "Label 23", "Label 555555 ", "Label 54354")
# tokenize text and remove punctuation and numbers
toks <- tokens(text, remove_numbers = TRUE, remove_punct = TRUE)
# create 1-, 2- and 3-grams
toks_grams <- tokens_ngrams(toks, n = 1:3)
# transform into a document-feature matrix (this step can be combined with the next one)
my_dfm <- dfm(toks_grams)
# turn the terms into a frequency table and drop the ones that occur only once
# depending on your needs, you can filter out single-word n-grams or filter on a higher occurring frequency.
freqs <- textstat_frequency(my_dfm)
freqs[freqs$frequency > 1, ]
                    feature frequency rank docfreq group
 1                       of         9    1       9   all
 2                procedure         9    1       9   all
 3             of_procedure         9    1       9   all
 4                  patient         6    4       6   all
 5                     name         6    4       6   all
 6             patient_name         6    4       6   all
 7                    label         6    4       6   all
 8                     type         5    8       5   all
 9                  type_of         5    8       5   all
10        type_of_procedure         5    8       5   all
11                     date         4   11       4   all
12                  date_of         4   11       4   all
13        date_of_procedure         4   11       4   all
14             colonoscopy         3   14       3   all
15    procedure_colonoscopy         3   14       3   all
16 of_procedure_colonoscopy         3   14       3   all
17                      ogd         2   17       2   all
18            procedure_ogd         2   17       2   all
19         of_procedure_ogd         2   17       2   all
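If you want just the shared phrases themselves rather than the whole table, one option is to keep the n-grams shared across enough documents and then drop every n-gram that is contained in a longer surviving one. A rough sketch continuing from the freqs object above (the docfreq > 3 threshold is tuned to this toy data, and the keep-longest logic is my own addition, not part of the original answer):

# keep n-grams shared by more than 3 documents (threshold tuned to this toy data)
shared <- freqs[freqs$docfreq > 3, ]
# drop an n-gram when a longer surviving n-gram contains it,
# e.g. "of_procedure" is absorbed by "type_of_procedure"
is_sub <- sapply(shared$feature, function(f) {
  any(grepl(f, setdiff(shared$feature, f), fixed = TRUE))
})
gsub("_", " ", shared$feature[!is_sub])
# should give something like (lower-cased by dfm()):
# [1] "patient name"      "label"             "type of procedure" "date of procedure"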