Getting a unique count from structured text data

Question

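I would like to know how to get the number of unique codes in the text strings of a structured dataset. This is a follow-up to my previous post. I want the unique count of apple (coded as App), banana (coded as Ban), pineapple (coded as Pin), and grape (coded as Grp).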

    text <- c('AppPinAppBan', 'AppPinOra', 'AppPinGrpLonNYC')
    df <- data.frame(text)

    library(stringr)
    # My attempt: str_count() counts every match, so repeated codes are counted more than once
    df$fruituniquecount <- str_count(df$text, "App|Ban|Pin|Grp")

    ## I am expecting output as follows:

    text             fruituniquecount
    AppPinAppBan     3
    AppPinOra        2
    AppPinGrpLonNYC  3

Answer 1

This can probably be done in base R, with no external packages.

    m <- gregexpr("App|Ban|Pin|Grp", df$text)
    df$fruituniquecount <- lengths(lapply(regmatches(df$text, m), unique))

    df
    #             text fruituniquecount
    #1    AppPinAppBan                3
    #2       AppPinOra                2
    #3 AppPinGrpLonNYC                3
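To see why the `unique()` step matters, here is a minimal sketch that runs the same base R calls one step at a time. It works on a plain character vector rather than `df$text`, and the intermediate names `hits`, `total_counts`, and `unique_counts` are only illustrative.

```r
text <- c("AppPinAppBan", "AppPinOra", "AppPinGrpLonNYC")

# Locate every occurrence of any of the four fruit codes in each string.
m <- gregexpr("App|Ban|Pin|Grp", text)

# Pull out the matched substrings: one character vector per input string.
hits <- regmatches(text, m)

# Counting all matches includes repeats ("App" appears twice in the first string).
total_counts <- lengths(hits)

# Dropping duplicates within each string first gives the unique count.
unique_counts <- lengths(lapply(hits, unique))

print(hits[[1]])
print(total_counts)
print(unique_counts)
```

The first string yields four matches but only three distinct codes, which is the gap between counting matches and counting unique matches.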

Answer 2

Following the approach of the accepted answer to your previous question, you could do:

    library(stringr)

    sapply(str_extract_all(df$text, "App|Ban|Pin|Grp"), function(i) length(unique(i)))
    # [1] 3 2 3
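If the list of codes is likely to change, the extract-then-deduplicate idea can be wrapped in a small function. The sketch below is one possible way to do it, written in base R so it runs without extra packages. `count_unique_codes()` is a hypothetical helper name, not part of stringr or base R, and it assumes the codes are plain alphanumeric strings with no regex metacharacters.

```r
# Hypothetical helper: number of distinct codes found in each string.
# Assumes `codes` contains plain alphanumeric strings (no regex metacharacters).
count_unique_codes <- function(text, codes) {
  # Build an alternation pattern such as "App|Ban|Pin|Grp".
  pattern <- paste(codes, collapse = "|")
  # Extract all matches: one character vector per input string.
  hits <- regmatches(text, gregexpr(pattern, text))
  # vapply() enforces exactly one integer result per input string.
  vapply(hits, function(x) length(unique(x)), integer(1))
}

text <- c("AppPinAppBan", "AppPinOra", "AppPinGrpLonNYC")
counts <- count_unique_codes(text, c("App", "Ban", "Pin", "Grp"))
print(counts)
```

Using `vapply()` instead of `sapply()` is a design choice here: it fails loudly if a row ever produces something other than a single integer, rather than silently returning a list.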

Tags: r, stringr, dplyr, stringi