How to extract key phrases following specific characters using regex in R?

Question

I have a data frame that looks like this:

ID | Tweet_ID | Tweet
1    12345      @sprintcare I did.
2    SPRINT     @12345 Please send us a Private Message.
3    45678      @apple My information is incorrect.
4    APPLE      @45678 What information is incorrect.

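What I want to do is use case_when statements to create a new field that extracts the handle from every tweet addressed to a company-name handle and ignores tweets addressed to numeric handles.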

The code I am currently trying, without success:

tweet_pattern <- " @[^0-9.-]\\w+"

Customer <- Customer %>% 
           mutate(Response_To_Comp = ifelse(str_detect(Tweet, tweet_pattern), 
                                str_extract(Tweet, tweet_pattern), 
                                NA_character_))

Desired output:

ID | Tweet_ID | Tweet                                    | Response_To_Comp
1    12345      @sprintcare I did.                         sprintcare
2    SPRINT     @12345 Please send us a Private Message.   NA
3    45678      @apple My information is incorrect.        apple
4    APPLE      @45678 What information is incorrect.      NA

Answer 1

You can use a lookbehind regex to extract the text that follows '@' and consists of one or more A-Za-z characters.

library(dplyr)
library(stringr)

tweet_pattern <- "(?<=@)[A-Za-z]+"

Customer %>% mutate(Response_To_Comp = str_extract(Tweet, tweet_pattern))

#  ID Tweet_ID                                    Tweet Response_To_Comp
#1  1    12345                       @sprintcare I did.       sprintcare
#2  2   SPRINT @12345 Please send us a Private Message.             <NA>
#3  3    45678      @apple My information is incorrect.            apple
#4  4    APPLE    @45678 What information is incorrect.             <NA>
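
To make the effect of the lookbehind concrete, here is a minimal sketch on a bare character vector, assuming the stringr package is installed. The names `tweets`, `with_at`, and `without_at` are hypothetical and used only for illustration. It compares a pattern that consumes the '@' with one that only requires it to precede the match.

```r
library(stringr)

# Two of the sample tweets: one company handle, one numeric handle.
tweets <- c("@sprintcare I did.",
            "@12345 Please send us a Private Message.")

# Without a lookbehind, the "@" is part of the extracted match.
with_at <- str_extract(tweets, "@[A-Za-z]+")

# With the lookbehind, "@" must come right before the match
# but is not included in the result.
without_at <- str_extract(tweets, "(?<=@)[A-Za-z]+")

print(with_at)     # "@sprintcare" NA
print(without_at)  # "sprintcare"  NA
```

In both cases the numeric handle gives NA, because no letter directly follows the '@' and str_extract returns NA when nothing matches.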

Answer 2

Using str_detect and str_replace

library(stringr)
library(dplyr)
Customer %>%
    # case_when has no fallback branch, so rows that fail str_detect get NA
    mutate(Response_to_Comp = case_when(str_detect(Tweet, "@[^0-9-]+") ~ 
      str_replace(Tweet, "@([A-Za-z]+)\\s+.*", "\\1")))
  ID Tweet_ID                                    Tweet Response_to_Comp
1  1    12345                       @sprintcare I did.       sprintcare
2  2   SPRINT @12345 Please send us a Private Message.             <NA>
3  3    45678      @apple My information is incorrect.            apple
4  4    APPLE    @45678 What information is incorrect.             <NA>
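
The str_detect guard matters because str_replace leaves a string unchanged when the pattern does not match. The following is a minimal sketch of that behaviour, assuming stringr is installed. The names `tweets`, `replaced`, and `guarded` are hypothetical, and `ifelse` stands in for `case_when` only to keep the example free of dplyr.

```r
library(stringr)

tweets <- c("@sprintcare I did.",
            "@12345 Please send us a Private Message.")

# On its own, str_replace returns the full original tweet
# for the numeric handle, because the pattern never matches it.
replaced <- str_replace(tweets, "@([A-Za-z]+)\\s+.*", "\\1")

# Guarding with str_detect turns those unmatched rows into NA.
guarded <- ifelse(str_detect(tweets, "@[^0-9-]+"),
                  replaced,
                  NA_character_)

print(replaced)  # "sprintcare" "@12345 Please send us a Private Message."
print(guarded)   # "sprintcare" NA
```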

Data

Customer <- structure(list(ID = 1:4, Tweet_ID = c("12345", "SPRINT", "45678", 
"APPLE"), Tweet = c("@sprintcare I did.", "@12345 Please send us a Private Message.", 
"@apple My information is incorrect.", "@45678 What information is incorrect."
)), class = "data.frame", row.names = c(NA, -4L))
