Remove hashtags from beginning and end of tweets in R

I am trying to remove hashtags from the end of a string in R. For example:

 x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

I want to remove the #lateNightThoughts and #movie hashtags at the end of the string. Expected result:

 - "I didn't know it could be #boring. guess I need some fun"

I tried:

stringi::stri_replace_last_regex(x, '#\\S+', "")

But it only removes the last hashtag:

- "I didn't know it could be #boring. guess I need some fun #movie "

Any idea how to get the expected result?

Edit:

How can I remove hashtags from the beginning of the text as well? For example:

x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"

You can use

> x <- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"

Or, if you do not care about the context of the first # where the match starts, you can simply use

sub("(?:\\s*#\\w+)+\\s*$", "", x)

See the regex demo.

Details
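To make the difference between the two patterns concrete, here is a small sketch comparing them on a string where the trailing # sits directly after a word character. The variable names `strict`, `loose`, and `y`, the sample string, and the `perl = TRUE` argument are assumptions of this sketch rather than part of the answer; `perl = TRUE` is only there to pin the regex engine to PCRE.

```r
# Pattern with the non-word boundary check before the first '#'.
strict <- "\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$"
# Pattern without the boundary check.
loose <- "(?:\\s*#\\w+)+\\s*$"

x <- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
y <- "see item#42"  # hypothetical input: '#' directly follows a word character

# Both patterns strip the trailing run of hashtags from x.
res_strict_x <- sub(strict, "", x, perl = TRUE)
res_loose_x  <- sub(loose,  "", x, perl = TRUE)

# Only the loose pattern touches the '#42' glued to 'item'.
res_strict_y <- sub(strict, "", y, perl = TRUE)  # "see item#42"
res_loose_y  <- sub(loose,  "", y, perl = TRUE)  # "see item"

print(c(res_strict_x, res_loose_x, res_strict_y, res_loose_y))
```

If mid-word # characters such as the one in `y` should be left alone, the variant with `\\B` is probably the safer choice.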

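  • \s* - zero or more whitespace characters
  • \B - a non-word boundary: right before the current position there must be either the start of the string or a non-word character (this is usually used to make sure you do not match a # that immediately follows a word character; if you do not need that check, you can remove this non-word boundary)
  • # - a # character
  • \w+ - one or more word characters (letters, digits, or _)
  • (?:\s*#\w+)* - zero or more occurrences of:
    • \s* - zero or more whitespace characters
    • # - a # character
    • \w+ - one or more word characters
  • \s* - zero or more whitespace characters
  • $ - end of string.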
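For the case raised in the question's edit (hashtags at the beginning of the text), the same building blocks can be anchored at the start with ^ instead of $. The following is only a sketch under that assumption: the helper name `strip_edge_hashtags` is hypothetical, and `perl = TRUE` is added here just to pin the regex engine to PCRE.

```r
# Hypothetical helper: removes runs of hashtags at both ends of a string.
strip_edge_hashtags <- function(x) {
  # Leading run: optional whitespace, then one or more "#word" tokens,
  # each followed by optional whitespace.
  x <- sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
  # Trailing run: the simpler pattern from the answer above.
  sub("(?:\\s*#\\w+)+\\s*$", "", x, perl = TRUE)
}

x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
result <- strip_edge_hashtags(x)
print(result)
# [1] "I didn't know it could be #boring. guess I need some fun"
```

The #boring hashtag in the middle is kept because neither pattern can reach it: the first is anchored to the start of the string and the second to the end.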