Remove hashtags from beginning and end of tweets in R
I am trying to remove hashtags from the end of a string in R.
For example:
x<- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
I want to remove the #lateNightThoughts and #movie hashtags at the end of the string. Expected result:
- "I didn't know it could be #boring. guess I need some fun"
I tried:
stringi::stri_replace_last_regex(x, '#\\S+', "")
But it only removes the last hashtag:
- "I didn't know it could be #boring. guess I need some fun #movie "
Any idea how to get the expected result?
Edit:
How can I also remove hashtags from the beginning of the text?
For example:
x<- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
You can use
> x <- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
> sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x)
[1] "I didn't know it could be #boring. guess I need some fun"
Alternatively, if you do not care about the context in which the first # of the match appears, you can simply use
sub("(?:\\s*#\\w+)+\\s*$", "", x)
See the regex demo.
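To see what the \B check changes in practice, here is a small sketch comparing the two patterns. The second input (`y`) is a made-up string that is not part of the question, and `perl = TRUE` is my own choice to make the escapes unambiguous; the answer itself does not use it.

```r
# The trailing-hashtag pattern with and without the non-word-boundary check.
with_B    <- "\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$"
without_B <- "(?:\\s*#\\w+)+\\s*$"

x <- "I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
# Hypothetical input: the '#' is glued to a word, so it is probably not a hashtag.
y <- "the ticket id is abc#123"

res_x_with    <- sub(with_B, "", x, perl = TRUE)
res_x_without <- sub(without_B, "", x, perl = TRUE)
res_y_with    <- sub(with_B, "", y, perl = TRUE)     # \B blocks the match, y is unchanged
res_y_without <- sub(without_B, "", y, perl = TRUE)  # "#123" is stripped

print(c(res_x_with, res_x_without, res_y_with, res_y_without))
```

Both patterns give the same result on the tweet from the question; they only differ when the first # of the trailing run directly follows a word character.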
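Details
\s* - zero or more whitespace characters
\B - a non-word boundary: right before the current position there must be either the start of the string or a non-word character (this is usually used to make sure you do not match a # that directly follows a word character, so you can remove this non-word boundary if you do not need it)
# - a # character
\w+ - one or more word characters (letters, digits or _)
(?:\s*#\w+)* - zero or more occurrences of:
  \s* - zero or more whitespace characters
  # - a # character
  \w+ - one or more word characters
\s* - zero or more whitespace characters
$ - end of string.
In an R string literal each of these backslashes has to be doubled (for example, \\s*).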
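The edit in the question also asks about hashtags at the beginning of the text, which the patterns above do not cover. One possible sketch, assuming a leading run of hashtags should be handled the same way as a trailing one: anchor a similar pattern at ^ and apply both substitutions in turn. The helper name `strip_edge_hashtags` and the use of `perl = TRUE` are my own assumptions, not part of the original answer.

```r
# Hypothetical helper (the name is mine): strip runs of hashtags at both ends of a string.
strip_edge_hashtags <- function(x) {
  # Leading run: start of string, optional whitespace, then one or more "#tag" items,
  # each followed by optional whitespace.
  x <- sub("^\\s*(?:#\\w+\\s*)+", "", x, perl = TRUE)
  # Trailing run: the same pattern as in the answer above.
  sub("\\s*\\B#\\w+(?:\\s*#\\w+)*\\s*$", "", x, perl = TRUE)
}

x <- "#Thomas20 I didn't know it could be #boring. guess I need some fun #movie #lateNightThoughts"
cleaned <- strip_edge_hashtags(x)
print(cleaned)

# A string made only of hashtags is reduced to an empty string.
only_tags <- strip_edge_hashtags("#a #b")
```

Hashtags in the middle of the text, such as #boring, are left alone because neither pattern can reach them from its anchor. Since `sub` is vectorized, the helper can also be applied to a character vector of tweets.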