regex remove punct removes non-punctuation characters in R

Question

While filtering and cleaning Hebrew text, I found that

gsub("[[:punct:]]", "", txt)

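actually removes a character that is part of the text. The character is “ק”, which sits at the "E" position on the keyboard. The gsub function in R strips out “ק”, and every word that contains it ends up garbled. Does anyone know why?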

Answer 1

According to Regular Expressions as used in R:

Certain named classes of characters are predefined. Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.
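Since the class depends on the locale, one way to investigate is to look at the active LC_CTYPE locale and at which characters [[:punct:]] actually matches in the current session. The following is a minimal diagnostic sketch; punct_hits is a hypothetical helper name introduced here, not a function provided by R, and what it returns for Hebrew letters will depend on your platform and locale.

```r
# Show the character-classification locale that [[:punct:]] depends on.
Sys.getlocale("LC_CTYPE")

# Hypothetical helper: return the distinct characters of x that
# [[:punct:]] matches under the current locale.
punct_hits <- function(x) {
  chars <- unique(strsplit(x, "")[[1]])
  chars[grepl("[[:punct:]]", chars)]
}

# ASCII punctuation is reported as expected.
punct_hits("a,b.c!")

# Hebrew sample written with \u escapes (qof, resh, alef) so it does not
# depend on the encoding of the script file. If qof shows up in the result,
# the locale's classification is the likely culprit.
punct_hits("\u05e7\u05e8, \u05d0!")
```

If a letter appears in the second result, the problem lies in how the session classifies characters rather than in the gsub call itself.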

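In the POSIX locale, [[:punct:]] should match ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. In other locales the set can differ, so you may need to adjust your regular expression to remove only the characters you actually want removed: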

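# All 32 ASCII punctuation characters; only " and \ need escaping in an R string.
txt <- "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"
# Explicit character class: [, \ and ] are escaped for the regex, and - goes last so it is literal.
gsub("[!\"#$%&'()*+,./:;<=>?@\\[\\\\\\]^_`{|}~-]", "", txt, perl = TRUE)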

Sample program output:

[1] ""
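The same 32 ASCII characters can also be written as four code-point ranges, which is shorter and avoids most of the escaping. This is a sketch rather than a tested fix for the original Hebrew data: strip_ascii_punct is a hypothetical name, and the Hebrew sample uses \u escapes so that it does not depend on the encoding of the script file.

```r
# Hypothetical helper: remove only ASCII punctuation, using explicit ranges
# instead of the locale-dependent [[:punct:]] class.
#   !-/   covers ! " # $ % & ' ( ) * + , - . /
#   :-@   covers : ; < = > ? @
#   \[-`  covers [ \ ] ^ _ `
#   {-~   covers { | } ~
strip_ascii_punct <- function(x) {
  gsub("[!-/:-@\\[-`{-~]", "", x, perl = TRUE)
}

strip_ascii_punct("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

# Hebrew letters (qof, resh, alef) fall outside every range and are kept.
strip_ascii_punct("\u05e7\u05e8, \u05d0!")
```

Because the ranges are spelled out as literal ASCII characters, the pattern does not rely on how the locale classifies non-ASCII letters.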

regex

r

punctuation