General method to trim non-printable characters in Clojure

Question

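I ran into a bug where two seemingly identical strings would not match. For example, the following two strings do not match: "sample" and "sample".

To reproduce the issue, run the following in Clojure:

(= "sample" "\u200Bsample") ; returns false; the second string starts with an invisible U+200B, written here as an escape

After an hour of frustrating debugging, I discovered that there is a zero-width space at the front of the second string! Removing it from this particular example with backspace is trivial. However, I have a database of strings to match, and several of them seem to have the same problem. My question is: is there a general way to trim zero-width spaces in Clojure?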

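Some things I have tried:

(count (clojure.string/trim "\u200Babc")) ; returns 4

(count (clojure.string/replace "\u200Babc" #"\s" "")) ; returns 4

The post "Remove zero-width space characters from a JavaScript string" does provide a regex solution that works for this example, namely:

(count (clojure.string/replace "\u200Babc" #"[\u200B-\u200D\uFEFF]" "")) ; returns 3

However, as that post itself notes, there are many other characters that may be invisible. So I am still interested in whether there is a more general approach that does not depend on listing every possible invisible Unicode character.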

Answer 1

I believe you are referring to so-called non-printable characters. Based on the approach used in Java, you can pass the regex #"\p{C}" as the pattern to replace:

(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"\p{C}" ""))

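However, this would also remove line breaks such as \n. To keep those characters, we need a slightly more complex regex:

;; Intersect \p{C} with non-whitespace ([^\s]) so that \n, \t, etc. are kept.
(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"[\p{C}&&[^\s]]" ""))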

This function removes non-printable characters. Let's test it:

(= "sample" "\u200Bsample")
;; => false

(= (remove-non-printable-characters "sample")
   (remove-non-printable-characters "\u200Bsample"))
;; => true

(remove-non-printable-characters "sam\nple")
;; => "sam\nple"
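
Before stripping anything from a whole database of strings, it can help to see which characters are actually hiding in them. The following is only a minimal sketch: the helper name invisible-chars is hypothetical (it is not part of any library), and it uses an explicitly nested form of the class intersection, \p{C} minus whitespace.

```clojure
;; Hypothetical inspection helper: lists the code points in s that match
;; \p{C} but are not whitespace, formatted as "U+XXXX" labels.
(defn invisible-chars
  [s]
  (mapv #(format "U+%04X" (.codePointAt ^String % 0))
        (re-seq #"[\p{C}&&[^\s]]" s)))

(invisible-chars "\u200Bsam\nple\uFEFF")
;; => ["U+200B" "U+FEFF"]
```

Running something like this over a sample of the stored strings shows which invisible characters are really present before deciding to remove them all.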

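For more background, see the discussion of the \p{C} pattern. It matches the Unicode general category "Other": control, format, surrogate, private-use, and unassigned characters.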
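
To make the scope of \p{C} concrete, here is a small sketch that probes a few characters. The helper name in-category-c? is hypothetical and exists only for this illustration. Note that the non-breaking space (U+00A0) belongs to the space-separator category rather than to "Other", so as far as I can tell neither \p{C} nor the default \s will catch it.

```clojure
;; Hypothetical helper: true when s contains at least one character
;; matched by the regex class \p{C}.
(defn in-category-c?
  [s]
  (boolean (re-find #"\p{C}" s)))

(in-category-c? "\u200B") ;; => true   zero-width space (format character)
(in-category-c? "\n")     ;; => true   newline (control character)
(in-category-c? "\u00A0") ;; => false  non-breaking space (space separator)
(in-category-c? "a")      ;; => false  ordinary letter
```

If non-breaking spaces also show up in your data, they would need to be handled separately, for example by adding \u00A0 to the character class.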

Answer 2

The regex solution by @Rulle is very nice. The tupelo.chars namespace also has a collection of character classes and predicate functions that could be useful. They work in both Clojure and ClojureScript, and also cover the non-breaking space (&nbsp;) used in browsers. In particular, check out the visible? predicate.

The tupelo.string namespace also has many helper and convenience functions for string processing.

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.chars :as chars]
    [tupelo.string :as str] ))

(def sss
"Some multi-line
string." )

(dotest
  (println "result:")
  (println
    (str/join
      (filterv
        #(or (chars/visible? %) 
             (chars/whitespace? %))
        sss))))

Result:

result:
Some multi-line
string.
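
If you would rather not add a dependency, the same filtering idea can be sketched with plain Java interop. This is not how tupelo implements visible?; it is an independent sketch, and the names other-categories, keep-code-point?, and strip-invisible are hypothetical. It walks code points rather than chars so that characters outside the Basic Multilingual Plane (such as emoji) are not split into surrogate halves and dropped.

```clojure
;; General-category constants from java.lang.Character that together
;; correspond to the Unicode "Other" category.
(def other-categories
  (set (map int [Character/CONTROL Character/FORMAT Character/SURROGATE
                 Character/PRIVATE_USE Character/UNASSIGNED])))

;; Keep a code point if Java considers it whitespace, or if it is not in
;; one of the "Other" categories.
(defn keep-code-point?
  [cp]
  (or (Character/isWhitespace (int cp))
      (not (contains? other-categories (Character/getType (int cp))))))

;; Rebuild the string from the code points that pass the predicate.
(defn strip-invisible
  [^String s]
  (let [sb (StringBuilder.)]
    (doseq [cp (iterator-seq (.iterator (.codePoints s)))]
      (when (keep-code-point? cp)
        (.appendCodePoint sb (int cp))))
    (str sb)))

(strip-invisible "\u200Bsam\nple\uFEFF")
;; => "sam\nple"
```

One difference from the regex approach: Character/isWhitespace is slightly broader than the regex class \s (it also accepts U+001C through U+001F), so this version keeps those four separator characters.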

To use it, make your project.clj look like this:

  :dependencies [
                 [org.clojure/clojure "1.10.2-alpha1"]
                 [prismatic/schema "1.1.12"]
                 [tupelo "20.07.01"]
                 ]
