Why doesn't this sed expression remove lines with Korean as expected?

Question

I combined these two to come up with this sed command:

sed '/[\u3131-\uD79D]/d' text.txt  # Remove all lines with Korean characters

But it outputs only the lines with Korean characters:

$ cat text.txt
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the

$ sed '/[\u3131-\uD79D]/d' text.txt  # Korean characters pattern fails
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게

$ sed '/Hello/d' text.txt           # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게

$ sed '/[0-9]/d' text.txt           # Simple range works
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the

$ sed --version                     # Git Bash for Windows 2.33.0.windows.2
sed (GNU sed) 4.8

Is this a bug in sed? I was able to use the equivalent command successfully in gVim:

:g/[\u3131-\uD79D]/d

Answer 1

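Since sed follows POSIX, this comes down to the collation order used in bracket expressions. You need a collation order that sorts by numeric Unicode code point, such as C.UTF-8, and you then need to encode the range's endpoint characters in UTF-8. There is a detailed explanation here.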
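
One way to see why only the Korean line survived: inside a bracket expression a backslash is an ordinary character, and GNU sed's regex syntax has no `\u` escape, so `[\u3131-\uD79D]` appears to be read as a set of ASCII characters (`\`, `u`, a few digits, `D`, and the range from `1` to `\`) rather than as a Unicode range. The sketch below assumes GNU sed and forces the C locale so that the range is byte-ordered; the function name `demo_literal_bracket` and the sample lines are made up for illustration.

```shell
#!/bin/bash
# Hypothetical demo: which plain-ASCII lines does the original pattern delete?
demo_literal_bracket() {
    # 'abc' contains no character from the set, so it is kept.
    # 'use' contains a literal 'u'; '42' and 'ABC' fall inside the range 1-\
    # (bytes 0x31 through 0x5C), so those three lines are deleted.
    printf '%s\n' 'abc' 'use' '42' 'ABC' |
        LC_ALL=C sed '/[\u3131-\uD79D]/d'
}

demo_literal_bracket
```

Under those assumptions only `abc` should be printed, which matches the question: every line of text.txt except the Korean one contains a digit, an uppercase letter, or a `u`.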

This is how you can apply it to your range in a bash shell (I tested it on Linux):

$ # first get octal representation of range unicode code points
$ # iconv is to convert to utf-8 in case your locale is not utf-8
$ printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1
 343 204 261 355 236 235

$ # format it as a sed range
$ printf '\o%s\o%s\o%s-\o%s\o%s\o%s' $(printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1); echo
\o343\o204\o261-\o355\o236\o235

$ # use the range in sed
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
...
$

This is the output:

$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
1
00:00:00,000 --> 00:00:05,410
Hello, today we're going to explain how to use the

$ sed '/[\u3131-\uD79D]/d' text.txt  # Korean characters pattern fails

$ sed '/Hello/d' text.txt           # Simple pattern works
1
00:00:00,000 --> 00:00:05,410

$ sed '/[0-9]/d' text.txt           # Simple range works
Hello, today we're going to explain how to use the

$
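
The `printf | iconv | od` pipeline above relies on the shell's `printf` expanding `\u` escapes, which in bash generally needs a UTF-8 locale. As an alternative sketch (not the answer's method), the UTF-8 bytes can be computed with shell arithmetic alone. The function names `utf8_octal` and `utf8_octal_range` are hypothetical, and code points are passed as bare hex digits.

```shell
#!/bin/bash
# Hypothetical helper: print the UTF-8 bytes of one code point (hex, e.g. 3131)
# as GNU sed \oNNN escapes, using only bash arithmetic.
utf8_octal() {
    local cp=$((16#$1))
    if (( cp < 0x80 )); then            # 1 byte:  0xxxxxxx
        printf '\\o%03o' "$cp"
    elif (( cp < 0x800 )); then         # 2 bytes: 110xxxxx 10xxxxxx
        printf '\\o%03o\\o%03o' \
            $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
    elif (( cp < 0x10000 )); then       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '\\o%03o\\o%03o\\o%03o' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else                                # 4 bytes: 11110xxx then three 10xxxxxx
        printf '\\o%03o\\o%03o\\o%03o\\o%03o' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

# Print "first-last" in the form expected inside a sed bracket expression.
utf8_octal_range() {
    printf '%s-%s\n' "$(utf8_octal "$1")" "$(utf8_octal "$2")"
}

utf8_octal_range 3131 D79D
```

For 3131 and D79D this prints the same `\o343\o204\o261-\o355\o236\o235` range that the pipeline produced.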

Edit: helper script/functions

This bash script, or its functions, can be used to get the sed Unicode range:

#!/bin/bash

# sur - sed unicode range
#
#     Converts a unicode range into an octal utf-8 range suitable for sed
#
# Usage:
#        sur \u452 \u490
#
#        sur \u3131 \uD79D

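to_octal() {
    # Accept u3131 (what an unquoted \u3131 becomes) or a quoted '\u3131':
    # drop any leading backslash, then add one back for printf to expand.
    printf "\\${1#\\}" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
}

sur () {
    echo "$(to_octal "$1")-$(to_octal "$2")"
}

sur "$1" "$2"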

To use the script, make sure it is executable and in your PATH. Here is an example of how to use the functions; I simply copied and pasted them into a bash shell:

$ to_octal() {
>     printf "\\${1#\\}" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
> }
$
$ sur () {
>     echo "$(to_octal "$1")-$(to_octal "$2")"
> }
$
$ sur \u3131 \uD79D
\o343\o204\o261-\o355\o236\o235
$ sur \u452 \u490
\o321\o222-\o322\o220
$
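
To feed such a range to sed from a script, it can be passed in a variable and spliced into a double-quoted sed expression. The sketch below assumes GNU sed (for the `\oNNN` escapes); the wrapper name `delete_range_lines` and the sample file are hypothetical, and the range is hard-coded here so the block runs without the helper functions.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the answer): delete every line of FILE
# that contains a character inside the sed bracket range RANGE.
delete_range_lines() {
    local range=$1 file=$2
    # Double quotes let the shell expand $range; GNU sed then decodes the
    # \oNNN escapes. C.UTF-8 asks sed to compare whole code points.
    LC_ALL=C.UTF-8 sed "/[$range]/d" "$file"
}

# Small sample file so the sketch runs on its own.
sample=$(mktemp)
printf '%s\n' '1' '안녕하세요' 'Hello' > "$sample"

# Range copied from the sur \u3131 \uD79D output shown above.
delete_range_lines '\o343\o204\o261-\o355\o236\o235' "$sample"
rm -f "$sample"
```

With the helper functions loaded, the call would presumably look like `delete_range_lines "$(sur \u3131 \uD79D)" text.txt`.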
