Why doesn't this sed expression remove lines with Korean as expected?
I combined these two to produce this sed command:
sed '/[\u3131-\uD79D]/d' text.txt # Remove all lines with Korean characters
But it outputs only the lines with Korean characters:
$ cat text.txt
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the
$ sed '/[\u3131-\uD79D]/d' text.txt # Korean characters pattern fails
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
$ sed '/Hello/d' text.txt # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
$ sed '/[0-9]/d' text.txt # Simple range works
안녕하세요 오늘은 버터플라이 가드를 하고 있는 상대에게
Hello, today we're going to explain how to use the
$ sed --version # Git Bash for Windows 2.33.0.windows.2
sed (GNU sed) 4.8
Is this a bug in sed? I can use the equivalent command successfully in gVim:
:g/[\u3131-\uD79D]/d
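Because sed follows POSIX, this has to do with the collation order used inside bracket expressions. You need a collation order that sorts by numeric Unicode code point, which is what C.UTF-8 gives you, and then you need to encode the range endpoints in UTF-8. There is a detailed explanation here.
Here is how to apply that to your range in a bash shell (I tested it on Linux):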
$ # first get octal representation of range unicode code points
$ # iconv is to convert to utf-8 in case your locale is not utf-8
$ printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1
343 204 261 355 236 235
$ # format it as a sed range
$ printf '\o%s\o%s\o%s-\o%s\o%s\o%s' $(printf "\u3131\uD79D" | iconv -t utf-8 | od -An -to1); echo
\o343\o204\o261-\o355\o236\o235
$ # use the range in sed
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
...
$
This is the output:
$ LC_ALL=C.UTF-8 sed '/[\o343\o204\o261-\o355\o236\o235]/d' text.txt
1
00:00:00,000 --> 00:00:05,410
Hello, today we're going to explain how to use the
$ sed '/[\u3131-\uD79D]/d' text.txt # Korean characters pattern fails
$ sed '/Hello/d' text.txt # Simple pattern works
1
00:00:00,000 --> 00:00:05,410
$ sed '/[0-9]/d' text.txt # Simple range works
Hello, today we're going to explain how to use the
$
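As a side note on why the original command kept only the Korean line: as far as I can tell, sed does not expand `\u` inside a bracket expression, so `[\u3131-\uD79D]` is read as a set of literal characters (`\`, `u`, `3`, `1`, `D`, `7`, `9`) plus the byte range `1-\`, which covers the digits 1 to 9 and all uppercase ASCII letters. The sketch below uses made-up sample input and forces `LC_ALL=C` so the range is compared byte by byte; it is an illustration of that reading, not something taken from the question.

```shell
#!/bin/sh
# Illustrative input (an assumption, not the original text.txt):
#   "Hello" contains H, which falls inside the literal range 1-\
#   "1"     is the start of that range
#   "abc"   contains none of the characters in the set
out=$(printf 'Hello\n1\nabc\n' | LC_ALL=C sed '/[\u3131-\uD79D]/d')

# Only the line with no character from the set survives.
echo "$out"
```

That would also explain why every line containing a digit or an uppercase letter disappeared from the output in the question, while the Korean line, which has neither, was kept.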
Edit: helper script/functions
This bash script, or its functions, can be used to get a sed unicode range:
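#!/bin/bash
# sur - sed unicode range
#
# Converts a unicode range into an octal utf-8 range suitable for sed
#
# Usage:
# sur \u452 \u490
#
# sur \u3131 \uD79D

# Print the UTF-8 bytes of one \uXXXX escape as \oNNN sequences.
# An unquoted \u3131 reaches the script as u3131, so any leading backslash
# is stripped and exactly one is put back before printf expands the escape.
to_octal() {
  printf "\\${1#\\}" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
}

# Join the two endpoints as START-END, ready for a bracket expression.
sur () {
  echo "$(to_octal "$1")-$(to_octal "$2")"
}

sur "$1" "$2"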
To use the script, make sure it is executable and in your PATH. Here is an example of how to use the functions. I just copied and pasted them into a bash shell:
$ to_octal() {
> printf "\\${1#\\}" | iconv -t utf-8 | od -An -to1 | sed 's/ \([0-9][0-9]*\)/\\o\1/g'
> }
$
$ sur () {
> echo "$(to_octal "$1")-$(to_octal "$2")"
> }
$
$ sur \u3131 \uD79D
\o343\o204\o261-\o355\o236\o235
$ sur \u452 \u490
\o321\o222-\o322\o220
$
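If you want to cross-check those byte values without depending on the locale or on iconv, the UTF-8 encoding can be computed with plain shell arithmetic. This is only a sketch: `cp_to_sed_octal` is a hypothetical helper name of my own, it takes the code point as a number (for example `0x3131`) rather than as a `\u` escape, and it does not validate its input.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): encode one Unicode
# code point as the \oNNN byte escapes used in the sed range.
cp_to_sed_octal() {
  local cp=$(( $1 ))   # accepts hex such as 0x3131, or decimal
  local -a bytes
  if (( cp < 0x80 )); then
    # 1 byte: plain ASCII
    bytes=( "$cp" )
  elif (( cp < 0x800 )); then
    # 2 bytes: 110xxxxx 10xxxxxx
    bytes=( $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) )) )
  elif (( cp < 0x10000 )); then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    bytes=( $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    bytes=( $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
  fi
  # printf reuses the format for every byte, giving \oNNN\oNNN...
  printf '\\o%o' "${bytes[@]}"
  printf '\n'
}

# Build the same bracket expression as above.
printf '[%s-%s]\n' "$(cp_to_sed_octal 0x3131)" "$(cp_to_sed_octal 0xD79D)"
# prints [\o343\o204\o261-\o355\o236\o235]
```

The result matches what `sur \u3131 \uD79D` printed, so either route gives the same range to put after `LC_ALL=C.UTF-8 sed`.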