gsed does not recognize SHIFT_JIS characters
I am writing a program that uses gsed to extract multibyte characters from a CSV file.
It works for CSV files encoded as UTF-8, but not for CSV files encoded as SHIFT_JIS.
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8 | gsed -r 's/"(.*)","(.*)"/\1 \2/'
こんにちは hello%
test % cat sjis_sample.csv | gsed -r 's/"(.*)","(.*)"/\1 \2/' | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
LINE 1:
Print the file after converting it from SHIFT_JIS to UTF-8
LINE 2:
Extract the text contents from the CSV file after converting the encoding from SHIFT_JIS to UTF-8
-> Works well
LINE 3:
Extract the text contents from the CSV file without converting the encoding first
-> It seems that `gsed` failed to match the pattern, so the line is printed unchanged.
Does anyone know how to use gsed with SHIFT_JIS-encoded files?
Thank you.
% gsed --version
gsed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.
This sed program was built without SELinux support.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
test % locale
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Solved
Thanks to @KamilCuk
GNU sed is locale-aware. If you want to work with raw bytes (i.e., you can check which bytes represent " in Shift_JIS and feed those to sed), use:
LC_ALL=C sed ....
I set LANG, rather than LC_ALL, to C, because I could not set LC_ALL to C.
test % cat sjis_convert.sh
#!/bin/bash
LANG=C
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
test % ./sjis_convert.sh
こんにちは hello%
Appendix
I could not set LC_ALL to C.
test % cat sjis_convert.sh
#!/bin/bash
LC_ALL=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
"こんにちは","hello"
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Instead, I set LANG to C, and it worked.
test % cat ./sjis_convert.sh
#!/bin/bash
LANG=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
こんにちは hello
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
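A likely explanation for the difference between the two runs above, offered as a sketch rather than a confirmed diagnosis: in bash, a plain assignment such as `LC_ALL=C` creates a shell variable that child processes do not inherit unless it is exported. `LANG` was already in the environment (the `locale` output in the question shows it set), so assigning to it changes what `locale` and `gsed` see. The snippet below shows the effect with a child `sh`; the names `before` and `after` are illustrative only.

```bash
#!/bin/bash
# Start from the situation in the question: LC_ALL is not in the environment.
unset LC_ALL

# Plain assignment: sets a shell variable, but does not export it.
LC_ALL=C
before=$(sh -c 'printf "%s" "${LC_ALL:-unset}"')

# After export, child processes inherit the value.
export LC_ALL
after=$(sh -c 'printf "%s" "${LC_ALL:-unset}"')

echo "child sees LC_ALL without export: $before"
echo "child sees LC_ALL with export:    $after"
```

Under that assumption, either writing `export LC_ALL=C` in the script or prefixing only the one command, as in `LC_ALL=C gsed -r '...'`, should make the LC_ALL version behave like the LANG=C version.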
GNU sed is locale-aware. If you want to work with raw bytes (i.e., you can check which bytes represent " in Shift_JIS and feed those to sed), use:
LC_ALL=C sed ....
If you want to work with UTF-8, set a UTF-8 locale, which is most likely your default:
LC_ALL=en_US.UTF-8 sed ...
If you want to work with any other locale, tell sed:
LC_ALL=ja_JP.Shift_JIS sed ...
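The raw-byte option can be sketched as follows. Under a UTF-8 locale, Shift_JIS bytes such as 0x82 0xB1 are not valid UTF-8, and GNU sed's `.` does not match invalid multibyte sequences, which is why the pattern in the question fails; in the C locale every byte counts as one character. The sketch builds the sample line from octal escapes instead of reading `sjis_sample.csv`, uses a hypothetical helper `sample`, and assumes `sed` is GNU sed (installed as `gsed` in the question).

```bash
#!/bin/bash
# Emit the sample line as raw Shift_JIS bytes: "こんにちは","hello"
# (octal escapes, so no sample file is needed).
sample() {
  printf '"\202\261\202\361\202\311\202\277\202\315","hello"\n'
}

# LC_ALL=C applies to this one sed invocation only. Every byte is then a
# single character, so .* can span the two-byte Shift_JIS characters.
result=$(sample | LC_ALL=C sed -E 's/"(.*)","(.*)"/\1 \2/')

# The captured bytes are still Shift_JIS; convert them only for display.
if command -v iconv >/dev/null 2>&1; then
  printf '%s\n' "$result" | iconv -f SHIFT_JIS -t UTF-8
fi
```

This particular pattern should be safe in byte mode because the ASCII bytes for `"` and `,` fall outside the Shift_JIS trail-byte range (0x40-0x7E, 0x80-0xFC). A pattern that matches a backslash or ASCII letters could also match the second byte of a two-byte character, so it needs more care.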