How to split a string on a multi-character delimiter in bash?
Why doesn't the following bash code work?
for i in $( echo "emmbbmmaaddsb" | split -t "mm" )
do
echo "$i"
done
Expected output:
e
bb
aaddsb
Since you're expecting newlines, you can simply replace all instances of mm in your string with a newline. In pure native bash:
in='emmbbmmaaddsb'
sep='mm'
printf '%s\n' "${in//$sep/$'\n'}"
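If the pieces are needed as separate array elements rather than as printed lines, the same substitution can feed mapfile. This is only a minimal sketch, assuming bash 4 or later and an input that contains no newlines of its own; the names parts and nl are chosen here for illustration and are not part of the original answer.

```bash
#!/bin/bash
in='emmbbmmaaddsb'
sep='mm'
nl=$'\n'

# Replace every literal occurrence of $sep with a newline, then read the
# resulting lines into the array "parts" (one element per line; -t strips
# the trailing newline from each element).
mapfile -t parts < <(printf '%s\n' "${in//"$sep"/$nl}")

declare -p parts
```

Quoting "$sep" inside the expansion makes bash treat the delimiter as a literal string rather than as a glob pattern.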
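If you want to do this kind of replacement on a longer input stream, you may be better off using awk, because bash's built-in string manipulation doesn't scale well beyond a few kilobytes of content. The gsub_literal shell function (which backends into awk) given in BashFAQ #21 is applicable: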
# Taken from http://mywiki.wooledge.org/BashFAQ/021
# usage: gsub_literal STR REP
# replaces all instances of STR with REP. reads from stdin and writes to stdout.
gsub_literal() {
  # STR cannot be empty
  [[ $1 ]] || return

  # string manip needed to escape '\'s, so awk doesn't expand '\n' and such
  awk -v str="${1//\\/\\\\}" -v rep="${2//\\/\\\\}" '
    # get the length of the search string
    BEGIN {
      len = length(str);
    }

    {
      # empty the output string
      out = "";

      # continue looping while the search string is in the line
      while (i = index($0, str)) {
        # append everything up to the search string, and the replacement string
        out = out substr($0, 1, i-1) rep;

        # remove everything up to and including the first instance of the
        # search string from the line
        $0 = substr($0, i + len);
      }

      # append whatever is left
      out = out $0;

      print out;
    }
  '
}
...used, in this context, as:
gsub_literal "mm" $'\n' <your-input-file.txt >your-output-file.txt
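The recommended tool for character substitution is sed, with its command s/regexp/replacement/ for a single occurrence of the regexp, or s/regexp/replacement/g for a global replacement; you don't even need a loop or variables.
Pipe your echo output into sed and replace the characters mm with the newline character \n (GNU sed interprets \n in the replacement as a newline; other sed implementations may not):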
echo "emmbbmmaaddsb" | sed 's/mm/\n/g'
The output is:
e
bb
aaddsb
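A more general example, which does not replace the multi-character delimiter with a single-character one, is given below.
Using parameter expansion (from a comment by @gniourf_gniourf):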
#!/bin/bash
str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=();
while [[ $s ]]; do
array+=( "${s%%"$delimiter"*}" );
s=${s#*"$delimiter"};
done;
declare -p array
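The loop above can be wrapped in a reusable function. The following is a sketch under stated assumptions: split_on and the global array fields are hypothetical names introduced here, not part of the original comment, and the guard against an empty delimiter is an addition (without it the loop would never terminate).

```bash
#!/bin/bash
# split_on DELIM STRING: hypothetical helper that fills the global array
# "fields" with the pieces of STRING separated by the literal string DELIM.
split_on() {
  [[ $1 ]] || return 1          # an empty delimiter would loop forever
  local delimiter=$1
  local s=$2$1                  # append DELIM so the last piece is terminated
  fields=()
  while [[ $s ]]; do
    fields+=( "${s%%"$delimiter"*}" )   # text before the first delimiter
    s=${s#*"$delimiter"}                # drop that text and the delimiter
  done
}

split_on 'ABC' 'LearnABCtoABCSplitABCaABCString'
declare -p fields
```

Because the delimiter is quoted inside the expansions, it is matched literally, so characters such as * or ? in the delimiter are not treated as glob patterns.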
A cruder, brute-force way:
#!/bin/bash
# main string
str="LearnABCtoABCSplitABCaABCString"
# delimiter string
delimiter="ABC"
#length of main string
strLen=${#str}
#length of delimiter string
dLen=${#delimiter}
#iterator for length of string
i=0
#length tracker for ongoing substring
wordLen=0
#starting position for ongoing substring
strP=0
array=()
while [ $i -lt $strLen ]; do
if [ "$delimiter" == "${str:$i:$dLen}" ]; then
array+=("${str:strP:wordLen}")
strP=$(( i + dLen ))
wordLen=0
i=$(( i + dLen ))
fi
i=$(( i + 1 ))
wordLen=$(( wordLen + 1 ))
done
array+=("${str:strP:wordLen}")
declare -p array
Reference: Bash Tutorial - Bash Split String
With awk, you can use gsub to replace all regex matches.
As in your question, to replace every substring of two or more 'm' characters with a newline, run:
echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, "\n" ); print; }'
e
bb
aaddsb
The "g" in gsub() stands for "global", meaning it replaces every match.
You can also ask to print just the Nth field, for example:
echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, " " ); print $2; }'
bb
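Since awk already splits each line into fields, another option is to hand it the delimiter as the field separator instead of rewriting the line with gsub. This is a sketch, not part of the original answer; the variable name result is an assumption made here for illustration.

```bash
#!/bin/bash
# Use the delimiter as awk's field separator and print one field per line.
# A field separator longer than one character is interpreted as a regular
# expression, so a delimiter containing regex metacharacters needs escaping.
result=$(echo "emmbbmmaaddsb" | awk -F'mm' '{ for (i = 1; i <= NF; i++) print $i }')

printf '%s\n' "$result"
```

Printing a single field, as in the example above, then reduces to print $2 with the same -F option.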