Wrapping delimited lines, retaining first column, with minimum final length
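I want to split long content lines while retaining the header (the first column).
I do a lot of text processing, and I like using Unix one-liners because they are easy for me to organize over time (as opposed to a pile of scripts), I can easily chain them together, and I enjoy (re)learning how to use the classic Unix tools. Usually I use a short awk, perl, or ruby one-liner, whichever is most elegant.
Here I have X lines of comma-delimited items. I want to split them up, retaining the headword.
Input: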
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
Output:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
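Algorithm details:
- An input line consists of a headword, then an equals sign, then a comma-separated list of at least 1 item.
- In this example most items are single words, but an item can contain spaces (e.g. "horseshoe crab" at the end).
- Split at 9 items, unless <3 items would be left over, in which case the final split may yield 12 on one line.
- There are multiple lines; for example, the next line might be planets.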
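The splitting rule can be worked through on item counts alone. The following Ruby sketch is only an illustration: the method name line_sizes and the parameters max and min are hypothetical, and it assumes the literal reading of the rule, namely that a tail of fewer than min items is folded into the previous line.

```ruby
# Hypothetical helper: how many items end up on each output line.
# Assumption: a final group smaller than `min` is merged into the line before it.
def line_sizes(count, max = 9, min = 3)
  full, rest = count.divmod(max)
  sizes = Array.new(full, max)
  if rest.positive?
    if rest < min && !sizes.empty?
      sizes[-1] += rest   # short tail: extend the previous line
    else
      sizes << rest       # tail is long enough (or it is the only line)
    end
  end
  sizes
end

p line_sizes(50)  # => [9, 9, 9, 9, 9, 5]
p line_sizes(11)  # => [11]
p line_sizes(13)  # => [9, 4]
```

Under this reading the sample line of 50 animals gives five full lines and a final line of five items, which agrees with the expected output shown above.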
I thought of escaping the spaces, then using Unix fold, then awk to fill down the first column. This produces exactly the output above:
echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _' '=, '
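But it only considers character length (not the number of items), and it does not handle the special case where I don't want <3 items dangling on the last line.
I think this is an elegant little puzzle; any ideas?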
You may consider this awk:
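awk 'BEGIN {FS=OFS=" = "} {
   s = $2                      # the comma-separated list after " = "
   while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
      v = substr(s, RSTART, RLENGTH)
      sub(/, $/, "", v)        # drop the trailing ", " of the chunk
      print $1, v              # headword, " = ", chunk
      s = substr(s, RLENGTH+1)
   }
}' file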
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
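Pay particular attention to the regular expression used here, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/, which matches 1 to 9 words, each followed by the ", " delimiter. The regex also has an optional part that matches 3 more words running up to the end of the line, which is what keeps a short remainder on the current line.
With your shown samples only, please try the following awk program. Written and tested in GNU awk; it should work in any awk.
In it I have created an awk variable named numberOfFields, which holds the number of fields you want to print per line (a new line is started after that many fields, as in the shown samples).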
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
OR: the code above has the printf statement split across 2 lines (for readability); if you want it on a single line, try the following:
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
Explanation: adding a detailed explanation for the above.
awk -v numberOfFields="9" ' ##Starting awk program from here, creating variable named numberOfFields and setting its value to 9 here.
BEGIN{ ##Starting BEGIN section of awk here.
FS=", ";OFS=", " ##Setting FS and OFS to comma space here.
}
{
line=$0 ##Saving the whole record ($0) in line here.
sub(/ = .*/,"",line) ##Removing everything from space = space to the end of line, which leaves only the headword in line.
sub(/^[^ ]* =[^ ]* /,"") ##Removing the headword and the space = space after it from the current record, so only the items remain.
for(i=1;i<=NF;i++){ ##Running a for loop over all fields.
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i)))) ##The printf conditions are explained below the code.
}
}
END{ ##Starting END block of this program from here.
print "" ##Printing newline here.
}
' Input_file ##Mentioning Input_file name here.
Explanation of the printf conditions above:
(
i%numberOfFields==0 ##Checking if the modulus i%numberOfFields is 0; if this is TRUE:
?OFS $i ORS line" = " ##then print OFS $i ORS line" = " (comma-space, the field value, a newline, the line variable, and space = space).
:(i==1 ##If the first condition is FALSE, check whether i==1:
?line " = " $i ##then print the line variable followed by space = space followed by $i.
:(i%numberOfFields>1?OFS $i:$i) ##Otherwise, if i%numberOfFields is greater than 1 print OFS $i, else print just $i.
)
)
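The nested ternary can be easier to follow when unrolled into explicit branches. The Ruby sketch below is a hypothetical restatement, not part of the answer: the method name piece and its parameters are made up for illustration, and it hard-codes ", " and "\n" where the awk code uses OFS and ORS.

```ruby
# Hypothetical restatement of the printf ternary: returns the text emitted
# for field number i (1-based), given `per_line` fields per output line.
def piece(i, per_line, field, head)
  if (i % per_line).zero?
    ", #{field}\n#{head} = "   # last field of a line: also start the next line
  elsif i == 1
    "#{head} = #{field}"       # very first field: print the headword first
  elsif i % per_line > 1
    ", #{field}"               # middle of a line: separator, then the field
  else
    field                      # first field of a later line: headword already printed
  end
end

fields = %w[a b c d e f g]
puts fields.each_with_index.map { |f, k| piece(k + 1, 3, f, "x") }.join
# prints:
# x = a, b, c
# x = d, e, f
# x = g
```

Note that in this sketch a field count that is an exact multiple of per_line leaves a trailing headword prefix with no items after it, because the first branch always opens the next line.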
One awk idea:
awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
if ( i > max ) print newline
newline=$1 "="
pfx=""
}
newline=newline pfx $i
pfx=","
}
print newline
}
' raw.dat
Sample data:
$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
With -v min=3 -v max=9 we get:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
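Addressing OP's comment about using one-liners ...
While this awk script could certainly be crammed into a one-liner, I'm guessing OP would a) find it hard to edit/maintain and b) not enjoy having to (re)type it over and over.
One (obvious?) idea is to wrap the awk code in a function, e.g.: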
# usage: splitme MIN MAX [FILE]
splitme() {
    awk -F'[=,]' -v min="$1" -v max="$2" '
    { for (i=2; i<=NF; i++) {
          if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
             if ( i > max ) print newline
             newline=$1 "="
             pfx=""
          }
          newline=newline pfx $i
          pfx=","
      }
      print newline
    }' "${3:--}"
}
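Notes:
- parameterizes the min and max values so they are taken from the command line
- parameterizes the file reference so it is taken from the command line ($3) or from stdin (-)
- OP can add more logic to verify/validate the input parameters as needed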
Whether called standalone against a file:
$ splitme 3 9 raw.dat
or called in a pipeline:
$ cat raw.dat | splitme 3 9
both produce:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
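Here are two Ruby solutions for processing a single line. The variable str holds one line (in the example, the line beginning with 'animals = ...').
#1 Use a regular expression
RGX = /\A\w+| *= *|(?:[^,]+, *){0,10}[^,]+\z|(?:[^,]+, *){9}/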
def break_line(str)
headword, _, *lines = str.scan(RGX)
lines.each { |line| puts "#{headword} = #{line.sub(/, *\z/, '')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
The regular expression can be written in free-spacing mode to make it self-documenting.
RGX =
/
\A # match beginning of string
\w+ # match one or more word chars (e.g., "animals")
| # or
[ ]*=[ ]* # "=" preceded and followed by zero or more spaces
| # or
(?: # begin a non-capture group
[^,]+ # match one or more chars other than a comma
,[ ]* # match a comma and zero or more spaces
){0,10} # end non-capture group and execute 0-10 times
[^,]+ # match one or more chars other than a comma
\z # match end of string
| # or
(?: # begin a non-capture group
[^,]+ # match one or more chars other than a comma
,[ ]* # match a comma and zero or more spaces
){9} # end non-capture group and execute exactly 9 times
/x # invoke free-spacing regex definition mode
When this is executed for the example str, we find the following.
headword
#=> "animals"
_
#=> " = "
lines
#=> ["lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, ",
"goose, horse, mouse, pig, dog, frog, bug, fish, duck, ",
"camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, ",
"deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, ",
"elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, ",
"shark, salmon, shrimp, mosquito, horseshoe crab"]
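Ruby has a convention of using the variable _ when its value is not subsequently used in the calculation. This is mainly to so inform the reader.
#2 Extract and group the words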
def break_line(str)
headword, *words = str.split(/ *[,=] */)
groups = words.each_slice(9).to_a
if groups.size > 1 && groups[-1].size < 3 # merge only when a previous group exists
groups[-2] += groups[-1]
groups.pop
end
groups.each { |group| puts "#{headword} = #{group.join(', ')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
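The 50-item sample leaves a tail of five items, so the merge branch in solution #2 never fires for it. The sketch below is a hypothetical generalization, not the answer's code: group_words, max and min are assumed names, and the words are placeholders rather than the animal list.

```ruby
# Hypothetical generalization of solution #2 with the two limits as parameters.
def group_words(words, max = 9, min = 3)
  groups = words.each_slice(max).to_a
  # Merge the tail only when there is a previous group to merge it into.
  if groups.size > 1 && groups[-1].size < min
    tail = groups.pop
    groups[-1] += tail
  end
  groups
end

words = (1..50).map { |n| "w#{n}" }          # placeholder items
p group_words(words).map(&:size)             # => [9, 9, 9, 9, 9, 5]
p group_words(words, 9, 6).map(&:size)       # => [9, 9, 9, 9, 14]
p group_words(words.first(11)).map(&:size)   # => [11]
```

With the default minimum of 3 the longest possible last line is 11 items (9 + 2); raising the minimum to 6 in this sketch lets the five-item tail be absorbed, giving a last line of 14.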
By way of partial explanation, for the example we obtain the following:
headword
#=> "animals"
words
#=> ["lizard", "bird", ..., "horseshoe crab"]
groups
#=> [["lizard", "bird", "bee", "snake", "whale", "eagle",
"beetle", "mule", "hare"],
["goose", "horse", "mouse", "pig", "dog", "frog",
"bug", "fish", "duck"],
["camel", "squirrel", "owl", "chicken", "pigeon", "lion",
"sheep", "bear", "spider"],
["deer", "tiger", "lobster", "dinosaur", "cat", "goat",
"rat", "cricket", "rabbit"],
["elephant", "crow", "fox", "donkey", "monkey", "butterfly",
"crab", "leopard", "moth"],
["shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]]
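Since the last element of groups contains more than two elements (it contains five), groups is not subsequently modified. If the last line were allowed to contain up to 14 (rather than 11) elements, the last element of groups would be changed to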
["elephant", "crow", "fox", "donkey", "monkey", "butterfly", "crab",
"leopard", "moth", "shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]
Using Perl, one way:
perl -wnE'
($head, @items) = split /\s*[,=]\s*/;
while (@items) {
@elems = splice @items, 0, 9;
if (@elems < 3) { $lines[-1] .= ", " . join ", ", @elems }
else { push @lines, join ", ", @elems }
}
say "$head = $_" for @lines; @lines = ()
' file
or
perl -wnE'
($head, @items) = split /\s*[,=]\s*/;
push @lines, join ", ", splice @items, 0, 9 while @items;
$lines[-2] .= ", " . pop @lines if 2 > $lines[-1] =~ tr/,//;
say "$head = $_" for @lines; @lines = ()
' file
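Shown over multiple lines for readability; it can be copy-pasted into a bash terminal as is, but it can also be entered on a single line. Tested with an added line of 11 (9+2) items.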
awk -F"[=,]" -v max="9" '{
for(i=2; i<=NF; i+=max){
row = ""
for(j=i; j<=i+max-1; j++){
row=row $(j) ","
}
gsub(/,+$/, "", row)
printf "%s=%s \n", $1, row
}
}' input_file
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers = 10, 11, 12, 13, 14, 15, 16
cars = mercedes benz, bmw, audi, vw, porsche, seat, skoda, opel, renault
cars = mazda, toyota, honda
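I spent some time modifying my solution so that it works on both gawk and mawk, by performing the equivalent of $1=$1 at the tail end of the regex chain. $(NF!=NF=NF) expands internally to NF != (NF=NF), which is always false, so the whole thing simply means $0, but with $1=$1 embedded in it: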
input ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
2 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
command ::
[mg]awk '
BEGIN {
FS = (OFS = " = ") "*"
_=__ = (___="[^,]+")"[,]"
gsub(".",_,__)
__ = (__)_ "(("_")?("_")?"___"$)?"
_ = ORS } gsub(__,"&"_ OFS)+gsub("[,]"_,_)+sub((_)"+([^,]*)$","", $(NF!=NF=NF))'
output (mawk 1.3.4) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
output (gawk 5.1.1) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX