Split file by context and size in bash
我有一组大文件必须拆分成 100MB 的部分。我 运行 遇到的问题是行由 ^B ASCII(或 \u002)字符终止。
因此,我需要能够获得 100MB 的部分(显然加上或减去几个字节),这也说明了行尾。
示例文件:
000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B
"line" 的大小可以变化。
我知道 split 和 csplit,但无法将两者结合起来。
#!/bin/bash
split -b 100m filename                            # splitting by size
csplit filename "/$(echo -e "\u002")/+1" "{*}"    # splitting by context
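Any suggestions on how to produce 100MB chunks that keep the lines intact? As a side note, I cannot change the line endings to \n, because that would corrupt the file: the data between the ^B characters must keep its newlines (if present).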
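The following implements your splitting logic in native bash. It is not very fast to execute, but it will work anywhere bash can be installed, with no third-party tools needed to run it: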
#!/bin/bash
prefix=${1:-"out."}                       # first optional argument: output file prefix
max_size=${2:-$(( 1024 * 1024 * 100 ))}   # second optional argument: size in bytes
LC_ALL=C                                  # make ${#piece} count bytes, not characters
cur_size=0                                # running count: size of current chunk
file_num=1                                # current numeric suffix; starting at 1
exec >"$prefix$file_num"                  # open first output file
while IFS= read -r -d $'\x02' piece; do   # as long as there's new input...
  printf '%s\x02' "$piece"                # write it to our current output file
  cur_size=$(( cur_size + ${#piece} + 1 ))  # add its length (plus delimiter) to our counter
  if (( cur_size > max_size )); then      # if our counter is over our maximum size...
    (( ++file_num ))                      # increment the file counter
    exec >"$prefix$file_num"              # open a new output file
    cur_size=0                            # and reset the output size counter
  fi
done
if [[ $piece ]]; then   # if the end of input had content without a \x02 after it...
  printf '%s' "$piece"  # ...write that trailing content to our output file.
fi
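To see how the size threshold behaves, here is a small sketch that wraps the same loop in a function and runs it with a tiny limit. The function name split_on_stx, the use of file descriptor 3 instead of stdout, and the temporary directory are assumptions made for illustration only; they are not part of the script above.

```bash
#!/bin/bash
# Hypothetical wrapper around the loop above: reads stdin and writes
# "$prefix"1, "$prefix"2, ... Note that ${#piece} counts characters in the
# current locale; the demo input is plain ASCII, so characters equal bytes.
split_on_stx() {
  local prefix=$1 max_size=$2 cur_size=0 file_num=1 piece
  exec 3>"$prefix$file_num"                  # open first output file on fd 3
  while IFS= read -r -d $'\x02' piece; do
    printf '%s\x02' "$piece" >&3             # write the record plus its delimiter
    cur_size=$(( cur_size + ${#piece} + 1 ))
    if (( cur_size > max_size )); then       # chunk is over the limit: start a new one
      (( ++file_num ))
      exec 3>"$prefix$file_num"
      cur_size=0
    fi
  done
  if [[ $piece ]]; then printf '%s' "$piece" >&3; fi   # trailing data without a delimiter
  exec 3>&-
}

demo_dir=$(mktemp -d)
# Four records of 4, 2, 6 and 1 bytes, with a limit of 5 bytes per chunk.
printf 'aaaa\002bb\002cccccc\002d\002' | split_on_stx "$demo_dir/out." 5
wc -c "$demo_dir"/out.*
```

With this input the first chunk ends up with two records (8 bytes), because a chunk is only closed after the record that pushes it past the limit. Chunks can therefore exceed max_size by up to one record, and if the very last record is the one that crosses the limit, an empty final file is left behind.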
A version that relies on dd (the GNU version here; it can be changed to be portable), but which should be much faster for large inputs:
#!/bin/bash
prefix=${1:-"out."}        # first optional argument: output file prefix
file_num=1                 # current numeric suffix; starting at 1
exec >"$prefix$file_num"   # open first output file
while true; do
  # tell GNU dd to copy 100MB from stdin to stdout; iflag=fullblock keeps
  # short reads (e.g. from a pipe) from shrinking the amount copied
  dd bs=1M count=100 iflag=fullblock
  if IFS= read -r -d $'\x02' piece; then   # read in bash to the next boundary
    printf '%s\x02' "$piece"               # write that segment to stdout
    exec >"$prefix$((++file_num))"         # re-open stdout to point to the next file
  else
    [[ $piece ]] && printf '%s' "$piece"   # write what's left after the last boundary
    break                                  # and stop
  fi
done
# if our last file is empty, delete it.
[[ -s $prefix$file_num ]] || rm -f -- "$prefix$file_num"
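Since the data is sensitive to corruption, it may be worth checking the result of either script before deleting the original. The following is a minimal sketch, not part of the answer's scripts: the helper name verify_chunks is hypothetical, and it assumes the chunks are named with the prefix followed by 1, 2, 3, ... as above.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if the chunks "$prefix"1, "$prefix"2, ...
# end on the \x02 delimiter and reassemble byte-for-byte into the original.
verify_chunks() {
  local original=$1 prefix=$2 n=1 i last_byte files=()
  while [[ -e $prefix$n ]]; do       # collect chunks in numeric order (9 before 10)
    files+=("$prefix$n")
    (( ++n ))
  done
  (( ${#files[@]} > 0 )) || return 1
  # every chunk except the last must end on the \x02 delimiter
  for (( i = 0; i < ${#files[@]} - 1; i++ )); do
    last_byte=$(tail -c 1 "${files[i]}" | od -An -tx1 | tr -d ' \n')
    [[ $last_byte == 02 ]] || return 1
  done
  # the concatenation must match the original exactly
  cat -- "${files[@]}" | cmp -s - "$original"
}
```

It could be called as `verify_chunks filename out. && echo "chunks OK"`. Walking the suffixes numerically instead of using a glob such as out.* avoids the lexical ordering problem where out.10 sorts before out.2.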