Splitting a text file using a separate file of line numbers

Question

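I have a text file containing several (~80) OneNote pages, all concatenated together, and I'm trying to split it into one file per page. I'm trying to do this by the line numbers of the page titles, since the pages vary in length. I've managed to extract the line numbers into a separate file, but I can't figure out how to split with them. For example: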

Log.txt:

Tuning             //Page Title
09 November 2016   //Date
23:19              //Time
 
Content text...    //Page Content
 
Week 46            //Another title, want to split here
14 November 2016
13:47
 
Text..
More text...       //Content can be over multiple lines

Week 47            //Another title, want to split here
22 November 2016
11:15

Text
etc...

And the line numbers, in a separate file, Lines.txt:

1
7
14

The expected output for this example would be three files, each running from a page title down to the last line before the next page title.

log1.txt log2.txt log3.txt

$ cat log1.txt
Tuning             
09 November 2016
23:19

Content text...

$

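I've found lots of answers about splitting into fixed-size chunks (e.g. every 50 lines), which won't work here because the sections vary in length. Most answers about splitting at specific line numbers deal with only a few line numbers that can be hard-coded, e.g. using the head or tail commands.

One answer comes very close to what I'm looking for, but its list of line numbers to split at is small enough to be written directly into the command. I don't know how to use a file of line numbers instead of writing them out as a string such as "1 7 14".

I'm using bash on macOS and am quite new to working at this level on the command line, with no real experience of grep, sed, awk and so on, so I'm having trouble generalizing other answers to this particular case.

PS: If necessary, I can include the code I used to get the line numbers, though I'm sure it's far from optimal. (It involves using a regular expression to find the line numbers of the timestamps, then stripping the matched text and subtracting 2 from each to get the line number of the page title.)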
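
For reference, the approach described in the PS might look roughly like the sketch below. It assumes the time line holds nothing but HH:MM and always sits two lines below the title, as in the sample; title_line_numbers and TIME_RE are names made up for illustration, and a content line that consists only of a time would be misread as a page header.

```python
import re

# Assumption: a time line contains only HH:MM, as in the sample Log.txt.
TIME_RE = re.compile(r"^\d{1,2}:\d{2}\s*$")


def title_line_numbers(lines):
    """Return the 1-based line numbers of page titles.

    Each title is assumed to sit two lines above a time line
    (title, date, time), matching the layout in the question.
    """
    numbers = []
    for lineno, line in enumerate(lines, start=1):
        if lineno > 2 and TIME_RE.match(line):
            numbers.append(lineno - 2)
    return numbers


sample = [
    "Tuning", "09 November 2016", "23:19", "", "Content text...", "",
    "Week 46", "14 November 2016", "13:47", "", "Text..", "More text...", "",
    "Week 47", "22 November 2016", "11:15", "", "Text", "etc...",
]
print(title_line_numbers(sample))  # prints [1, 7, 14]
```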

Answer 1

Bash and awk solution

# Assumption: You have a bash array named arr with the indices you want,
# like this
arr=( 1 7 14 )

counter=1

for ((i=0; i<${#arr[@]}-1; i++)); do
    # Get current index
    index="${arr[$i]}"
    # Get next index
    next_index="${arr[$i+1]}"

    awk "NR>=$index && NR<$next_index" file_to_chop.txt > "log${counter}.txt"

    (( counter++ ))
done

# If the array is non-empty, we also need to write the last set of lines
# to the last file (-gt 0, so that a single split point still yields a file)
[ "${#arr[@]}" -gt 0 ] && {
    # Get last element in the array
    index="${arr[${#arr[@]}-1]}"

    awk "NR>=$index" file_to_chop.txt > "log${counter}.txt"
}

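This script will not work in a strictly POSIX-compliant shell, because it uses several "bashisms", including arithmetic in (( )).

It works mainly by using awk's NR variable, which holds the current record number. The expression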

NR>=3

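for example, tells awk to perform the action (in our case, print) only for records (in our case, lines) whose record number is greater than or equal to 3. More complex boolean expressions involving NR can be built with &&, for example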

NR>=3 && NR<=7
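
For readers who have not used awk, the same record-number filter can be sketched in Python, the language of the second solution below. Here select_lines is a hypothetical helper, and the counter n from enumerate plays the role of NR; this is only an illustration of the condition, not what awk does internally.

```python
def select_lines(lines, first, last=None):
    """Keep lines whose 1-based number n satisfies first <= n <= last.

    Mirrors the awk condition NR>=first && NR<=last; last=None means
    "through the end of the input", like a bare NR>=first.
    """
    return [
        line
        for n, line in enumerate(lines, start=1)  # n plays the role of NR
        if n >= first and (last is None or n <= last)
    ]


lines = [f"line {i}" for i in range(1, 11)]
print(select_lines(lines, 3, 7))  # lines 3 through 7
print(select_lines(lines, 9))     # lines 9 and 10
```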

If you don't already have the indices in a bash array, you can build the array from a file like this:

arr=()
while read -r line; do arr+=( "$line" ); done < /path/to/your/file/here

Or, if you want to build the array from the output of a command:

arr=()
while read -r line; do arr+=( "$line" ); done < <(your_command_here)

Python solution

import sys


def write_lines(filename, lines):
    try:
        with open(filename, 'w') as f:
            # End every line, including the last, with a newline
            f.writelines(f'{line}\n' for line in lines)
    except OSError:
        print(f'Error: failed to write to "{filename}".', file=sys.stderr)
        sys.exit(1)


if len(sys.argv) != 2:
    print('Must pass path to input file.', file=sys.stderr)
    sys.exit(1)

input_file = sys.argv[1]
line_indices = [line.rstrip() for line in sys.stdin]

try:
    with open(input_file, 'r') as f:
        # Strip only the newline so trailing spaces in the content survive
        input_lines = [line.rstrip('\n') for line in f]
except OSError:
    print(f'Error: failed to read from "{input_file}".', file=sys.stderr)
    sys.exit(1)

counter = 1

while len(line_indices) > 1:
    index = int(line_indices.pop(0))
    next_index = int(line_indices[0])

    write_lines(f'log{counter}.txt', input_lines[index-1:next_index-1])

    counter += 1

if line_indices:
    index = int(line_indices[0])

    write_lines(f'log{counter}.txt', input_lines[index-1:])

Here is the usage, assuming you want to chop a file so that lines 1-6 go to log1.txt, lines 7-13 go to log2.txt, and lines 14 onward go to log3.txt:

printf '1\n7\n14\n' | python chop_file_script.py /path/to/file/to/chop

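The script learns how to divide the input file into separate files by reading stdin. This is by design, so that the desired line numbers can be fed to the script from a parent shell script through a pipe (as in the usage example above).

This is not a fully robust script. It does not handle cases such as:

line numbers in stdin that are not in ascending order
stdin containing non-numeric values
line numbers in stdin that exceed the length of the input file

I think it's acceptable for this script not to be fully robust, since it should work fine as long as it's used as intended.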
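
For anyone who does want those checks, here is a minimal sketch of a validation step that could run before the splitting loop. parse_line_numbers is a hypothetical helper, not part of the script above, and raising ValueError (rather than printing and exiting) is just one possible design.

```python
def parse_line_numbers(raw_values, total_lines):
    """Validate split points and return them as a list of ints.

    raw_values: strings as read from stdin, one candidate number each.
    total_lines: number of lines in the file being split.
    Raises ValueError for non-numeric, out-of-range, or non-ascending input.
    """
    numbers = []
    for raw in raw_values:
        text = raw.strip()
        if not text:
            continue  # tolerate blank lines
        try:
            number = int(text)
        except ValueError:
            raise ValueError(f"not a line number: {text!r}") from None
        if number < 1 or number > total_lines:
            raise ValueError(f"line {number} is outside 1..{total_lines}")
        if numbers and number <= numbers[-1]:
            raise ValueError(f"line {number} does not come after {numbers[-1]}")
        numbers.append(number)
    return numbers


print(parse_line_numbers(["1\n", "7\n", "14\n"], total_lines=19))
```

In the script above, this could be called as parse_line_numbers(sys.stdin, len(input_lines)) once the input file has been read.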

Answer 2

Would you please try this combination of bash and sed:

#!/bin/bash

mapfile -t lines < "lines.txt"                  # read "lines" file and assign array "lines"
for (( i = 0; i < ${#lines[@]}; i++ )); do      # loop over the array "lines"
    start=${lines[i]}                           # start line
    if (( i == ${#lines[@]} - 1 )); then        # for the last element
        end="$"                                 # end line = "$"
    else                                        # otherwise
        end=$(( ${lines[i+1]} - 1 ))            # end line = next start line - 1
    fi
    sed -n "${start},${end}p" "log.txt" > "log$(( i + 1 )).txt"
                                                # extract the lines and write into a separate file
done
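
Both shell approaches re-read the input once per output file, and mapfile needs bash 4 or later, while the bash that ships with macOS is 3.2. If either is a concern, the same split can be done in a single pass in Python. The sketch below is one possible variant; split_at and its out_dir parameter are hypothetical names, and it assumes the line numbers are already ascending integers. Unlike the sed loop, it creates no file for a start line beyond the end of the input.

```python
import os
import tempfile


def split_at(input_path, starts, out_dir="."):
    """Write log1.txt, log2.txt, ... into out_dir, one per start line.

    starts: ascending 1-based line numbers (assumed already validated).
    Lines before the first start are skipped, as in the shell versions.
    Returns the list of paths written.
    """
    written = []
    out = None
    remaining = list(starts)
    try:
        with open(input_path) as src:
            for lineno, line in enumerate(src, start=1):
                if remaining and lineno == remaining[0]:
                    # Reached the next title: switch to a new output file
                    remaining.pop(0)
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"log{len(written) + 1}.txt")
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written


# Small self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "log.txt")
    with open(source, "w") as f:
        f.write("A\n1\n\nB\n2\n3\nC\n")
    for path in split_at(source, [1, 4, 7], out_dir=tmp):
        with open(path) as f:
            print(os.path.basename(path), repr(f.read()))
```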
