(sed/awk) Extract values from a text file and write to csv (no pattern)

Question

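I have (several) large text files from which I want to extract some values to create a csv file containing all of them.

My current solution is a few separate calls to sed whose output I save, followed by a python script that combines the data from the different files into one csv file. However, this is slow and I want to speed it up.

The file, let's call it my_file_1.txt, has a structure that looks like this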

lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...

I want to build something like

file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...

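How can I get the result I want? It doesn't have to be sed or awk, as long as I don't need to install anything new and it is reasonably fast.

I don't really have any experience with awk. With sed, my best guess would be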

filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
  s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
  h
  $!N
  /.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
  D
  T
  G
  P
' $filename | sed -z 's/,\n/,/' >> my_data.csv

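and then deal with the missing run number separately. Besides, this isn't quite correct, because N swallows some of the "start value" lines, which leads to wrong results. It feels like this could be done more easily with awk.

It is similar to 8992158, but I can't use that pattern and I know too little awk to rewrite it.

Solution (Edit)

My description of the problem was not general enough, so I changed it a bit and fixed some inconsistencies.

Awk (Rusty Lemur's answer)

Here I generalized from assuming the number is at the end of the line to using gensub. For this I should have specified my awk version, because gensub is not available in every awk (it is a GNU awk extension).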

BEGIN {
  counter = 1 
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}

/epoch/ {
  epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}

/stop value/ {
  stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
  
  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1 
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}

I accepted this answer because it was the easiest to understand.
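Because gensub is specific to GNU awk, a portable variant of the script above could be sketched with match() and substr(), which POSIX awk provides. This is only a sketch under assumptions: the function names extract_values and num_after are hypothetical and do not come from any of the answers, and the rules only fire on lines where the keyword is followed by a space and digits.

```shell
# extract_values and num_after are hypothetical names used for illustration only.
extract_values() {
  awk '
    # Return the digits that follow "key " in line, or "" if there is no match.
    function num_after(line, key) {
      if (!match(line, key " [0-9]+")) return ""
      # Skip the key and the single space; keep only the digits.
      return substr(line, RSTART + length(key) + 1, RLENGTH - length(key) - 1)
    }
    BEGIN { OFS = ","; print "file", "start", "stop", "epoch", "run" }
    /start value [0-9]+/ { start = num_after($0, "start value") }
    /epoch [0-9]+/       { epoch = num_after($0, "epoch") }   # last epoch seen wins
    /stop value [0-9]+/  {
      print FILENAME, start, num_after($0, "stop value"), epoch, ++run
      start = epoch = ""   # reset so values do not leak into the next run
    }
  ' "$@"
}
```

It could be called as extract_values my_file_1.txt my_file_2.txt > output.csv. As in the gensub version, the run counter keeps increasing across files rather than restarting for each one.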

Sed (potong's answer)

sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
        /^.*start value/{:a;N;/\n.*stop value/!ba;x
        s/.*/expr & + 1/e;x;G;F
        s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt |
        sed '1!N;s/\n//'

Answer 1

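It's not clear exactly how you get the output you provided from the input you provided, but this may be what you're trying to do (using any awk in any shell on every Unix box):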

$ cat tst.awk
BEGIN {
    OFS = ","
    print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
    print FILENAME, f["start"], f["stop"], f["epoch"], ++run
    delete f
}

$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2

Answer 2

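The basic structure of awk is:

Read a record from the input (by default, a record is a line)
Evaluate the conditions
Apply the actions

The record is split into fields (by default, using whitespace as the delimiter). Fields are referenced by their position, starting at 1: $1 is the first field and $2 is the second. The last field is referenced through a variable named NF, for "number of fields": $NF is the last field, $(NF-1) is the second-to-last field, and so on.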
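As a small illustration of the field variables described above, the following sketch feeds a single line through awk. The sample line is modelled on the question's input, and the variable names line and fields are assumptions made for this example.

```shell
# Sample line modelled on the question's input (an assumption for illustration).
line='stop value 234'

# Print the first field, the second-to-last field, the last field, and the field count.
fields=$(printf '%s\n' "$line" | awk '{ print $1, $(NF-1), $NF, NF }')
printf '%s\n' "$fields"   # stop value 234 3
```

Here the number is the last field, so $NF is enough. For a line such as "some epoch 18 words", $NF would be "words" instead of the number.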

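The BEGIN section is executed before any input files are read, and it can be used to initialize variables (which are otherwise implicitly initialized to 0).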

BEGIN {
  counter = 1
  OFS = ","   # This is the output field separator used by the print statement
  print "file", "start", "stop", "epoch", "run"  # Print the header line
}

/start value/ {
  startValue = $NF  # when a line contains "start value" store the last field as startValue 
}

/epoch/ {
  epoch = $NF
}

/stop value/ {
  stopValue = $NF

  # we have everything to print our line
  print FILENAME, startValue, stopValue, epoch, counter
  counter = counter + 1
  startValue = "" # clear variables so they aren't maintained through the next iteration
  epoch = ""
}

Save it as processor.awk and invoke it as:

awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv

Answer 3

This might work for you (GNU sed):

sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
        /^start value/{:a;N;/\nstop value/!ba;x
        s/.*/expr & + 1/e;x;G;F
        s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
        sed '1!N;s/\n//'

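The solution consists of two sed invocations: the first formats everything except the filename, and the second embeds the filename into the csv file.

On the first line, format the header line and prime the run number.

Gather up the lines between start value and stop value.

Increment the run number, append it to the current line, and output the filename. This prints two lines per record: the first is the filename and the second is the remainder of the csv line.

In the second sed invocation, read two lines at a time (except for the first line) and remove the newline between them, which completes the csv formatting.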
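The joining step can be tried in isolation. In the sketch below, first_pass is a hypothetical stand-in for what the first sed call emits (a header line, then a filename line and a ",values" line per record); it is typed in by hand here, not produced by the script above.

```shell
# Hypothetical stand-in for the output of the first sed call.
first_pass='file,start,stop,epoch,run
my_file_1.txt
,123,234,18,1
my_file_1.txt
,345,456,72,2'

# On every line except the first, append the next line and drop the newline between them.
joined=$(printf '%s\n' "$first_pass" | sed '1!N;s/\n//')
printf '%s\n' "$joined"
```

The header passes through untouched because the address 1! excludes it from the N command, so each following pair of lines collapses into one csv row.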
