Read file, multiple lines at a time (separated by delimiter)

Question

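I have a large number of BibTeX entries saved in a single .bib file. I want to read the file, store the data for each article (basically delimited by @) in a variable, extract specific fields, and finally write those fields (tab-separated) to a cleaned-up file.

Input: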

@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
    ...
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
    ...
}

Output:

Wiley   2020    feb
Journal 2010    jul

Code:

while IFS='@' read -r entry; do
    p=$(grep "publisher =" <<< "$entry" | cut ...)
    y=$(grep "year =" <<< "$entry" | awk ...)
    m=$(grep "month =" <<< "$entry" | cut ...)
    echo "$p    $y  $m" >> cleaned_up.bib
done < global.bib

Is there a way to make the `while read` command in bash operate on delimited chunks of text at a time, instead of single lines? `sed`/`awk` solutions would be more than welcome.

Answer 1

With GNU awk:

awk '{print $17, $6, $11}' RS='}\n' FS='( +|{|}|,)' OFS='\t' global.bib

Output:

Wiley   2020    feb
Journal 2010    jul

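I set the input record separator (RS) to } followed by a newline; the default is a newline.

I set the input field separator (FS) to one or more spaces ( +), or {, or }, or a comma. OFS is the output field separator.

A different notation with the same output: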

awk 'BEGIN{RS="}\n"; FS="( +|{|}|,)"; OFS="\t"} {print $17, $6, $11}' global.bib
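
The field numbers in the print statement depend on how FS carves up each record, and the empty fields between adjacent separators (for example between a space and {) count too. Here is a minimal sketch for inspecting that, assuming an awk that treats a multi-character RS as a regex (GNU awk does). The sample file, the temp path and the name `dump_fields` are made up for illustration. FS is written with a bracket expression here, which should split the same way as the alternation used above:

```sh
# Hypothetical sample data, written to a temp file
bib=$(mktemp)
cat > "$bib" <<'EOF'
@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
}
EOF

# dump_fields FILE: list index and content of every non-blank field
# in the first record, using the same RS as the answer
dump_fields() {
    awk 'NR == 1 {
        for (i = 1; i <= NF; i++)
            if ($i ~ /[^ \t\n]/) print i, $i
    }' RS='}\n' FS='( +|[{},])' "$1"
}

dump_fields "$bib"
```

On this sample the year, month and publisher values show up as fields 6, 11 and 17. If an entry lists its fields in a different order, the numbers shift accordingly.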

Answer 2

With GNU ed and column

Given the file global.bib:

@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
    mama = {foo},
    papa = {bar},
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
    mama = {foo},
    papa = {bar},
}
@article{Author3_2010,
    year = 2010,
    month = {aug},
    publisher = {Josh},
    mama = {foo},
    papa = {bar},
}
@article{Author4_2030,
    year = 2030,
    month = {dec},
    publisher = {Jetchisel},
    mama = {foo},
    blah = {qux},
    papa = {bar},
}

The ed script; let's call it script.ed:

g/./s/^@.*//\
s/^}.*//
v/^.*publisher =.*$\|^.*year =.*$\|^.*month =.*$\|^$/d
,s/^.*publisher = \|^.*year = \|^.*month = //
g/./s/}//\
s/{//\
s/,//
g/./s/$/ /
g/./;/^$/j
,s/\([^ ]*\) \([^ ]*\) \([^ ]*\)/\3 \1 \2/
g/^$/d
,p
Q

Now run the ed script against the file and pipe the result to column with the -t flag.

ed -s global.bib < script.ed | column -t

Output

Wiley      2020  feb
Journal    2010  jul
Josh       2010  aug
Jetchisel  2030  dec

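A brief explanation.

Lines 1 and 2: g means global, i.e. search the whole file. Every line starting with @ or } is replaced with nothing, which leaves an empty line.
The \ is a line continuation, so lines 1 and 2 are really a single command, separated by a newline.
Line 3: v is the inverse of g. It acts on every line that does not match what is inside / /, here publisher, year, month, or an empty/blank line; d deletes those lines.
Line 4: ,s substitutes across the whole file, much like g. It removes whatever matches inside / / rather than deleting the line that contains it.
Lines 5 to 7 are also joined by a trailing \; they remove whatever matches inside / /, namely {, } and the comma.
Line 8 adds a trailing space to every non-empty line.
Line 9: for every non-empty line (g is global), join it with the lines that follow, up to the next empty line.
Line 10 captures the fields with back-references and prints them in the desired order.
Line 11 deletes all empty/blank lines.
Line 12: ,p prints the whole buffer to stdout.
Line 13: Q quits without an error even if the buffer has been modified; change it to w if the file needs to be edited in place.
You can also run the ed script line by line; just enter each line that ends with \ together with the line after it, since they form a single ed command.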

With bash 4+, grep and column:

#!/usr/bin/env bash

limit=3

while mapfile -n "$limit" -t array; (( ${#array[*]} )); do
  array=("${array[@]//[\}\{,]}")
  array=("${array[@]#*= }")
  printf '%s %s %s\n' "${array[2]}" "${array[0]}" "${array[1]}"
done < <(
  grep -E '^[[:space:]]*(publisher|year|month) = ' global.bib
) | column -t
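
The loop above consumes three grep-filtered lines per iteration. For the `while read` form asked about in the question, a minimal sketch (not part of the original answer) is to give bash's `read` a delimiter with `-d`, so that it returns one @-delimited chunk at a time. The sample data, the temp file and the function name `bib_fields` are hypothetical, and the sed patterns assume the `name = {value},` layout shown in the question:

```sh
#!/usr/bin/env bash
# Hypothetical sample data, written to a temp file
bib=$(mktemp)
cat > "$bib" <<'EOF'
@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
}
EOF

# bib_fields FILE: print publisher, year, month (tab-separated) per entry
bib_fields() {
  local entry p y m
  # -d '@' makes read stop at each '@' instead of at each newline;
  # the || test keeps the last chunk, which has no trailing '@'.
  while IFS= read -r -d '@' entry || [[ -n $entry ]]; do
    # skip chunks without the fields (e.g. the empty one before the first '@')
    [[ $entry == *publisher* ]] || continue
    p=$(sed -n 's/^[[:space:]]*publisher = {\(.*\)},*$/\1/p' <<< "$entry")
    y=$(sed -n 's/^[[:space:]]*year = \([0-9]*\),*$/\1/p' <<< "$entry")
    m=$(sed -n 's/^[[:space:]]*month = {\(.*\)},*$/\1/p' <<< "$entry")
    printf '%s\t%s\t%s\n' "$p" "$y" "$m"
  done < "$1"
}

bib_fields "$bib"
```

This spawns three sed processes per entry, so for a large file the awk approaches in the other answers are likely to be much faster. A literal @ inside a field value would also split an entry in the wrong place.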

Answer 3

Another awk approach, which should be portable across all awk variants, uses '=' as the field separator, for example:

awk -F= '
    $1~/[ ]*year/       { year = substr($2,2,match($2,/,/)-2) }
    $1~/[ ]*month/      { month = substr($2,3,match($2,/,/)-4) }
    $1~/[ ]*publisher/  { pub = substr($2,3,match($2,/,/)-4) }
    FNR>1 && $1~/^@/    { print pub"\t"year"\t"month }
    END                 { print pub"\t"year"\t"month }
' list.bib

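Each of the first three rules extracts the year, month or publisher using substr() and match(). The END rule prints the last set of values collected.

Example Use/Output

With the sample data in list.bib, running the command gives: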

awk -F= '
    $1~/[ ]*year/       { year = substr($2,2,match($2,/,/)-2) }
    $1~/[ ]*month/      { month = substr($2,3,match($2,/,/)-4) }
    $1~/[ ]*publisher/  { pub = substr($2,3,match($2,/,/)-4) }
    FNR>1 && $1~/^@/    { print pub"\t"year"\t"month }
    END                 { print pub"\t"year"\t"month }
' list.bib
Wiley   2020    feb
Journal 2010    jul
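
To make the substr()/match() arithmetic concrete, here is a small sketch that runs it on a single field line. The function name `show_match` and the input line are made up for illustration. With `-F=`, the second field is everything after the equals sign, including the leading space:

```sh
# show_match LINE: print the comma position in $2, then the extracted value
show_match() {
    printf '%s\n' "$1" | awk -F= '{
        comma = match($2, /,/)            # 1-based position of the comma in $2
        print comma
        print substr($2, 3, comma - 4)    # skip " {" and stop before "},"
    }'
}

show_match '    month = {feb},'
# prints:
# 7
# feb
```

If a field line has no trailing comma, match() returns 0 and the computed length becomes negative, so the extracted value comes out empty.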

Answer 4

With GNU awk:

awk 'match($0, /\s*(\S+)\s*=\s*\{?([^},]*)/, a) { r[a[1]]=a[2] }
     /^\}$/ { print r["publisher"], r["year"], r["month"] }
    ' OFS='\t' global.bib

Answer 5

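Whenever input data has tag-value pairs, I find it best to first create an array of that mapping (f[] below); then you can print whichever fields you like, by tag (name), in whatever order you like: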

$ cat tst.awk
BEGIN {
    OFS="\t"
    numTags = split(flds,tags)
}
/^}/ {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        printf "%s%s", f[tag], (tagNr<numTags ? OFS : ORS)
    }
    delete f
    next
}
{
    gsub(/^[[:space:]]+|[[:space:],]+$/,"")
    tag = val = $0
    if ( sub(/^@/,"",tag) ) {
        sub(/\{.*/,"",tag)
        sub(/[^{]+\{/,"",val)
    }
    else {
        sub(/[[:space:]]*=.*/,"",tag)
        sub(/[^=]+=[[:space:]]*/,"",val)
        gsub(/^\{|\}$/,"",val)
    }
    f[tag] = val
}

.

$ awk -v flds='publisher year month' -f tst.awk file
Wiley   2020    feb
Journal 2010    jul

.

$ awk -v flds='month year article publisher' -f tst.awk file
feb     2020    Author1_2020    Wiley
jul     2010    Author2_2010    Journal

Given the above approach, you can simply add comparisons to the code inside the /^}/ { ... } block, e.g.

if ( (f["publisher"] == "Wiley") && (f["year"] == 2020) ) {
    do whatever you like
}

Or you can adapt it to convert your input to CSV or JSON or any other output format you like.
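
As one possible take on the CSV idea, here is a minimal sketch built on the same tag-to-value array. It is a simplified variant rather than the tst.awk script above: it only handles `tag = value,` lines (so the entry key is not available), and the sample data, temp file and function name `bib_to_csv` are hypothetical:

```sh
# Hypothetical sample data, written to a temp file
bib=$(mktemp)
cat > "$bib" <<'EOF'
@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
}
EOF

# bib_to_csv 'tag1 tag2 ...' FILE: one quoted CSV row per entry
bib_to_csv() {
    awk -v flds="$1" '
    BEGIN { numTags = split(flds, tags) }
    /^}/ {                                   # end of entry: emit the row
        for (i = 1; i <= numTags; i++) {
            val = f[tags[i]]
            gsub(/"/, "\"\"", val)           # CSV-escape embedded quotes
            printf "\"%s\"%s", val, (i < numTags ? "," : "\n")
        }
        delete f
        next
    }
    /=/ {                                    # "tag = value," line
        tag = val = $0
        gsub(/^[ \t]+|[ \t]*=.*/, "", tag)   # keep only the tag name
        sub(/^[^=]*=[ \t]*/, "", val)        # drop everything up to the value
        sub(/,[ \t]*$/, "", val)             # drop the trailing comma
        gsub(/^[{]|[}]$/, "", val)           # drop the outer braces
        f[tag] = val
    }' "$2"
}

bib_to_csv 'publisher year month' "$bib"
# prints:
# "Wiley","2020","feb"
# "Journal","2010","jul"
```

Quoting every value keeps publishers that contain commas in a single CSV column.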

Answer 6

This might work for you (GNU sed):

sed '/^@/{:a;N;/^}/M!ba;s/.*year = \(....\).*month = {\(...\)}.*publisher = {\([^}]*\)}.*/\3\t\1\t\2/}' file

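Gather up the lines from one that begins with @ to one that begins with }.

Using pattern matching, extract the required fields and separate the results with tabs.

N.B. The M flag is used on the regexp because the pattern space holds several lines (the inner ones start with spaces), and M lets ^ match at the start of each of them.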
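
The gathering step can be looked at on its own. The sketch below, which assumes GNU sed, collects each entry into the pattern space with the same N loop and then simply joins it onto one line instead of extracting fields. It tests for a newline followed by a closing brace at the end of the pattern space rather than using the M flag. The sample file and the name `join_entries` are hypothetical:

```sh
# Hypothetical sample data, written to a temp file
bib=$(mktemp)
cat > "$bib" <<'EOF'
@article{Author1_2020,
    year = 2020,
    month = {feb},
    publisher = {Wiley},
}
@article{Author2_2010,
    year = 2010,
    month = {jul},
    publisher = {Journal},
}
EOF

# join_entries FILE: print every @...} entry on a single line
join_entries() {
    # :a;N;...!ba  keeps appending lines until the pattern space ends in "\n}"
    # s/\n */ /g   then replaces each newline and its indentation with a space
    sed -n '/^@/{:a;N;/\n}$/!ba;s/\n */ /g;p;}' "$1"
}

join_entries "$bib"
```

Once an entry sits on one line, a single substitution like the one in the answer can pick out the fields, as long as they appear in the order the regexp expects.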
