Create CSV file from a text file with header tokens using shell scripting

Question

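I want to create a CSV file from a bunch of text files in a directory, all with the following structure, so that I can import them into a database later.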

Title:
Article title

Word Count:
100

Summary:
Article summary.

Can consist of multiple lines.
    
Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.

Till the end of the file.

So the desired result is a single CSV file with the section names as headers and the section contents as rows, like this:

Title          | Word Count | Summary                  | Keywords                     | Article Body |
Article title  | 100        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |
Article2 title | 110        | Article summary.\nCan... | keyword1, keyword2, keyword3 | ...          |

I tried a few things with awk and shell scripting, but without success so far. Any ideas?

Answer 1

According to the documentation of COPY, PostgreSQL fully supports the CSV format, as well as a text format that is compatible with lossless TSV.

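Because I'm using awk, I chose to generate a TSV. The reason is that the data contains newlines, and POSIX awk doesn't allow literal newlines in user-defined variables. A TSV doesn't have that problem, because you have to replace the literal newlines with their C-style notation \n.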
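
As a minimal sketch of that escaping step on its own: the snippet below reads a small multi-line value as one record and prints it on a single TSV-safe line. The function name tsv_escape is made up for illustration, and the input is assumed to contain no empty lines.

```shell
# tsv_escape is a hypothetical helper name, not part of the answer itself.
# With RS="" the whole input (no empty lines assumed) arrives as one record.
tsv_escape() {
    awk -v RS='' '{
        gsub(/\\/, "\\\\")   # backslash first, so the escapes added below are not doubled
        gsub(/\n/, "\\n")
        gsub(/\r/, "\\r")
        gsub(/\t/, "\\t")
        print
    }'
}

printf 'first line\nsecond\tline\n' | tsv_escape
# prints: first line\nsecond\tline
```

The backslash is escaped before the other characters so that a reader of the TSV can reverse the process without ambiguity.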

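In addition, I changed the input format to make it easier to parse. The new rule is that one or more empty lines separate the records, which means that you can't have empty lines in the content of Summary or Article Body; the work-around is to add a single space character, like I did in the example.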

Input example:

Title:
Article title

Word Count:
100

Summary:
Article summary.
 
Can consist of multiple lines.

Keywords:
keyword1, keyword2, keyword3


Article Body:
The rest of the article body.
 
Till the end of the file.
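
To see how the empty-line rule plays out, here is a small sketch of awk's paragraph mode (RS=''), which the command below relies on. The helper name show_records and the sample text are made up for illustration.

```shell
# show_records is a hypothetical helper: it lists every paragraph-mode record
# with its first line and its number of lines.
show_records() {
    awk -v RS='' '{
        n = split($0, lines, "\n")   # the lines that make up the current record
        printf "record %d: %s (%d lines)\n", NR, lines[1], n
    }'
}

# Made-up sample: the line holding a single space does not end the Summary
# record, while the truly empty lines do.
printf 'Title:\nFoo\n\nSummary:\nline 1\n \nline 2\n\n\nKeywords:\na, b\n' | show_records
# prints:
# record 1: Title: (2 lines)
# record 2: Summary: (4 lines)
# record 3: Keywords: (2 lines)
```

Two consecutive empty lines count as a single separator, so no empty record shows up between Summary and Keywords.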

And here's the awk command, which accepts multiple files as arguments:

(Edit: added TSV escaping for the header, added basic comments, reduced code size)

awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { ++onr } # the current file number is our "output record number"
    /^[^:\n]+:/ {
        # lossless TSV escaping
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")

        # get the current field name
        id = substr($0,1,index($0,":")-1)

        # strip the first line (NOTE: the newline character is escaped)
        sub(/^(\\[^n]|[^\\])*\\n/,"")

        # save the data
        fields[id]           # keep track of the field names that we came across
        records[0,id] = id   # for the header line
        records[onr,id] = $0 # for the output record
    }
    END {
        # print the header (onr == 0) and the records (onr >= 1)
        for (i = 0; i <= onr; i++) {
            out = sep = ""
            for (id in fields) {
                out = out sep records[i,id]
                sep = OFS
            }
            print out
        }
    }
' *.txt

And the output (for better readability I replaced all the literal tabs with |):

Summary | Article Body | Word Count | Title | Keywords
Article summary.\n \nCan consist of multiple lines. | The rest of the article body.\n \nTill the end of the file. | 100 | Article title | keyword1, keyword2, keyword3
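
The columns come out in a different order than the one asked for in the question because awk visits the keys of `for (id in fields)` in an unspecified order. If the field names are known in advance, one possible variant is to pass them in explicitly. This is only a sketch: the wrapper name tsv_in_order and the cols variable are made up, and the column list is an assumption taken from the sample file.

```shell
# tsv_in_order is a hypothetical wrapper; the column list in cols is an
# assumption based on the sample input and has to match your files.
tsv_in_order() {
    awk -v RS='' -v FS='^$' -v OFS='\t' \
        -v cols='Title,Word Count,Summary,Keywords,Article Body' '
        BEGIN { ncols = split(cols, col, ",") }
        FNR == 1 { ++onr }                      # one output record per input file
        /^[^:\n]+:/ {
            # same lossless TSV escaping as in the command above
            gsub(/\\/,"\\\\"); gsub(/\n/,"\\n"); gsub(/\r/,"\\r"); gsub(/\t/,"\\t")
            id = substr($0, 1, index($0, ":") - 1)
            sub(/^(\\[^n]|[^\\])*\\n/, "")      # strip the (escaped) first line
            records[onr, id] = $0
        }
        END {
            # row 0 is the header, rows 1..onr are the files
            for (i = 0; i <= onr; i++) {
                out = ""
                for (c = 1; c <= ncols; c++)
                    out = out (c > 1 ? OFS : "") (i == 0 ? col[c] : records[i, col[c]])
                print out
            }
        }
    ' "$@"
}
```

It would be called like the original command, for example `tsv_in_order *.txt > articles.tsv`. A field that is missing from a file simply ends up as an empty column.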

Postscript: once you have a valid TSV file, you can use a tool like mlr to convert it to CSV, JSON, etc., but for importing the data into PostgreSQL that isn't required.

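The SQL statement would be something like this (untested; as far as I know, HEADER with the text format needs PostgreSQL 15 or later):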

COPY table_name FROM '/path/file.tsv' WITH HEADER;

(Remark: you don't need to specify FORMAT and DELIMITER because the defaults are already text and \t)
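
If an actual CSV file is needed and mlr isn't at hand, the conversion mentioned in the postscript can also be sketched in awk. This is a rough illustration, not a replacement for a real converter: the function name tsv_to_csv is made up, every field is quoted unconditionally, and the input is assumed to use the escaping produced by the command above.

```shell
# tsv_to_csv is a hypothetical helper: it turns escaped TSV lines into CSV rows
# in which every field is quoted.
tsv_to_csv() {
    awk -F'\t' '
        # undo the TSV escaping one character at a time
        function unescape(s,   out, i, c, n) {
            out = ""
            n = length(s)
            for (i = 1; i <= n; i++) {
                c = substr(s, i, 1)
                if (c == "\\" && i < n) {
                    c = substr(s, ++i, 1)
                    if      (c == "n") c = "\n"
                    else if (c == "t") c = "\t"
                    else if (c == "r") c = "\r"
                    # anything else, including a backslash, stands for itself
                }
                out = out c
            }
            return out
        }
        {
            line = ""
            for (f = 1; f <= NF; f++) {
                v = unescape($f)
                gsub(/"/, "\"\"", v)             # CSV doubles embedded double quotes
                line = line (f > 1 ? "," : "") "\"" v "\""
            }
            print line
        }
    ' "$@"
}
```

For example, `tsv_to_csv file.tsv > file.csv`. The unescaping walks the string character by character so that an escaped backslash followed by the letter n is not mistaken for a newline.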

Answer 2

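I made a few changes to @Fravadona's script so that it generates an insert statement instead. That seemed more practical to me, and it works. The answer was really helpful though; I'm just adding this here as a reference, in case it's useful to someone else.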

awk -v RS='' -v FS='^$' -v OFS='\t' '
    FNR == 1 { fn++ }
    /^[^:]+:/ {
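        fieldName = substr($0,1,index($0,":")-1)

        # strip the "Name:" line, then escape backslashes and control characters
        sub("^[^:]+:[^\n]*\n","")
        gsub(/\\/,"\\\\")
        gsub(/\n/,"\\n")
        gsub(/\r/,"\\r")
        gsub(/\t/,"\\t")

        header[fieldName]
        record[fn,fieldName] = $0
    }
    END {
        prefix = "insert into article(summary, content, word_count, title, keywords) values(E\047"
        sep = "\047,\047"

        # one statement per input file; for-in order is unspecified, so check
        # that the column list above matches the order your awk produces
        for (i = 1; i <= fn; i++) {
            out = ""
            for (fieldName in header) {
                out = out record[i,fieldName] sep
            }
            # drop the trailing ",\047" (substr positions start at 1)
            print prefix substr(out,1,length(out)-2) ")"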
        }
    }
' x.txt

Result:

insert into article(summary, content, word_count, title, keywords) values(E'Article summary.\n \nCan consist of multiple lines.','The rest of the article body.\n \nTill the end of the file.','100','Article title','keyword1, keyword2, keyword3')
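
One thing the statement above doesn't guard against is a single quote in the data, which would end the string literal early. A possible way to handle it, shown here only as a sketch with a made-up helper name (sql_quote), is to double every single quote before wrapping the value in E'...':

```shell
# sql_quote is a hypothetical helper: it reads one already-escaped value per
# line and prints it as a PostgreSQL escape-string literal.
sql_quote() {
    awk '{
        gsub(/\047/, "\047\047")    # \047 is the single quote; double each one
        print "E\047" $0 "\047"
    }'
}

printf '%s\n' "It's line 1\\nline 2" | sql_quote
# prints: E'It''s line 1\nline 2'
```

The same gsub call could be placed next to the other gsub calls in the script above, as long as it runs before the values are joined.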
