Bash script to add double quotes in .CSV comma delimited file

Question

I need to add double quotes to a CSV file. My sample data looks like this:

378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296

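I tried some awk and sed code that is available online, and the results are shown below. The errors: **the first digit of each number is being trimmed; for example, "378478" shows up as just "78478".**

**It is also adding double quotes around the existing double quotes!** Nothing seems to work perfectly. Please guide me!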

"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""

Here is the code I used:

awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file

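My full code is below. My intention is to first convert all .xlsx files to .csv, then add double quotes to that same CSV and save it in the same file. I know the $file.csv part is wrong, so I need some help.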

find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp

cat path/temp | while IFS="" read -r -d $'\0' file; 
do
    echo $file
    ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
    awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done

Answer 1

If you want to process anything other than the simplest CSV files, you should move away from sed and awk. There are better tools available.
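To make the two symptoms from the question concrete, here is a minimal sketch. The helper name naive_field_count and the shortened sample row are made up for illustration, and it assumes bash plus a sed that accepts -E. The first part shows that the pattern ^.|$ replaces the first character rather than inserting before it; the second shows that splitting on every comma ignores quoting.

```bash
#!/bin/bash

# 1) /^.|$/ matches the first CHARACTER (and the end of the line), so the
#    substitution overwrites the leading digit instead of inserting a quote.
trimmed=$(printf '%s\n' '378478' | sed -E 's/^.|$/"/g')
echo "with ^.|\$ : ${trimmed}"

# 2) Hypothetical helper: count fields by splitting on every comma,
#    which is what a plain -F, style approach effectively does.
naive_field_count() {
    local -a fields
    IFS=, read -r -a fields <<< "$1"
    echo "${#fields[@]}"
}

# This row really has 4 fields, but one of them contains two commas.
row='378328,"Total Wireless","Unlimited Talk, Text, & Data",50'
echo "naive count: $(naive_field_count "${row}")"
```

With GNU sed the first line should come out as "78478", matching the trimmed digit in the question, and the naive count is 6 rather than 4 because the commas inside the quoted field are treated as separators.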

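For example, if you sudo apt install csvtool (or the equivalent on your favourite distribution), you can use its call-per-row feature to process each row of the input file. See the following script for an example: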

#!/bin/bash

function quotify {
  # Start empty line, process every field.

  line=""
  while [[ $# -ne 0 ]] ; do
      #    Append comma for all but first field, then quoted field.

      [[ -n "${line}" ]] && line="${line},"
      line="${line}\"$1\""

      shift
  done

  # Output the fully quoted line.

  echo "${line}"
}

# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify

# Pretty-print input and output.

echo "Input file:"
sed 's/^/   /' inputFile.csv

echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/   /'

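Note the quotify function, which is called for each row in the CSV file, with its arguments set to the fields within that row (unquoted, regardless of whether the original field had quotes).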
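Because the fields arrive unquoted, a field that itself contains a double quote would need that quote doubled again before being wrapped. The following is only a sketch of that idea: quotify_escaped is a hypothetical variant, not part of the script above, and it is called directly here so that it runs without csvtool.

```bash
#!/bin/bash

# Hypothetical variant of quotify: doubles embedded quotes, then wraps
# every argument in quotes and joins the results with commas.
quotify_escaped() {
    local q='"' line="" sep="" field
    for field in "$@"; do
        field="${field//$q/$q$q}"        # " becomes ""
        line="${line}${sep}${q}${field}${q}"
        sep=","
    done
    printf '%s\n' "${line}"
}

# Simulate one row as csvtool would pass it: one argument per field.
quotify_escaped 378328 "Total Wireless" 'say "hi"' ""
```

Using a separator variable instead of testing whether the line is empty also avoids any special case for a first field that is empty.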

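The quotify function basically builds a string from all the fields in the row, with quotes around each of them, and then writes it to standard output, as shown in the script output below: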

Input file:
   378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
   378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
   "378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
   "378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"

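Although using a separate tool is probably the easiest way, if you absolutely cannot install other packages, you will have to make do with what you already have. The following bash script is a good starting point, since it uses no other tools to achieve its goal.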

At the moment, it is tied to a very specific set of rules, as follows:

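White space is significant. Anything between the commas is considered part of the field. This is especially important when detecting quoted fields, which must have the quote as their first character: none of that abc, "d,e,f",ghi stuff, since "d,e,f" will not be handled correctly there.
Quoted fields are allowed to contain commas, and "" sequences within them are treated as an escaped ".
It is probably not a good idea to give it malformed CSV files :-)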

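With that in mind, though, here we go. I will give a short description of each part, but hopefully the comments in the code are enough to work out what is going on.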

First, a function to find the position of one string within another, which is useful for working out field boundaries:

function findPos {
    haystack="$1"
    needle="$2"

    # Remove everything past the needle.

    prefix="${haystack%%${needle}*}"

    # If nothing was removed, it wasn't found, so supply massive number.
    # Otherwise, it was found at the length of the string with removed stuff.

    position=999999
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo ${position}
}
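As a quick check of how the position lookup behaves, here is a small sketch. find_pos is a hypothetical variant of the function above, under two assumptions of mine: it uses local variables, and it quotes the needle inside the expansion so that characters such as * are matched literally rather than as glob patterns.

```bash
#!/bin/bash

# Hypothetical variant of findPos: same idea, but with local variables
# and a quoted needle so that glob characters are taken literally.
find_pos() {
    local haystack="$1" needle="$2" prefix position=999999

    # Strip the longest suffix starting at a needle, which leaves only
    # the text before the FIRST occurrence of the needle.
    prefix="${haystack%%"${needle}"*}"

    # If nothing was stripped, the needle is absent: keep the sentinel.
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo "${position}"
}

find_pos 'abc,def' ','       # comma at offset 3
find_pos 'abc' ','           # not found, sentinel value
find_pos '"a,b",c' '",'      # closing quote plus comma at offset 4
find_pos 'a*b' '*'           # literal asterisk at offset 1
```

The large sentinel for "not found" is what lets the later code compare two positions with a plain -lt test without a separate not-found branch.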

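We can then use that in the function that works out the length of the next field. For unquoted fields this basically just looks for the next comma. Quoted fields get special handling: the field is built up from segments, because it has to cope with both escaped quotes and commas inside the quotes: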

function getNextFieldLen {
    line="$1"

    # Empty line means all work done.

    [[ -z "${line}" ]] && echo -1 && return

    # Handle unquoted first, this is easy.

    [[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }

    # Now handle quoted. Loop over all segments where a segment is defined as
    # the text up to the next <"">, assuming it's before the next <",>.

    field=""
    nextQuoteComma=$(findPos "${line}" '",')
    nextDoubleQuote=$(findPos "${line}" '""')
    while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
        # Append segment to the field and go back for next segment.

        field="${field}${line:0:${nextDoubleQuote}}\"\""
        line="${line:${nextDoubleQuote}}"
        line="${line:2}"

        nextQuoteComma=$(findPos "${line}" '",')
        nextDoubleQuote=$(findPos "${line}" '""')
    done

    # Add final segment (up to the comma) and output entire field.

    field="${field}${line:0:${nextQuoteComma}}\""
    echo "${#field}"
}
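For comparison, the same field length can be found by walking the line one character at a time and tracking whether we are inside quotes. To be clear, this is a different technique from the segment-based function above, offered only as a sketch; the name next_field_len is made up, and it assumes well-formed input in which every quote is either a delimiter or part of a "" pair.

```bash
#!/bin/bash

# Hypothetical alternative: return the length of the first field by
# scanning characters and toggling an "inside quotes" flag.
next_field_len() {
    local line="$1" i ch in_quotes=0
    for (( i = 0; i < ${#line}; i++ )); do
        ch="${line:i:1}"
        if [[ "${ch}" == '"' ]]; then
            # A "" pair toggles twice, so the state is unchanged by it.
            in_quotes=$(( 1 - in_quotes ))
        elif [[ "${ch}" == ',' && ${in_quotes} -eq 0 ]]; then
            break                        # comma outside quotes ends the field
        fi
    done
    echo "${i}"
}

next_field_len 'abc,def'                      # 3
next_field_len '"then, strangely,",he,was'    # 18
next_field_len '"""Love"" is grand",19'       # 19
```

If no unquoted comma is found, the loop runs off the end and the whole remaining line is reported as the field.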

Finally, there is a top-level function that will quotify anything that comes in via standard input:

function quotifyStdIn {
    # Process file line by line.

    while read -r line; do
        # Start with empty output line and non-comma separator.

        outLine="" ; sep=""

        # Place terminator to make processing easier, start field loop.

        line="${line},"
        fieldLen=$(getNextFieldLen "${line}")
        while [[ ${fieldLen} -ge 0 ]]; do
            # Get field and quotify if needed, adjust line (remove field and comma).

            field="${line:0:${fieldLen}}"
            [[ "${field:0:1}" = '"' ]] || field="\"${field}\""

            line="${line:$((fieldLen+1))}"
            #line="${line:${fieldLen}}"
            #line="${line:1}"

            # Append to output line and prepare for next field.

            outLine="${outLine}${sep}${field}"; sep=","

            fieldLen=$(getNextFieldLen "${line}")
        done

        # Output built line.

        echo "${outLine}"
    done
}

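And, if you want to read directly from a file (supplying an empty file name or "-" will use standard input, so you could just use the file-based function for everything):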

function quotifyFile {
    file="$1"

    # Empty file or "-" means standard input, otherwise take input from real file.

    [[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
    [[ "${file}" = "-" ]] && { quotifyStdIn; return; }

    quotifyStdIn < "${file}"
}
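The question also wanted the quoted result written back to the same CSV for every file found. One way to sketch that, assuming bash and a find that supports -print0, is to send the filter output to a temporary file and only replace the original if the filter succeeded. rewrite_in_place is a hypothetical helper, and tr stands in for the real filter so that the example runs on its own.

```bash
#!/bin/bash

# Hypothetical helper: run a filter command (reads stdin, writes stdout)
# over a file and replace the file only if the filter succeeded.
# Note that the replacement gets the permissions mktemp assigns.
rewrite_in_place() {
    local file="$1"; shift
    local tmp
    tmp="$(mktemp "${file}.XXXXXX")" || return 1
    if "$@" < "${file}" > "${tmp}"; then
        mv -- "${tmp}" "${file}"
    else
        rm -f -- "${tmp}"
        return 1
    fi
}

# Demo on a throwaway directory; "tr" stands in for the real filter.
demo_dir="$(mktemp -d)"
printf 'a,b\n' > "${demo_dir}/one.csv"
printf 'c,d\n' > "${demo_dir}/two words.csv"

# -print0 with read -d '' keeps names containing spaces or newlines intact.
while IFS= read -r -d '' csv; do
    rewrite_in_place "${csv}" tr 'a-z' 'A-Z'
done < <(find "${demo_dir}" -type f -iname '*.csv' -print0)

cat "${demo_dir}/one.csv" "${demo_dir}/two words.csv"
rm -rf -- "${demo_dir}"
```

With the functions from this answer loaded in the same shell, the call inside the loop would become something like rewrite_in_place "${csv}" quotifyStdIn.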

And finally, because every program that is not "Hello, world" deserves some form of test harness, this is what you can use to test the various capabilities:

(
    echo 'paxdiablo,was here'
    echo 'and,"then, strangely,",he,was,not'
    echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
    echo '17,"""Love"" is grand",19'
) > harness.csv

echo "Before:"
sed "s/^/   /" harness.csv
echo "After:"
quotifyFile harness.csv | sed "s/^/   /"

rm -rf harness.csv

And, since a test harness is of little use unless you actually run the tests, here are the results of the first run:

Before:
   paxdiablo,was here
   and,"then, strangely,",he,was,not
   50,"My name is ""Pax"", and yours is ""Bob""",42
   17,"""Love"" is grand",19
After:
   "paxdiablo","was here"
   "and","then, strangely,","he","was","not"
   "50","My name is ""Pax"", and yours is ""Bob""","42"
   "17","""Love"" is grand","19"

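Hopefully that is enough to get you going if you cannot install packages. Of course, if bash itself is not available to you, then you have problems I cannot help you with :-)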

Answer 2

Your starting CSV is not a good CSV: the 2 rows have different numbers of columns.

+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1      | 2         | 3              | 4                                                                        | 5  | 6                   | 7 | 8    | 9 | 10              | 11                   | 12       |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone       | -                                                                        | -  | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29          | - | -    | - | -               | -                    | -        |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+

Using Miller (https://github.com/johnkerl/miller) you can run

mlr --csv --quote-all -N unsparsify input >output

which gives

"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""

You can download the executable from https://github.com/johnkerl/miller/releases/tag/v5.7.0
