Bash script to add double quotes in a comma-delimited .CSV file
I need to add double quotes in a CSV file. My sample data looks like this:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296
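I tried some awk and sed code that I found online, and the results are shown below. The errors: **the first digit of the leading number gets trimmed, e.g. "378478" shows up as just "78478". It also adds double quotes around the existing double quotes!** Nothing seems to work perfectly. Please guide me!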
"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""
Here is the code I used:
awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file
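My full script is below. My intention is to first convert every .xlsx file to .csv, then add the double quotes to that same CSV and save the result in the same file. I know the $file.csv part is wrong, so I need some help with that.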
find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp
cat path/temp | while IFS="" read -r -d $'\0' file;
do
    echo $file
    ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
    awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done
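If you want to handle anything other than the simplest of CSV files, you should move away from sed and awk. There are better tools available.
For example, if you sudo apt install csvtool (or the equivalent on your favorite distribution), you can use its call feature, which invokes a command for every row of the input file. See the following script for an example: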
#!/bin/bash
function quotify {
    # Start empty line, process every field.
    line=""
    while [[ $# -ne 0 ]] ; do
        # Append comma for all but first field, then quoted field.
        [[ -n "${line}" ]] && line="${line},"
        line="${line}\"$1\""
        shift
    done
    # Output the fully quoted line.
    echo "${line}"
}
# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify
# Pretty-print input and output.
echo "Input file:"
sed 's/^/ /' inputFile.csv
echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/ /'
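If csvtool is not installed yet, the function from the script above can still be tried by hand. This is only a sketch: it assumes each field arrives as a separate, already unquoted argument (which is how the function reads its input), and the sample values are taken from the question's data.

```bash
# Same logic as quotify above, written with portable test syntax
# so the snippet runs on its own.
quotify() {
    line=""
    while [ $# -ne 0 ]; do
        # Comma before every field except the first, then the quoted field.
        [ -n "${line}" ] && line="${line},"
        line="${line}\"$1\""
        shift
    done
    echo "${line}"
}

# One argument per field; the empty string stands for an empty field.
quotify 378478 COMPLETED "Total Wireless" "" 50
# prints "378478","COMPLETED","Total Wireless","","50"
```

Note that the function does not double any quote characters inside a field, so a field that itself contains a double quote would need extra handling.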
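Note that the quotify function is called once for each row of the CSV file, with its arguments set to the individual fields of that row (without quotes, whether or not the original field had them).
It simply builds a string from all the fields in the row, each wrapped in quotes, and writes it to standard output, as the script output below shows: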
Input file:
378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"
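Although a separate tool is probably the easiest route, if you absolutely cannot install other packages you will have to make do with the ones you already have. The following bash script is a good starting point, since it relies on no other tools to do its job.
At the moment it is tied to a very specific set of rules:
- Whitespace is significant. Anything between the commas is treated as part of the field. This matters most when detecting quoted fields: the quote *must* be the very first character. None of that abc, "d,e,f",ghi stuff, because the "d,e,f" there will not be handled correctly.
- Quoted fields may contain commas, and a "" sequence inside one stands for a single " character (the script leaves it escaped as "" in the output).
- Feeding it malformed CSV files is probably not a good idea :-)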
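The first rule can be illustrated with a tiny check. This is a sketch only: startsWithQuote is a hypothetical helper name, and it uses a case pattern rather than the substring test the script itself relies on, but it looks at the same thing, namely the first character of the raw field.

```bash
# Hypothetical helper: report whether a raw field would be treated as quoted.
# Only the first character is examined, so leading whitespace changes the answer.
startsWithQuote() {
    case "$1" in
        '"'*) echo "quoted" ;;
        *)    echo "unquoted" ;;
    esac
}

startsWithQuote '"d,e,f"'     # prints quoted
startsWithQuote ' "d,e,f"'    # prints unquoted, because the space comes first
```

In the second case the field would then be split at its first comma, which is why input such as abc, "d,e,f",ghi is not supported.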
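With that in mind, here we go. I'll give a short description of each part, though hopefully the comments in the code are enough to show what is going on.
First, a function that finds the position of one string inside another, which is handy for working out field boundaries: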
function findPos {
    haystack="$1"
    needle="$2"
    # Remove everything past the needle.
    prefix="${haystack%%${needle}*}"
    # If nothing was removed, it wasn't found, so supply massive number.
    # Otherwise, it was found at the length of the string with removed stuff.
    position=999999
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo ${position}
}
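A few sample calls may make the return values clearer. The snippet below repeats the function so that it runs on its own; the only liberty taken is that the needle is quoted inside the expansion so it is always matched literally, which is an assumption on my part rather than something the original does.

```bash
# Position of the first occurrence of needle in haystack (0-based),
# or 999999 when the needle does not occur at all.
findPos() {
    haystack="$1"
    needle="$2"
    # Strip the longest suffix that starts at the first needle.
    prefix=${haystack%%"$needle"*}
    position=999999
    [ "${#prefix}" -ne "${#haystack}" ] && position="${#prefix}"
    echo "${position}"
}

findPos 'abc,def' ','          # prints 3
findPos '"a,b",c' '",'         # prints 4 (the closing quote plus comma)
findPos 'no comma here' ','    # prints 999999
```

The large "not found" value is what lets later comparisons such as "is the next escaped quote before the next closing quote?" work with a plain less-than test.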
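We can then use that in a function that works out the length of the next field. For an unquoted field this is just a search for the next comma. A quoted field gets special handling: the field is built up segment by segment, since it has to cope with both escaped quotes and commas inside the quotes: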
function getNextFieldLen {
    line="$1"
    # Empty line means all work done.
    [[ -z "${line}" ]] && echo -1 && return
    # Handle unquoted first, this is easy.
    [[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }
    # Now handle quoted. Loop over all segments where a segment is defined as
    # the text up to the next <"">, assuming it's before the next <",>.
    field=""
    nextQuoteComma=$(findPos "${line}" '",')
    nextDoubleQuote=$(findPos "${line}" '""')
    while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
        # Append segment to the field and go back for next segment.
        field="${field}${line:0:${nextDoubleQuote}}\"\""
        line="${line:${nextDoubleQuote}}"
        line="${line:2}"
        nextQuoteComma=$(findPos "${line}" '",')
        nextDoubleQuote=$(findPos "${line}" '""')
    done
    # Add final segment (up to the comma) and output entire field.
    field="${field}${line:0:${nextQuoteComma}}\""
    echo "${#field}"
}
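To see the segment logic at work, here is a length-only variation of the same idea. It is a sketch rather than the function above: instead of rebuilding the field it just adds up segment lengths, and the variable names len, qc and dq are my own. As in the original, the line is assumed to end with a terminating comma.

```bash
findPos() {
    haystack="$1"; needle="$2"
    prefix=${haystack%%"$needle"*}
    position=999999
    [ "${#prefix}" -ne "${#haystack}" ] && position="${#prefix}"
    echo "${position}"
}

getNextFieldLen() {
    line="$1"
    # Empty line means there are no fields left.
    [ -z "${line}" ] && { echo -1; return; }
    # Unquoted field: its length is the position of the next comma.
    case "${line}" in
        '"'*) ;;
        *) findPos "${line}" ','; return ;;
    esac
    # Quoted field: consume segments ending in "" while an escaped quote
    # comes before the next closing quote-plus-comma.
    len=0
    qc=$(findPos "${line}" '",')
    dq=$(findPos "${line}" '""')
    while [ "${dq}" -lt "${qc}" ]; do
        len=$((len + dq + 2))      # segment text plus the "" itself
        line=${line#*'""'}         # drop everything through that ""
        qc=$(findPos "${line}" '",')
        dq=$(findPos "${line}" '""')
    done
    # Final segment up to the closing quote, plus the quote.
    echo $((len + qc + 1))
}

getNextFieldLen '378478,COMPLETED,'        # prints 6
getNextFieldLen '"say ""hi"", ok",42,'     # prints 16
getNextFieldLen ''                         # prints -1
```

In the second call the first quote-plus-comma match is really the tail of an escaped quote, but because an escaped quote is found earlier in the line, the loop consumes it first and the real closing quote is located on a later pass.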
Then there is the top-level function, which quotifies whatever arrives on standard input:
function quotifyStdIn {
    # Process file line by line.
    while read -r line; do
        # Start with empty output line and non-comma separator.
        outLine="" ; sep=""
        # Place terminator to make processing easier, start field loop.
        line="${line},"
        fieldLen=$(getNextFieldLen "${line}")
        while [[ ${fieldLen} -ge 0 ]]; do
            # Get field and quotify if needed, adjust line (remove field and comma).
            field="${line:0:${fieldLen}}"
            [[ "${field:0:1}" = '"' ]] || field="\"${field}\""
            line="${line:$((fieldLen+1))}"
            #line="${line:${fieldLen}}"
            #line="${line:1}"
            # Append to output line and prepare for next field.
            outLine="${outLine}${sep}${field}"; sep=","
            fieldLen=$(getNextFieldLen "${line}")
        done
        # Output built line.
        echo "${outLine}"
    done
}
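And, in case you want to read directly from a file (an empty file name or "-" falls back to standard input, so you could just use the file-based function for everything):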
function quotifyFile {
    file="$1"
    # Empty file or "-" means standard input, otherwise take input from real file.
    [[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
    [[ "${file}" = "-" ]] && { quotifyStdIn; return; }
    quotifyStdIn < "${file}"
}
And finally, since *every* program that isn't "Hello, world" deserves some form of test harness, here is one you can use to test the various functions:
(
    echo 'paxdiablo,was here'
    echo 'and,"then, strangely,",he,was,not'
    echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
    echo '17,"""Love"" is grand",19'
) > harness.csv
echo "Before:"
sed "s/^/ /" harness.csv
echo "After:"
quotifyFile harness.csv | sed "s/^/ /"
rm -rf harness.csv
And, since a test harness is of little use unless you actually *run* the tests, here are the results of the first run:
Before:
paxdiablo,was here
and,"then, strangely,",he,was,not
50,"My name is ""Pax"", and yours is ""Bob""",42
17,"""Love"" is grand",19
After:
"paxdiablo","was here"
"and","then, strangely,","he","was","not"
"50","My name is ""Pax"", and yours is ""Bob""","42"
"17","""Love"" is grand","19"
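Hopefully that is enough to get you going if you cannot install packages. Of course, if bash itself is one of the packages you cannot install, then you have a problem I *can't* help you with :-)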
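As for the "save it in the same file" part of the question, one possible approach is to write to a temporary file and move it over the original only if the filter succeeded. This is a sketch under assumptions: rewriteInPlace and csvNameFor are hypothetical helper names, and the filter is expected to read standard input and write standard output.

```bash
# Hypothetical helper: run a filter over FILE and replace FILE with the result.
# Usage: rewriteInPlace FILE COMMAND [ARGS...]
rewriteInPlace() {
    target="$1"; shift
    tmp="${target}.tmp.$$"
    # Keep the original untouched unless the filter exits successfully.
    if "$@" < "${target}" > "${tmp}"; then
        mv "${tmp}" "${target}"
    else
        rm -f "${tmp}"
        return 1
    fi
}

# Hypothetical helper: strip the last extension and append .csv.
csvNameFor() {
    printf '%s\n' "${1%.*}.csv"
}

csvNameFor 'reports/March 2020.xlsx'    # prints reports/March 2020.csv
```

Inside the loop from the question this might be used as csv=$(csvNameFor "$file"), then ssconvert "$file" "$csv", then rewriteInPlace "$csv" quotifyFile - (the "-" makes quotifyFile read standard input). Also note that read -r -d '' only splits on NUL bytes if find is run with -print0 rather than -print.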
Your starting CSV is not a well-formed CSV: its 2 rows have different numbers of columns.
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone | - | - | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29 | - | - | - | - | - | - |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
Using Miller (https://github.com/johnkerl/miller) you can run
mlr --csv --quote-all -N unsparsify input >output
to get
"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""
You can download the executable from https://github.com/johnkerl/miller/releases/tag/v5.7.0