Is it possible to cut a line from a csv if a specific character appears multiple times?

Question

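I am processing a csv, but some lines have too many separators (because of a faulty export from an old database), so they cannot be imported into my postgresql database.

My goal is to loop over every line of the csv and check whether it has exactly the required number of separators (; in my case). If a line is incorrect, I cut it from the csv and paste it into another file.

I tried a few bash scripts, but because I lack skill with regular expressions, I could not get them to work.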

Answer 1

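Do you have the ability to fix the error in the original database export? I assume not, but it is worth asking. There is an old adage in software (as in many other fields): "Treat the disease, not the symptom." If you are able to fix the root cause, it is almost always better to do that than to work around the problem.

That said, one way to do what you want is to read the input file line by line, strip out all the non-separator characters, and count what is left: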

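#!/bin/bash

# Set these variables to your required values:
input_file="filename.csv"
output_file="filename.out.csv"
cut_file="filename.cut.csv"
required_count=10   # expected number of ';' separators per line

# Make sure your output files start empty:
: > "${output_file}"
: > "${cut_file}"

# IFS= and -r keep surrounding whitespace and backslashes intact;
# the || test also handles a final line with no trailing newline.
while IFS= read -r line || [ -n "${line}" ]; do
   # Delete every character that is not ';' so only separators remain.
   separators=${line//[!;]/}
   if [ "${#separators}" -ne "${required_count}" ]; then
     printf '%s\n' "${line}" >> "${cut_file}"
   else
     printf '%s\n' "${line}" >> "${output_file}"
   fi
done < "${input_file}"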

The magic here is the use of "parameter expansion/substitution" in two places:

${line//[!;]/}

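This means:

${line : take the variable 'line'...
// : find all instances of...
[!;] : any character that is not ';'...
/ : and replace each one with...
(nothing) : an empty string...
} : done

and

${#separators} is the number of characters in the variable 'separators'.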
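
To see the two expansions in isolation, here is a minimal sketch. The sample line is made up for illustration and is not the asker's data:

```bash
#!/bin/bash
# Hypothetical sample line, used only to demonstrate the expansions.
line='a;b;c;this is a good line'

# Delete every character that is not ';', leaving only the separators.
separators=${line//[!;]/}

# ${#var} is the length of the string, which here is the separator count.
count=${#separators}

printf 'separators=%s count=%s\n' "$separators" "$count"
# prints: separators=;;; count=3
```

Note that this counts separators, not fields: a line with 3 separators has 4 fields.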

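Parameter expansion and substitution can be hard to understand, but they are very powerful. It is well worth studying all the variations. I found these pages very useful: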

GNU Bash Manual

Linux Documentation Project

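Parameter expansion and substitution are efficient because they happen inside the shell and do not require spawning a subprocess to run an external utility (such as 'sed' or 'cut'). It does not matter much in this case, but as your scripts grow more complex you will want to keep the number of subprocesses to a minimum.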
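
As a rough illustration of that difference, the sketch below counts separators both ways. The function names count_builtin and count_external are hypothetical, chosen only for this example; the second version is not something the script above uses, it just shows what the subprocess-based equivalent might look like:

```bash
#!/bin/bash
# Count ';' using only shell built-ins, so no subprocess is forked.
count_builtin() {
  local only_sep=${1//[!;]/}
  printf '%s\n' "${#only_sep}"
}

# Same count via external utilities; every call starts tr and wc.
# The final tr strips the padding some wc implementations add.
count_external() {
  printf '%s' "$1" | tr -cd ';' | wc -c | tr -d ' '
}

count_builtin 'a;b;c'
count_external 'a;b;c'
```

Both calls should report the same number; inside a loop over many lines, the first form avoids starting several processes per line.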

Answer 2

Sample data:

$ cat db.data
a;b;c;this is a good line
d;e;f;this is a good line
g;h;i;this;is;a;bad;line
jkl;another bad line
x;y;z;this is a good line

Notes:

'good' data contains 4 ;-delimited fields
'bad' data contains fewer or more than 4 ;-delimited fields

One awk idea:

awk -F';' 'NF!=4 {print $0 > "db.bad.data"; next}1' db.data > db.good.data

This generates:

$ head db.*.data
==> db.bad.data <==
g;h;i;this;is;a;bad;line
jkl;another bad line

==> db.good.data <==
a;b;c;this is a good line
d;e;f;this is a good line
x;y;z;this is a good line
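
If you are not sure which field count is the "right" one for a real export, a quick tally of lines per field count can help before splitting. This is only a sketch: it recreates the sample data above in a temporary file, and the variable names tmp and summary are made up for the example. Keep in mind that awk's NF is the number of fields, which is the number of separators plus one.

```bash
#!/bin/bash
# Recreate the sample data from this answer in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'a;b;c;this is a good line' \
  'd;e;f;this is a good line' \
  'g;h;i;this;is;a;bad;line' \
  'jkl;another bad line' \
  'x;y;z;this is a good line' > "$tmp"

# Tally how many lines have each field count, sorted by field count.
summary=$(awk -F';' '{ seen[NF]++ } END { for (n in seen) print n, seen[n] }' "$tmp" | sort -n)
printf '%s\n' "$summary"
rm -f "$tmp"
```

For the sample data this should show three lines with 4 fields, one with 2 and one with 8, so the most common count is the one to keep.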
