Converting Multiple CSV Rows to Individual Columns
I have a CSV file in this format:
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
I want to change it to this:
#Time,CPU 0 Data,CPU 1 Data
x,a,b
y,c,d
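However, I don't know how many CPU cores the system will have (each core is identified by the CPU column). I also have multiple columns of data, not just a single data column.
How can I do this?
Sample input: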
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:10:01 UTC,2,3.87,0,1.38,0.07,0.12,94.55
hostname,600,2018-07-24 00:10:01 UTC,3,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
hostname,600,2018-07-24 00:20:01 UTC,2,5.26,0,1.53,0.07,0.12,93.03
hostname,600,2018-07-24 00:20:01 UTC,3,5.64,0,1.52,0.04,0.12,92.68
This would be the output for that file (CPU -1 becomes CPU ALL; the key is just the timestamp, since hostname and interval stay constant):
# hostname,interval,timestamp,CPU ALL %user,CPU ALL %nice,CPU ALL %system,CPU ALL %iowait,CPU ALL %steal,CPU ALL %idle,CPU 0 %user,CPU 0 %nice,CPU 0 %system,CPU 0 %iowait,CPU 0 %steal,CPU 0 %idle,CPU 1 %user,CPU 1 %nice,CPU 1 %system,CPU 1 %iowait,CPU 1 %steal,CPU 1 %idle,CPU 2 %user,CPU 2 %nice,CPU 2 %system,CPU 2 %iowait,CPU 2 %steal,CPU 2 %idle,CPU 3 %user,CPU 3 %nice,CPU 3 %system,CPU 3 %iowait,CPU 3 %steal,CPU 3 %idle
hostname,600,2018-07-24 00:10:01 UTC,5.19,0,1.52,0.09,0.13,93.07,5.37,0,1.58,0.15,0.15,92.76,8.36,0,1.75,0.08,0.1,89.7,3.87,0,1.38,0.07,0.12,94.55,3.16,0,1.36,0.05,0.14,95.29
hostname,600,2018-07-24 00:20:01 UTC,5.13,0,1.52,0.08,0.13,93.15,4.38,0,1.54,0.13,0.15,93.8,5.23,0,1.49,0.07,0.11,93.09,5.26,0,1.53,0.07,0.12,93.03,5.64,0,1.52,0.04,0.12,92.68
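Your question isn't clear, and it doesn't include the expected output for the larger, presumably more realistic sample CSV you posted, so I don't know what output you're hoping for, but this will at least show you the right approach: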
$ cat tst.awk
BEGIN {
    FS = OFS = ","
}

# Header line: map each column name to its field number.
NR==1 {
    for (i=1; i<=NF; i++) {
        fldName2nmbr[$i] = i
    }
    tsFldNmbr = fldName2nmbr["timestamp"]
    cpuFldNmbr = fldName2nmbr["CPU"]
    next
}

{
    tsVal = $tsFldNmbr
    cpuVal = $cpuFldNmbr

    # Number each timestamp and each CPU in order of first appearance.
    if ( !(seenTs[tsVal]++) ) {
        tsVal2nmbr[tsVal] = ++numTss
        tsNmbr2val[numTss] = tsVal
    }
    if ( !(seenCpu[cpuVal]++) ) {
        cpuVal2nmbr[cpuVal] = ++numCpus
        cpuNmbr2val[numCpus] = cpuVal
    }
    tsNmbr = tsVal2nmbr[tsVal]
    cpuNmbr = cpuVal2nmbr[cpuVal]

    # Join every field except the timestamp and the CPU into one string.
    cpuData = ""
    for (i=1; i<=NF; i++) {
        if ( (i != tsFldNmbr) && (i != cpuFldNmbr) ) {
            cpuData = (cpuData == "" ? "" : cpuData OFS) $i
        }
    }
    data[tsNmbr,cpuNmbr] = cpuData
}

END {
    # Header row: one column per CPU seen in the input.
    printf "%s", "timestamp"
    for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
        printf "%sCPU %s Data", OFS, cpuNmbr2val[cpuNmbr]
    }
    print ""

    # One output row per timestamp, each CPU's data in double quotes.
    for (tsNmbr=1; tsNmbr<=numTss; tsNmbr++) {
        printf "%s", tsNmbr2val[tsNmbr]
        for (cpuNmbr=1; cpuNmbr<=numCpus; cpuNmbr++) {
            printf "%s\"%s\"", OFS, data[tsNmbr,cpuNmbr]
        }
        print ""
    }
}
$ awk -f tst.awk file
timestamp,CPU -1 Data,CPU 0 Data,CPU 1 Data,CPU 2 Data,CPU 3 Data
2018-07-24 00:10:01 UTC,"hostname,600,5.19,0,1.52,0.09,0.13,93.07","hostname,600,5.37,0,1.58,0.15,0.15,92.76","hostname,600,8.36,0,1.75,0.08,0.1,89.7","hostname,600,3.87,0,1.38,0.07,0.12,94.55","hostname,600,3.16,0,1.36,0.05,0.14,95.29"
2018-07-24 00:20:01 UTC,"hostname,600,5.13,0,1.52,0.08,0.13,93.15","hostname,600,4.38,0,1.54,0.13,0.15,93.8","hostname,600,5.23,0,1.49,0.07,0.11,93.09","hostname,600,5.26,0,1.53,0.07,0.12,93.03","hostname,600,5.64,0,1.52,0.04,0.12,92.68"
I put each CPU's data inside double quotes so you can import it into Excel or similar without having to worry about the commas between the sub-fields.
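For the flat layout shown in the question (one column per CPU and metric, with CPU -1 labelled ALL), the same first-appearance indexing can be extended. What follows is only a minimal sketch under stated assumptions: the file names flat.awk and cpu.csv are made up for illustration, the sample is the question's data trimmed to CPUs -1, 0 and 1, and the script assumes that every column to the left of CPU identifies the output row while every column to its right is a per-CPU metric.

```shell
# Trimmed copy of the question's sample input (hypothetical file name).
cat > cpu.csv <<'EOF'
# hostname,interval,timestamp,CPU,%user,%nice,%system,%iowait,%steal,%idle
hostname,600,2018-07-24 00:10:01 UTC,-1,5.19,0,1.52,0.09,0.13,93.07
hostname,600,2018-07-24 00:10:01 UTC,0,5.37,0,1.58,0.15,0.15,92.76
hostname,600,2018-07-24 00:10:01 UTC,1,8.36,0,1.75,0.08,0.1,89.7
hostname,600,2018-07-24 00:20:01 UTC,-1,5.13,0,1.52,0.08,0.13,93.15
hostname,600,2018-07-24 00:20:01 UTC,0,4.38,0,1.54,0.13,0.15,93.8
hostname,600,2018-07-24 00:20:01 UTC,1,5.23,0,1.49,0.07,0.11,93.09
EOF

cat > flat.awk <<'EOF'
BEGIN { FS = OFS = "," }

# Header line: find the CPU column, keep the key header and the metric names.
NR == 1 {
    for (i = 1; i <= NF; i++)
        if ($i == "CPU") cpuFld = i
    for (i = 1; i < cpuFld; i++)
        keyHdr = (i == 1 ? "" : keyHdr OFS) $i
    for (i = cpuFld + 1; i <= NF; i++)
        metric[++numMetrics] = $i
    next
}

{
    # Assumption: all fields left of CPU together identify the output row.
    key = ""
    for (i = 1; i < cpuFld; i++)
        key = (i == 1 ? "" : key OFS) $i
    if (!(key in keyNum)) { keyNum[key] = ++numKeys; keyVal[numKeys] = key }

    # Number the CPUs in order of first appearance.
    cpu = $cpuFld
    if (!(cpu in cpuNum)) { cpuNum[cpu] = ++numCpus; cpuVal[numCpus] = cpu }

    # Store each metric separately so it becomes its own output column.
    for (m = 1; m <= numMetrics; m++)
        cell[keyNum[key], cpuNum[cpu], m] = $(cpuFld + m)
}

END {
    printf "%s", keyHdr
    for (c = 1; c <= numCpus; c++) {
        label = (cpuVal[c] == "-1" ? "ALL" : cpuVal[c])
        for (m = 1; m <= numMetrics; m++)
            printf "%sCPU %s %s", OFS, label, metric[m]
    }
    print ""
    for (k = 1; k <= numKeys; k++) {
        printf "%s", keyVal[k]
        for (c = 1; c <= numCpus; c++)
            for (m = 1; m <= numMetrics; m++)
                printf "%s%s", OFS, cell[k, c, m]
        print ""
    }
}
EOF

awk -f flat.awk cpu.csv
```

Because every metric gets its own column, no quoting is needed here. A CPU that is missing for some timestamp simply leaves its cells empty.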
If we assume the CSV input file is sorted by increasing timestamp, you could try something like this:
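use feature qw(say);
use strict;
use warnings;

my $fn = 'log.csv';
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $header = <$fh>;    # read and discard the header line
my %info;              # time => array of data values, in input order
my @times;             # distinct times, in order of first appearance
while ( my $line = <$fh> ) {
    chomp $line;
    # Handles the three-column example: time, CPU, one data value.
    my ( $time, $cpu, $data ) = split ",", $line;
    push @times, $time if !exists $info{$time};
    push @{ $info{$time} }, $data;
}
close $fh;

# One output line per time, followed by its data values.
for my $time (@times) {
    say join ",", $time, @{ $info{$time} };
}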
Output:
x,a,b
y,c,d
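The same sorted-input assumption also allows a streaming variant that holds only one output row in memory and prints the header line as well. The following awk sketch illustrates that idea for the three-column example; it is not a translation of the Perl script. The file names stream.awk and small.csv are made up, and the sketch assumes that rows with the same time are adjacent and that every time lists the same CPUs in the same order.

```shell
# The small example from the question (hypothetical file name).
cat > small.csv <<'EOF'
#Time,CPU,Data
x,0,a
x,1,b
y,0,c
y,1,d
EOF

cat > stream.awk <<'EOF'
BEGIN { FS = OFS = "," }

# Header line: remember the names of the time and data columns.
NR == 1 { timeHdr = $1; dataHdr = $3; next }

# A new time value starts a new output row; print the finished one first.
$1 != prev {
    if (NR > 2) emit()
    prev = $1
    row = $1
}

{
    # The CPUs of the first time group define the header columns.
    if (!headerDone) hdr = hdr OFS "CPU " $2 " " dataHdr
    row = row OFS $3
}

END { if (NR > 1) emit() }

function emit() {
    if (!headerDone) { print timeHdr hdr; headerDone = 1 }
    print row
}
EOF

awk -f stream.awk small.csv
```

If the input is not grouped by time, the same time would show up in several output rows, so the array-based approach is the safer choice for unsorted files.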