Convert a fixed-width file from text to CSV
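I have a large data file in text format, and I want to convert it to CSV by specifying the length of each column.
Number of columns = 5
Column lengths
[4 2 5 1 1]
Sample observations:
aasdfh9013512
ajshdj 2445df
Expected output
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f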
I would use sed and capture groups with the given lengths:
$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
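The same capture-group idea can be sketched in Python for cases where the column widths are only known at run time. This is a minimal illustration, not part of the original answer; the helper names `build_pattern` and `to_csv_line` are made up for this example.

```python
import re


def build_pattern(widths):
    """Compile a regex with one capture group per fixed-width column."""
    return re.compile("^" + "".join("(.{%d})" % w for w in widths) + "$")


def to_csv_line(line, pattern):
    """Return the comma-joined groups, or None if the line does not fit the layout."""
    match = pattern.match(line)
    return ",".join(match.groups()) if match else None


pattern = build_pattern([4, 2, 5, 1, 1])
for sample in ("aasdfh9013512", "ajshdj 2445df"):
    print(to_csv_line(sample, pattern))
```

Like the sed command, this only converts lines whose length matches the layout exactly; here such lines yield None instead of being passed through unchanged.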
GNU awk (gawk) supports this directly through FIELDWIDTHS, for example:
gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
If anyone is still looking for a solution, I have developed a small script in Python. It is easy to use as long as you have Python 3.5.
https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py
"""
This script converts a fixed-width file into a delimited file. Tried on Python 3.5 only.

Sample run (the order of the arguments doesn't matter):
    python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"

Inputs are as follows:
1. Input file - mandatory (argument -i) - the file that contains the fixed-width data
2. Config file - optional (argument -c; if not provided, the script looks for Config.txt on the same path and does not run if it is not present)
   It should have the format
       FieldName,fieldLength
   e.g.:
       FirstName,10
       SecondName,8
       Address,30
   etc.
3. Output file - optional (argument -o; if not provided, the input file name plus "Delimited.txt" is used)
4. Delimiter - optional (argument -d; if not provided, the default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys


def slices(s, args):
    """Yield consecutive substrings of s whose lengths are given by args."""
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length


def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not os.path.exists(x):
        # Argparse uses the ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x


parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True, help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False, help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False, help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False, help="Provide the delimiter string you want", metavar="STRING", default="|")
args = parser.parse_args()

# Input file is mandatory
InputFile = args.InputFile
# Delimiter is "|" by default
DELIMITER = args.Delimiter

# Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print("Setting Output file as " + OutputFile)
else:
    OutputFile = args.OutputFile

# Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print("There is no Config File provided exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print("Taking Config.txt file on this path as Default Config File")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()

# Read the "FieldName,fieldLength" pairs, keeping them in file order
with open(ConfigFile) as myfile:
    for line in myfile:
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)
for key, value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)

with open(OutputFile, 'w') as f1:
    # Header row built from the field names
    fieldNames = DELIMITER.join(map(str, fieldNames))
    f1.write(fieldNames + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            # Drop the trailing newline so it cannot end up inside a field of a short line
            rec = list(slices(line.rstrip("\n"), fieldLength))
            myLine = DELIMITER.join(map(str, rec))
            f1.write(myLine + "\n")
Here is a solution that works with regular awk (gawk is not required).
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
It uses awk's substr function to define the start position and length of each field. OFS defines the output field separator (a comma in this case).
(Side note: this only works if the source data does not contain any commas. If the data contains commas, they would have to be escaped to produce proper CSV, which is beyond the scope of this question.)
Demo:
echo 'aasdfh9013512
ajshdj 2445df' |
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
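For data that may itself contain commas, as the side note warns, one possible approach is to let Python's standard csv module do the quoting. The sketch below is an illustration only; the function name `fixed_to_csv` and the sample line containing a comma are assumptions made for this example.

```python
import csv
import io


def fixed_to_csv(lines, widths):
    """Slice each line by the given widths and return properly quoted CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        row, pos = [], 0
        for width in widths:
            row.append(line[pos:pos + width])
            pos += width
        writer.writerow(row)  # fields containing a comma or quote get quoted
    return out.getvalue()


# The first field of this made-up line contains a comma.
print(fixed_to_csv(["ab,dfh9013512", "ajshdj 2445df"], [4, 2, 5, 1, 1]), end="")
```

With the default csv.writer settings only the fields that need quoting are quoted, so rows without embedded commas look the same as the awk output.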
Portable awk
Generate an awk script with the appropriate substr commands:
cat cols
4
2
5
1
1
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1
Output:
substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)
Combine the lines and make it a valid awk script:
<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'
Output:
{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }
Redirect the above to a file, e.g. /tmp/t.awk, and run it on the input file:
<infile awk -f /tmp/t.awk
Output:
aasd fh 90135 1 2
ajsh dj  2445 d f
Or with a comma as the output separator:
<infile awk -f /tmp/t.awk OFS=,
Output:
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
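The arithmetic behind the generated script (each start position is one more than the running total of the preceding widths) can be sketched in Python as follows. The names `substr_args` and `awk_program` are hypothetical helpers introduced for this illustration.

```python
from itertools import accumulate


def substr_args(widths):
    """Return 1-based (start, length) pairs in the form awk's substr expects."""
    # Each column starts one past the running total of the widths before it.
    starts = [1] + [1 + total for total in accumulate(widths)][:-1]
    return list(zip(starts, widths))


def awk_program(widths):
    """Build the awk program text from the column widths."""
    calls = ",".join("substr($0,%d,%d)" % pair for pair in substr_args(widths))
    return "{ print " + calls + " }"


print(substr_args([4, 2, 5, 1, 1]))
print(awk_program([4, 2, 5, 1, 1]))
```

The second line printed is the same program text that the paste and sed pipeline produces.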
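Adding a generic way to handle this in awk (an alternative to the FIELDWIDTHS option). We do not need to hard-code the substring positions; commas are inserted wherever they are needed, based on the numbers the user provides. It can be written as shown below, and was written and tested in GNU awk. To use it, define the values (like the ones the OP shows in the sample) in the awk variable named colLengh, as column widths separated by spaces.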
awk -v colLengh="4 2 5 1 1" '
BEGIN{
  num=split(colLengh,arr,OFS)
}
{
  j=sum=0
  # Stop before the last column so that no trailing comma is appended.
  while(++j<num){
    if(length($0)>sum){
      sub("^.{"arr[j]+sum"}","&,")
    }
    sum+=arr[j]+1
  }
}
1
' Input_file
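Explanation: In short, the awk variable colLengh holds the column widths that determine where commas need to be inserted. The BEGIN section then splits it into the array arr, which holds those values.
In the main block, the variables j and sum are first reset to 0. A while loop then runs from j=1 until the value of j reaches num. On each iteration, sub replaces everything from the start of the current line up to the computed position with the matched text plus a comma (only if the length of the current line is greater than sum, since otherwise there is no point in performing the substitution; this is an extra check). For example, the regex in sub is ^.{4} on the first iteration and then ^.{7} on the second, because position 7 is the next place a comma is needed (4 characters, the comma just inserted, and 2 more characters), and so on. Finally, the 1 at the end of the program prints each line, edited or not.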
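To make the position bookkeeping concrete, here is a rough Python sketch of the same idea: insert a comma after each column and shift the running offset by one for every comma already inserted. It is an illustration rather than a translation of the awk program; the function name `insert_commas` is an assumption, and it stops early on lines that are too short instead of relying on a failed regex match.

```python
def insert_commas(line, widths):
    """Insert a comma after every column except the last one."""
    offset = 0
    for width in widths[:-1]:
        offset += width
        if len(line) <= offset:
            break  # nothing left after this column, so no comma is needed
        line = line[:offset] + "," + line[offset:]
        offset += 1  # account for the comma that was just inserted
    return line


for sample in ("aasdfh9013512", "ajshdj 2445df"):
    print(insert_commas(sample, [4, 2, 5, 1, 1]))
```

For the sample widths the commas land after positions 4, 7, 13 and 15 of the growing line, which corresponds to the ^.{4} and ^.{7} patterns described above.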