Tail a file that does not start with a symbol
I have a set of raw csv files that have comment headers (lines starting with the `#` symbol) in addition to the column names, like this:
# This data is taken from ....
# ...
# ...
# ...
# col1,col2,...,coln
#
[csv data rows starts here]
The number of lines above the line containing the column names varies from file to file.
How can I 'cut' these files so the output is a standard CSV file (creating a new file)?
col1,col2,...,coln
[csv data rows starts here]
I am doing some data wrangling in a Jupyter notebook, so I am interested in doing this with inline shell scripting (perhaps using `tail`) and with Python.
Below is a Python version you can use in your Jupyter notebook. You'll need to replace the `<file_name>` placeholders listed in the `file_names = ["<file_name1>","<file_name2>"]` line with your own.
import os
import sys
import pandas as pd
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

def mine_header(fn):
    '''
    Takes a file name as input.
    Assumes the last commented line with contents before the data rows start
    contains the column names. Could be condensed to read in all the text at
    once and then rsplit on the last `#`, but going line by line offers more
    opportunity for customizing later if a file doesn't quite match the
    pattern seen in the data files so far. Alternatively, you could just
    assume the second-to-last line above the data contains the column names;
    in that case you could skip the `header = [x for x in header if x]` line
    and use `col_names = header[-2].split(",")` instead.

    Returns the list of column names and the rest of the contents of the csv
    file beyond the header.
    '''
    beyond_header = False
    header = []     # collect the header lines
    data_rows = ""  # collect the data rows
    # go through the file line by line until beyond the commented-out header
    with open(fn, 'r') as input_file:
        for line in input_file:
            if beyond_header:
                data_rows += line
            elif line.startswith("#"):
                # leave off the comment symbol and remove any leading and
                # trailing whitespace
                header.append(line[1:].strip())
            else:
                # If a line doesn't start with the comment symbol, we have
                # hit the end of the header and want to start collecting the
                # csv data rows.
                data_rows += line
                beyond_header = True
    # Now process the header lines to get the column names. The last header
    # line before the data should be empty, so this list comprehension
    # removes it, leaving the last item as the one with the column names.
    header = [x for x in header if x]
    col_names = header[-1].split(",")
    return col_names, data_rows
file_names = ["<file_name1>","<file_name2>"]
df_dict = {}
for i, fn in enumerate(file_names):
    col_names, data_rows = mine_header(fn)
    df_dict[i] = pd.read_csv(StringIO(data_rows), header=None, names=col_names)

# display the produced dataframes
from IPython.display import display, HTML
for df in df_dict.values():
    display(df)
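The core of the header-mining loop can be sanity-checked on an in-memory sample (the contents below are made up to match the layout in the question):

```python
# Minimal, self-contained demo of the header-mining loop on a made-up sample.
sample = (
    "# This data is taken from somewhere\n"
    "# more notes\n"
    "# col1,col2,col3\n"
    "#\n"
    "1,2,3\n"
    "4,5,6\n"
)

header, data_rows = [], ""
beyond_header = False
for line in sample.splitlines(keepends=True):
    if beyond_header:
        data_rows += line
    elif line.startswith("#"):
        header.append(line[1:].strip())  # drop '#', trim whitespace
    else:
        data_rows += line
        beyond_header = True

header = [x for x in header if x]  # drops the empty '#' line
col_names = header[-1].split(",")
print(col_names)   # ['col1', 'col2', 'col3']
print(data_rows)   # '1,2,3\n4,5,6\n'
```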
The pandas dataframe for each file can be accessed via the index matching the list of files you created. For example, the dataframe made from the third csv file would be `df_dict[2]`.
I went a bit beyond what you asked because splitting out the columns as a list was easy to build into the mining function, and pandas is set up to handle everything after that.
If you really want the output as a standard CSV, you can take the `col_names` and `data_rows` returned by `col_names, data_rows = mine_header(fn)` and save a CSV file. You can combine them into a single string to save like this:
col_names_as_string = ",".join(col_names)
string_to_save = col_names_as_string + "\n" + data_rows
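A minimal sketch of the save step, with stand-in values for what `mine_header()` returns and a made-up output file name:

```python
# Stand-ins for the values mine_header() would return; replace with the
# real ones in your notebook.
col_names = ["col1", "col2", "col3"]
data_rows = "1,2,3\n4,5,6\n"

# Combine into one string and write it out as a standard CSV file
# (the output name "cleaned.csv" is just an example).
string_to_save = ",".join(col_names) + "\n" + data_rows
with open("cleaned.csv", "w") as f:
    f.write(string_to_save)
```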