Reading a log with dashed lines into a pandas DataFrame
I have a tricky log file that I want to get into a clean DataFrame. The log format is as follows:
===============================================================================
2016/03/28 12:26:45 - Message
-------------------------------------------------------------------------------
2016/03/28 12:26:45 - Message
2016/03/28 12:26:45 - Message
Message
2016/03/28 12:26:45 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:28:30 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
-------------------------------------------------------------------------------
2016/03/28 12:28:40 - Message
===============================================================================
The log continues in the above pattern, and my goal is to have the following DataFrame:
Time Text
2016/03/28 12:26:45 Message
I have tried parsing the file on '-' to create a DataFrame, and removing the dashed lines.
import pandas as pd
from pandas.compat import StringIO
clean = open(filename).read().remove('-------------------------------------------------------------------------------', '')
clean2 = open(filename).read().replace('===============================================================================', '')
df = pd.read_csv(filename, sep = "\s*\-", names = ["Time", "Text"], engine = "python")
df.Time = pd.to_datetime(df.Time, format='%d/%m/%y %H:%M:%S.%f')
df.Text = df.Text
But I get a lot of NaN columns. Any help is appreciated.
I think you can use to_datetime with errors='coerce' to replace bad data with NaT, and then dropna to remove all rows where the Time column is NaT:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in current pandas
temp=u"""===============================================================================
2016/03/28 12:26:45 - Message
-------------------------------------------------------------------------------
2016/03/28 12:26:45 - Message
2016/03/28 12:26:45 - Message
Message
2016/03/28 12:26:45 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:26:46 - Message
2016/03/28 12:28:30 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
2016/03/28 12:28:40 - Message
-------------------------------------------------------------------------------
2016/03/28 12:28:40 - Message
==============================================================================="""
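# after testing, replace StringIO(temp) with filename
# raw string so the regex escapes are passed through unchanged
df = pd.read_csv(StringIO(temp), sep=r"\s+-\s+", names=["Time", "Text"], engine="python")
# separator lines and lines without a timestamp become NaT
df["Time"] = pd.to_datetime(df["Time"], errors='coerce')
df.dropna(subset=['Time'], inplace=True)
print(df)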
                  Time     Text
1  2016-03-28 12:26:45  Message
3  2016-03-28 12:26:45  Message
4  2016-03-28 12:26:45  Message
6  2016-03-28 12:26:45  Message
7  2016-03-28 12:26:46  Message
8  2016-03-28 12:26:46  Message
9  2016-03-28 12:28:30  Message
10 2016-03-28 12:28:40  Message
11 2016-03-28 12:28:40  Message
12 2016-03-28 12:28:40  Message
14 2016-03-28 12:28:40  Message
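The two separators behave quite differently on the dashed lines, which probably explains the extra NaN columns in the question. The sketch below uses Python's `re.split` as an approximation of how a regex separator is applied. It is an illustration only, not a statement of how `read_csv` works internally, and the variable names are made up for the example.

```python
import re

dashed = "-" * 79
entry = "2016/03/28 12:26:45 - Message"

# Separator from the question: matches every hyphen, including each one
# in a dashed line.
loose = re.compile(r"\s*\-")
# Separator from the answer: requires whitespace on both sides of the hyphen.
strict = re.compile(r"\s+-\s+")

# The dashed line falls apart into many empty fields with the loose pattern.
print(len(loose.split(dashed)))
# The log entry keeps a leading space in the message part.
print(loose.split(entry))

# With the strict pattern the dashed line stays in one piece.
print(len(strict.split(dashed)))
# The log entry splits cleanly into timestamp and message.
print(strict.split(entry))
```

With the strict pattern, a separator line ends up as a single value in the Time column, which `errors='coerce'` then turns into NaT.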
A more verbose alternative to @jezrael's very good solution is as follows:
import pandas as pd

infile = "test.txt"  # this is your file
df = pd.DataFrame(columns=['Time', 'Text'])
with open(infile, "r") as inf:
    for i, line in enumerate(inf):
        line = line.strip()
        # skip the separator lines made of '-' or '='
        if line.startswith("-") or line.startswith("="):
            continue
        # split on the first " - " only, so hyphens inside the message survive
        parts = line.split(" - ", 1)
        if len(parts) > 1:
            df.loc[i] = pd.Series({'Time': parts[0], 'Text': parts[1]})
# the with block closes the file, so no explicit close() is needed
I'm not sure whether you want the Time column converted to the pandas datetime format. If so, just add:
df.Time = pd.to_datetime(df.Time)
at the end of the script.
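A variation on the loop above is to collect the parsed rows in a plain list and build the DataFrame once at the end. The sketch below covers only the parsing step with the standard library. The regular expression and the names `LINE_RE` and `parse_log_lines` are assumptions made for this example, based on the sample log in the question.

```python
import re
from datetime import datetime

# Matches "YYYY/MM/DD HH:MM:SS - text". The pattern is an assumption
# derived from the sample log, not a general log format.
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+-\s+(.*)$")


def parse_log_lines(lines):
    """Return (timestamp, text) tuples for lines that start with a timestamp."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            # Separator rows and lines without a timestamp are skipped.
            continue
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        rows.append((stamp, match.group(2)))
    return rows


sample = [
    "=" * 79,
    "2016/03/28 12:26:45 - Message",
    "-" * 79,
    "Message",
    "2016/03/28 12:28:40 - Message with - a hyphen",
]
rows = parse_log_lines(sample)
print(rows)
```

The resulting list can be passed to `pd.DataFrame(rows, columns=["Time", "Text"])`, which avoids growing the DataFrame one row at a time with `df.loc[i]`. For a real file, pass the open file object instead of `sample`.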