如何合并csv文件的连续行
How to merge continuous lines of a csv file
我有一个 csv 文件,它通过视频帧承载某些进程的输出。在文件中,每一行要么是 fire
要么是 none
。每行有 startTime
和 endTime
。现在我只需要从 continuous fires 中聚集并打印一个实例,其 start 和 end 时间.重点是中间的几个none
如果他们的时间在1秒以内也是可以容忍的。所以要明确一点,重点是将更近帧的检测聚集在一起……以某种方式平滑结果。一行 31-35
秒,而不是多个 31-32, 32-33, ...
。
怎么做?
例如,由于 none
间隔在 1s 以内,因此以下所有连续项都被视为单个项。所以我们会得到类似 1,file1,name1,30.6,32.2,fire,0.83
的结果,该分数是所有火线的平均值。
frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
...
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344
这是我目前的尝试:
with open(filename) as fin:
lastWasFire=False
for line in fin:
if "fire" in line:
if lastWasFire==False and line !="" and line.split(",")[5] != lastline.split(",")[5]:
fout.write(line)
else:
lastWasFire=False
lastline=line
我假设您不想使用外部库进行数据处理,例如 numpy
或 pandas
。以下代码应该与您的尝试非常相似:
threshold = 1.0
# We will chain a "none" object at the end which triggers the threshold to make sure no "fire" objects are left unprinted
from itertools import chain
trigger = (",,,0,{},,none,".format(threshold + 1),)
# Keys for columns of input data
keys = (
"frame_num",
"uniqueId",
"title",
"startTime",
"endTime",
"startTime_fmt",
"object",
"score",
)
# Store last "fire" or "none" objects
last = {
"fire": [],
"none": [],
}
with open(filename) as f:
# Skip first line of input file
next(f)
for line in chain(f, trigger):
line = dict(zip(keys, line.split(",")))
last[line["object"]].append(line)
# Check threshold for "none" objects if there are previous unprinted "fire" objects
if line["object"] == "none" and last["fire"]:
if float(last["none"][-1]["endTime"]) - float(last["none"][0]["startTime"]) > threshold:
print("{},{},{},{},{},{},{},{}".format(
last["fire"][0]["frame_num"],
last["fire"][0]["uniqueId"],
last["fire"][0]["title"],
last["fire"][0]["startTime"],
last["fire"][-1]["endTime"],
last["fire"][0]["startTime_fmt"],
last["fire"][0]["object"],
sum([float(x["score"]) for x in last["fire"]]) / len(last["fire"]),
))
last["fire"] = []
# Previous "none" objects don't matter anymore as soon as a "fire" object is being encountered
if line["object"] == "fire":
last["none"] = []
正在逐行处理输入文件,"fire"
个对象正在 last["fire"]
中累积。如果
,它们将被合并并打印
last["none"]
中的 "none"
个对象达到 threshold
中定义的阈值
或由于手动链接 trigger
对象而到达输入文件末尾时,该对象是长度为 threshold + 1
的 "none"
对象,因此触发阈值和随后的合并和打印。
当然,您可以将 print
替换为写入输出文件的调用。
这与您正在寻找的很接近,可能是一个可以接受的替代方案。
如果您的采样率相当稳定(看起来大约为 0.12 秒或 50 赫兹),那么您可以找到您可以容忍的等效采样数 'none'
。假设是 8。
此代码将读入数据并用最后一个有效值中的最多 8 个值填充 'none' 值。
import numpy as np
import pandas as pd
def groups_of_true_values(x):
"""Returns array of integers where each True value in x
is replaced by the count of the group of consecutive
True values that it was found in.
"""
return (np.diff(np.concatenate(([0], np.array(x, dtype=int)))) == 1).cumsum()*x
df = pd.read_csv('test.csv', index_col=0)
# Forward-fill the 'none' values to a limit
df['filled'] = df['object'].replace('none', None).fillna(method='ffill', limit=8)
# Find the groups of consecutive fire values
df['group'] = groups_of_true_values(df['filled'] == 'fire')
# Produce sum of scores by group
group_scores = df[['group', 'score']].groupby('group').sum()
print(group_scores)
# Find firing start and stop times
df['start'] = ((df['filled'] == 'fire') & (df['filled'].shift(1) == 'none'))
df['stop'] = ((df['filled'] == 'none') & (df['filled'].shift(1) == 'fire'))
start_times = df.loc[df['start'], 'startTime'].to_list()
stop_times = df.loc[df['stop'], 'startTime'].to_list()
print(start_times, stop_times)
输出:
score
group
1 10.347362
[] []
希望如果有更长的不触发序列,输出会更有趣...
我的方法,使用pandas
和groupby
:
- 将同一对象(
fire
或none
)的连续线条组合成一个咒语
- 掉落none-持续时间小于1秒的火焰法术
- 将同一对象(
fire
或none
)的连续系列法术组合成一个超级法术,并计算相应的分数
我假设数据是按时间排序的(否则我们需要在读取数据后添加排序)。将同一对象的连续行组合成 spells/superspells 的技巧是:首先,确定新的 spell/superspell 开始的位置(即对象类型更改时),其次,为每个法术分配一个唯一的 id( = 之前的新法术数)
import pandas as pd
# preparing the test data
data = '''frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344'''
with open("a.txt", 'w') as f:
print(data, file=f)
df1 = pd.read_csv("a.txt")
# mark new spell (the start of a series of continuous lines of the same object)
# new spell if the current object is different from the previous object
df1['newspell'] = df1.object != df1.object.shift(1)
# give each spell a unique spell number (equal to the total number of new spell before it)
df1['spellnum'] = df1.newspell.cumsum()
# group lines from the same spell together
spells = df1.groupby(by=["uniqueId", "title", "spellnum", "object"]).agg(
first_frame = ('frame_num', 'min'),
last_frame = ('frame_num', 'max'),
startTime = ('startTime', 'min'),
endTime = ('endTime', 'max'),
totalScore = ('score', 'sum'),
cnt = ('score', 'count')).reset_index()
# remove none-fire spells with duration less than 1
spells = spells[(spells.object == 'fire') | (spells.endTime > spells.startTime + 1)]
# Now group conitnous fire spells into superspells
# mark new superspell
spells['newsuperspell'] = spells.object != spells.object.shift(1)
# give each superspell a unique number
spells['superspellnum'] = spells.newsuperspell.cumsum()
superspells = spells.groupby(by=["uniqueId", "title", "superspellnum", "object"]).agg(
first_frame = ('first_frame', 'min'),
last_frame = ('last_frame', 'max'),
startTime = ('startTime', 'min'),
endTime = ('endTime', 'max'),
totalScore = ('totalScore', 'sum'),
cnt = ('cnt', 'sum')).reset_index()
superspells['score'] = superspells.totalScore/superspells.cnt
superspells.drop(columns=['totalScore', 'cnt'], inplace=True)
print(superspells.to_csv(index=False))
# output
#uniqueId,title,superspellnum,object,first_frame,last_frame,startTime,endTime,score
#file1,name1,1,fire,10,23,30.6,32.2,0.8304779999999999
我有一个 csv 文件,它通过视频帧承载某些进程的输出。在文件中,每一行要么是 fire
要么是 none
。每行有 startTime
和 endTime
。现在我只需要从 continuous fires 中聚集并打印一个实例,其 start 和 end 时间.重点是中间的几个none
如果他们的时间在1秒以内也是可以容忍的。所以要明确一点,重点是将更近帧的检测聚集在一起……以某种方式平滑结果。一行 31-35
秒,而不是多个 31-32, 32-33, ...
。
怎么做?
例如,由于 none
间隔在 1s 以内,因此以下所有连续项都被视为单个项。所以我们会得到类似 1,file1,name1,30.6,32.2,fire,0.83
的结果,该分数是所有火线的平均值。
frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
...
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344
这是我目前的尝试:
with open(filename) as fin:
lastWasFire=False
for line in fin:
if "fire" in line:
if lastWasFire==False and line !="" and line.split(",")[5] != lastline.split(",")[5]:
fout.write(line)
else:
lastWasFire=False
lastline=line
我假设您不想使用外部库进行数据处理,例如 numpy
或 pandas
。以下代码应该与您的尝试非常相似:
threshold = 1.0
# We will chain a "none" object at the end which triggers the threshold to make sure no "fire" objects are left unprinted
from itertools import chain
trigger = (",,,0,{},,none,".format(threshold + 1),)
# Keys for columns of input data
keys = (
"frame_num",
"uniqueId",
"title",
"startTime",
"endTime",
"startTime_fmt",
"object",
"score",
)
# Store last "fire" or "none" objects
last = {
"fire": [],
"none": [],
}
with open(filename) as f:
# Skip first line of input file
next(f)
for line in chain(f, trigger):
line = dict(zip(keys, line.split(",")))
last[line["object"]].append(line)
# Check threshold for "none" objects if there are previous unprinted "fire" objects
if line["object"] == "none" and last["fire"]:
if float(last["none"][-1]["endTime"]) - float(last["none"][0]["startTime"]) > threshold:
print("{},{},{},{},{},{},{},{}".format(
last["fire"][0]["frame_num"],
last["fire"][0]["uniqueId"],
last["fire"][0]["title"],
last["fire"][0]["startTime"],
last["fire"][-1]["endTime"],
last["fire"][0]["startTime_fmt"],
last["fire"][0]["object"],
sum([float(x["score"]) for x in last["fire"]]) / len(last["fire"]),
))
last["fire"] = []
# Previous "none" objects don't matter anymore as soon as a "fire" object is being encountered
if line["object"] == "fire":
last["none"] = []
正在逐行处理输入文件,"fire"
个对象正在 last["fire"]
中累积。如果
last["none"]
中的"none"
个对象达到threshold
中定义的阈值
或由于手动链接
trigger
对象而到达输入文件末尾时,该对象是长度为threshold + 1
的"none"
对象,因此触发阈值和随后的合并和打印。
当然,您可以将 print
替换为写入输出文件的调用。
这与您正在寻找的很接近,可能是一个可以接受的替代方案。
如果您的采样率相当稳定(看起来大约为 0.12 秒或 50 赫兹),那么您可以找到您可以容忍的等效采样数 'none'
。假设是 8。
此代码将读入数据并用最后一个有效值中的最多 8 个值填充 'none' 值。
import numpy as np
import pandas as pd
def groups_of_true_values(x):
"""Returns array of integers where each True value in x
is replaced by the count of the group of consecutive
True values that it was found in.
"""
return (np.diff(np.concatenate(([0], np.array(x, dtype=int)))) == 1).cumsum()*x
df = pd.read_csv('test.csv', index_col=0)
# Forward-fill the 'none' values to a limit
df['filled'] = df['object'].replace('none', None).fillna(method='ffill', limit=8)
# Find the groups of consecutive fire values
df['group'] = groups_of_true_values(df['filled'] == 'fire')
# Produce sum of scores by group
group_scores = df[['group', 'score']].groupby('group').sum()
print(group_scores)
# Find firing start and stop times
df['start'] = ((df['filled'] == 'fire') & (df['filled'].shift(1) == 'none'))
df['stop'] = ((df['filled'] == 'none') & (df['filled'].shift(1) == 'fire'))
start_times = df.loc[df['start'], 'startTime'].to_list()
stop_times = df.loc[df['stop'], 'startTime'].to_list()
print(start_times, stop_times)
输出:
score
group
1 10.347362
[] []
希望如果有更长的不触发序列,输出会更有趣...
我的方法,使用pandas
和groupby
:
- 将同一对象(
fire
或none
)的连续线条组合成一个咒语 - 掉落none-持续时间小于1秒的火焰法术
- 将同一对象(
fire
或none
)的连续系列法术组合成一个超级法术,并计算相应的分数
我假设数据是按时间排序的(否则我们需要在读取数据后添加排序)。将同一对象的连续行组合成 spells/superspells 的技巧是:首先,确定新的 spell/superspell 开始的位置(即对象类型更改时),其次,为每个法术分配一个唯一的 id( = 之前的新法术数)
import pandas as pd
# preparing the test data
data = '''frame_num,uniqueId,title,startTime,endTime,startTime_fmt,object,score
10,file1,name1,30.6,30.64,0:00:30,fire,0.914617
11,file1,name1,30.72,30.76,0:00:30,none,0.68788
12,file1,name1,30.84,30.88,0:00:30,fire,0.993345
13,file1,name1,30.96,31,0:00:30,fire,0.991015
14,file1,name1,31.08,31.12,0:00:31,fire,0.983197
15,file1,name1,31.2,31.24,0:00:31,fire,0.979572
16,file1,name1,31.32,31.36,0:00:31,fire,0.985898
17,file1,name1,31.44,31.48,0:00:31,none,0.961606
18,file1,name1,31.56,31.6,0:00:31,none,0.685139
19,file1,name1,31.68,31.72,0:00:31,none,0.458374
20,file1,name1,31.8,31.84,0:00:31,none,0.413711
21,file1,name1,31.92,31.96,0:00:31,none,0.496828
22,file1,name1,32.04,32.08,0:00:32,fire,0.412836
23,file1,name1,32.16,32.2,0:00:32,fire,0.383344'''
with open("a.txt", 'w') as f:
print(data, file=f)
df1 = pd.read_csv("a.txt")
# mark new spell (the start of a series of continuous lines of the same object)
# new spell if the current object is different from the previous object
df1['newspell'] = df1.object != df1.object.shift(1)
# give each spell a unique spell number (equal to the total number of new spell before it)
df1['spellnum'] = df1.newspell.cumsum()
# group lines from the same spell together
spells = df1.groupby(by=["uniqueId", "title", "spellnum", "object"]).agg(
first_frame = ('frame_num', 'min'),
last_frame = ('frame_num', 'max'),
startTime = ('startTime', 'min'),
endTime = ('endTime', 'max'),
totalScore = ('score', 'sum'),
cnt = ('score', 'count')).reset_index()
# remove none-fire spells with duration less than 1
spells = spells[(spells.object == 'fire') | (spells.endTime > spells.startTime + 1)]
# Now group conitnous fire spells into superspells
# mark new superspell
spells['newsuperspell'] = spells.object != spells.object.shift(1)
# give each superspell a unique number
spells['superspellnum'] = spells.newsuperspell.cumsum()
superspells = spells.groupby(by=["uniqueId", "title", "superspellnum", "object"]).agg(
first_frame = ('first_frame', 'min'),
last_frame = ('last_frame', 'max'),
startTime = ('startTime', 'min'),
endTime = ('endTime', 'max'),
totalScore = ('totalScore', 'sum'),
cnt = ('cnt', 'sum')).reset_index()
superspells['score'] = superspells.totalScore/superspells.cnt
superspells.drop(columns=['totalScore', 'cnt'], inplace=True)
print(superspells.to_csv(index=False))
# output
#uniqueId,title,superspellnum,object,first_frame,last_frame,startTime,endTime,score
#file1,name1,1,fire,10,23,30.6,32.2,0.8304779999999999