Reducing execution time for a SUMIFS equivalent
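I am trying to reproduce the Excel function SUMIFS, which is roughly: accumulation1 = SUMIFS(value; $fin$1:$fin$5; ini$1)
What the formula does: it searches the fin column for entries that match a given ini and sums the corresponding values.
Example for id 3 and accumulation1: the rows whose end point (fin) equals its ini = 11 are id 1 and id 5, so the result is (3+5) = 8.
Then I create a new accumulation column and start the same calculation again (I have to do this 1004 times...).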
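id | ini | fin | value | accumulation1 | accumulation2 | sumOfAccumulation |
---|---|---|---|---|---|---|
1 | 10 | 11 | 5 | 0 | 0 | 5 |
2 | 9 | 10 | 0 | 0 | 0 | 0 |
3 | 11 | 12 | 2 | 8 | 0 | 10 |
4 | 12 | 13 | 1 | 2 | 8 | 11 |
5 | 05 | 11 | 3 | 0 | 0 | 3 |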
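To make the formula concrete, here is a minimal sketch of one SUMIFS-style pass in plain Python, applied twice to the sample rows. The helper name `sumifs_pass` and the tuple layout are assumptions made for illustration only, and the ini written as 05 is read as the integer 5.

```python
from collections import defaultdict

# Sample rows copied from the table: (id, ini, fin, value).
# Assumption: the ini shown as "05" is the integer 5.
rows = [
    (1, 10, 11, 5),
    (2, 9, 10, 0),
    (3, 11, 12, 2),
    (4, 12, 13, 1),
    (5, 5, 11, 3),
]


def sumifs_pass(rows, values):
    """For each row, sum `values` over the rows whose fin equals this row's ini."""
    # First total the values per fin, so each row needs only one lookup.
    sum_by_fin = defaultdict(int)
    for (_, _, fin, _), v in zip(rows, values):
        sum_by_fin[fin] += v
    # An ini that never appears as a fin gets 0, like SUMIFS with no match.
    return [sum_by_fin.get(ini, 0) for (_, ini, _, _) in rows]


values = [r[3] for r in rows]
accumulation1 = sumifs_pass(rows, values)
accumulation2 = sumifs_pass(rows, accumulation1)
print(accumulation1)  # [0, 0, 8, 2, 0]
print(accumulation2)  # [0, 0, 0, 8, 0]
```

Each pass feeds the previous accumulation column back in as the values to sum, which is how accumulation2 is obtained from accumulation1 in the table.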
This is the accumulation code I have at the moment:
import time
import pandas.io.sql as pdsql  # assumed to be what the pdsql alias refers to
import psycopg2
import psycopg2.extras

start_time = time.time()
# DB_NAME, DB_USER, DB_PWD, DB_HOST and DB_PORT are defined elsewhere
connection = psycopg2.connect(dbname=DB_NAME, user=DB_USER, password=DB_PWD, host=DB_HOST, port=DB_PORT)
cursor = connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
data = pdsql.read_sql_query("select id_bdcarth, id_nd_ini::int ini, id_nd_fin::int fin, v from tempturbi.tmp_somme_v19", connection)
Endtest = 1
# loop until Endtest = 0:
#     create a new column accumulation
for i in data.ini:
    acc = data.v.loc[data.fin == i]  # get values of the upstream segments
    acc = sum(acc)
    # save acc in accumulation (not done yet)
# Endtest = sum of the new accumulation column (not done yet)
print("--- %s seconds ---" % (time.time() - start_time))
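Even without saving the results of the calculation, the script takes 129 seconds to run, which is much slower than Excel. Is there any way to improve the script and make it faster?
What I want to do is walk along the river network and accumulate the values.
So I made some modifications: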
z = 0
loop = [0, 1, 2]
# while total != 0:
for total in loop:
    z = z + 1
    acc = 'acc' + str(z)
    # for each i in ini
    for i in data.ini:
        v = data.iloc[:, -1:]  # get last column
        val = data.v.loc[data.fin == i]
        val = sum(val)
        # create the column and store the value
        data[acc] = val
    print(data[acc].sum())
    total = total + 1
print(data)
print("--- %s seconds ---" % (time.time() - start_time))
(This does not affect the execution time.)
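Since the goal is to walk along the river network, one alternative to repeating the SUMIFS pass is to compute each segment's total directly from its upstream segments. This is a different technique from the one in the post, sketched here under the assumption that the network contains no cycles (with a cycle the loop below would never finish). The function name `total_upstream` and the tuple layout are hypothetical. An explicit stack is used instead of recursion because a chain of about a thousand segments could exceed Python's default recursion limit.

```python
from collections import defaultdict


def total_upstream(segments):
    """segments: list of (id, ini, fin, value) tuples.

    Returns {id: own value plus the totals of every segment ending at its ini}.
    Assumes the network has no cycles.
    """
    # Index the segments by the node they flow into (their fin).
    ending_at = defaultdict(list)
    for seg_id, _, fin, _ in segments:
        ending_at[fin].append(seg_id)
    by_id = {seg[0]: seg for seg in segments}

    totals = {}
    for start_id in by_id:
        stack = [start_id]
        while stack:
            seg_id = stack[-1]
            if seg_id in totals:
                stack.pop()
                continue
            _, ini, _, value = by_id[seg_id]
            upstream = ending_at.get(ini, [])
            pending = [u for u in upstream if u not in totals]
            if pending:
                # Resolve the upstream segments first, then come back.
                stack.extend(pending)
            else:
                totals[seg_id] = value + sum(totals[u] for u in upstream)
                stack.pop()
    return totals


sample = [(1, 10, 11, 5), (2, 9, 10, 0), (3, 11, 12, 2), (4, 12, 13, 1), (5, 5, 11, 3)]
print(total_upstream(sample))
```

On the sample rows this gives the same numbers as the sumOfAccumulation column (5, 0, 10, 11, 3 for ids 1 to 5), and each segment is resolved only once instead of once per pass.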
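Thanks again for clarifying your question. I think I understand now, and this approach matches the output you showed. Please let me know if I have misunderstood, and whether this is any faster than your method. I don't know whether it will be.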
import pandas as pd

#Create the test data
df = pd.DataFrame({
    'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'ini': {0: 10, 1: 9, 2: 11, 3: 12, 4: 5},
    'fin': {0: 11, 1: 10, 2: 12, 3: 13, 4: 11},
    'value': {0: 5, 1: 0, 2: 2, 3: 1, 4: 3},
})

#Setup initial values
curr_value_col = 'value'
i = 0
all_value_cols = []

#The groupings stay the same throughout the loops
#so we can just group once and reuse it for speed benefit
gb = df.groupby('fin')

#Loop forever until we break
while True:
    #update the loop number and add to the value col list
    i += 1
    all_value_cols.append(curr_value_col)

    #group by fin and sum the value_col values
    fin_cumsum = gb[curr_value_col].sum()

    #map the sums to the new column
    next_val_col = 'accumulation{}'.format(i)
    df[next_val_col] = df['ini'].map(fin_cumsum).fillna(0).astype(int)

    #If the new column we added sums to 0, then quit
    #(I think this is what you were saying you wanted, but I'm not sure)
    curr_value_col = next_val_col
    if df[curr_value_col].sum() == 0:
        break

#Get the cumulative sum from the list of columns we've been saving
df['sumOfAccumulation'] = df[all_value_cols].sum(axis=1)
df
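If the intermediate accumulation columns are not needed, the same iteration can be sketched as a function that keeps only a running total. This is a variation on the loop above, not a tested replacement for it: the name `accumulate_upstream` and the `max_iter` safety limit are assumptions added here, and the stopping test (the new pass sums to 0) assumes non-negative values, as in the loop above.

```python
import pandas as pd


def accumulate_upstream(df, max_iter=10_000):
    """Return value + accumulation1 + accumulation2 + ... as one Series.

    Expects columns 'ini', 'fin' and 'value'. max_iter is a hypothetical
    guard so that a network with a cycle raises instead of looping forever.
    """
    current = df['value']
    total = current.copy()
    for _ in range(max_iter):
        # Sum the current values per end node, then look up each start node.
        sums_by_fin = current.groupby(df['fin']).sum()
        current = df['ini'].map(sums_by_fin).fillna(0)
        if current.sum() == 0:
            return total
        total = total + current
    raise RuntimeError("did not converge; the network may contain a cycle")


df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'ini': [10, 9, 11, 12, 5],
    'fin': [11, 10, 12, 13, 11],
    'value': [5, 0, 2, 1, 3],
})
df['sumOfAccumulation'] = accumulate_upstream(df)
print(df)
```

Avoiding around a thousand extra columns keeps the DataFrame small, at the cost of losing the per-pass values. The `fillna(0)` leaves the result as floats when an ini has no matching fin, so add a cast if integers are required.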