Efficiently label points inside a bounding box
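I have a problem I am trying to solve. I already have code that works, but it is very inefficient given the amount of data I need to process. Here is a description of what I am trying to do:
I have a dataframe containing bounding boxes around products on a shelf. Each row holds the limits of the bounding box, the camera that took the picture, the date and hour the picture was taken, and the bounding box center that I computed. One piece of information is missing: which product it is (no ID, no barcode).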
index boundingX0 boundingX1 boundingY0 boundingY1 cameraId \
0 0 3167.0 3276.0 2532.0 2662.0 Z4301160003414164
1 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164
2 2 3278.0 3387.0 2532.0 2663.0 Z4301160003414164
3 3 1264.0 1373.0 946.0 1097.0 Z4301160003414164
4 4 1909.0 2002.0 1983.0 2151.0 Z4301160003414164
5 5 1722.0 1808.0 1982.0 2150.0 Z4301160003414164
6 6 3163.0 3281.0 2301.0 2460.0 Z4301160003414164
7 7 2359.0 2469.0 2512.0 2629.0 Z4301160003414164
8 8 1381.0 1496.0 947.0 1097.0 Z4301160003414164
9 9 1053.0 1172.0 1958.0 2146.0 Z4301160003414164
filename Date Hour facing_center_x facing_center_y
0 A 2022-05-17 13 3221.5 2597.0
1 A 2022-05-17 13 1859.0 2068.5
2 A 2022-05-17 13 3332.5 2597.5
3 A 2022-05-17 13 1318.5 1021.5
4 A 2022-05-17 13 1955.5 2067.0
5 A 2022-05-17 13 1765.0 2066.0
6 A 2022-05-17 13 3222.0 2380.5
7 A 2022-05-17 13 2414.0 2570.5
8 A 2022-05-17 13 1438.5 1022.0
9 A 2022-05-17 13 1112.5 2052.0
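I also have a second dataframe containing the bounding boxes of the whole area where each product should be, together with all the necessary product information (id, barcode) and the camera, date, hour and so on.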
index Date cameraId filename itemId \
0 0 2022-05-17 Z4301160003414164 A 5.903282e+07
1 1 2022-05-17 Z4301160003414164 A 5.903282e+07
2 2 2022-05-17 Z4301160003414164 A 8.013546e+07
3 3 2022-05-17 Z4301160003414164 A 8.013546e+07
4 4 2022-05-17 Z4301160003414164 A 3.760011e+10
5 5 2022-05-17 Z4301160003414164 A 3.760011e+10
6 6 2022-05-17 Z4301160003414164 A 3.017620e+12
7 7 2022-05-17 Z4301160003414164 A 3.017620e+12
8 8 2022-05-17 Z4301160003414164 A 3.017761e+12
9 9 2022-05-17 Z4301160003414164 A 3.088541e+12
barcode x y boundingX0 boundingX1 boundingY0 \
0 N4131466489013277 2117.0 1828.0 2117.0 3232.0 1540.0
1 N4131466408713275 3233.0 1832.0 3233.0 3995.0 1540.0
2 N4131466510613278 2905.0 1099.0 2905.0 4055.0 846.0
3 N4131465123513276 2921.0 757.0 2921.0 4145.0 457.0
4 N4131466272113278 1684.0 760.0 1684.0 2920.0 460.0
5 N4131465122713277 1212.0 761.0 1212.0 1683.0 461.0
6 N4131465130213271 2127.0 1461.0 2127.0 4013.0 1185.0
7 N4131466226313279 2122.0 2158.0 2122.0 3981.0 1900.0
8 N4141461925413272 4254.0 3081.0 4254.0 4598.0 2769.0
9 N4131465932913278 1323.0 1817.0 1323.0 1478.0 1539.0
boundingY1 Hour
0 1828.0 11
1 1832.0 11
2 1099.0 11
3 757.0 11
4 760.0 11
5 761.0 11
6 1461.0 11
7 2158.0 11
8 3081.0 11
9 1817.0 11
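What I want to do is locate the bounding box centers from the facing dataframe inside the product-area bounding boxes from the label dataframe. If a center falls inside a given box, the barcode of that box should be attached to the corresponding row in facing.
This is what I have so far: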
facing_index = list(set(facing.index))
label_index = list(set(label.index))
LABEL =[]
for i in range(len(label_index)):
    f = label[label.index == i]
    cameraId = f.cameraId.iloc[0]
    date = f.Date.iloc[0]
    hour = f.Hour.iloc[0]
    for j in range(len(facing_index)):
        g = facing[(facing['cameraId']==cameraId) & (facing['Date']==date) & (facing['Hour']==hour)]
        points = [(g['facing_center_x'], g['facing_center_y'])]
        pts = np.array(points)
        ll = np.array([f['boundingX0'], f['boundingY0']]) # lower-left
        ur = np.array([f['boundingX1'], f['boundingY1']]) # upper-right
        inidx = np.all(np.logical_and(ll <= pts, pts <= ur), axis=1)
        inbox = pts[inidx]
        outbox = pts[np.logical_not(inidx)]
        if len(inbox)>0:
            g['barcode']=f.barcode
        else:
            0
        LABEL.append(g)
LABEL = pd.concat(LABEL)
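The problem is that this takes a very long time, because label has more than 125,000 rows and facing has more than 400,000 rows.
Another approach I tried was to define a function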
def BoundingBoxContains(rectangle,point):
    logic = rectangle[0] < point[0] < rectangle[0]+rectangle[2] and rectangle[1] < point[1] < rectangle[1]+rectangle[3]
    return logic
that checks whether a point lies inside a rectangle. Then:
LABEL =[]
for i in range(len(label_index)):
    f = label[label.index == i]
    BoundingBox = (f.boundingX0[i],f.boundingX1[i],f.boundingY0[i],f.boundingY1[i])
    f = f.reset_index()
    date = f.Date.iloc[0]
    filename = f.filename.iloc[0]
    for j in range(len(facing_index)):
        g = facing[(facing['Date']==date) & (facing['filename']==filename)].reset_index()
        K = len(g)
        for k in range(K):
            gk = g[g.index==k]
            facingCenter = (gk['facing_center_x'][k], gk['facing_center_y'][k])
            a = BoundingBoxContains(BoundingBox, facingCenter)
            if a == True:
                gk['barcode'] = f.barcode
            else:
                0
            LABEL.append(gk)
which gives:
level_0 index boundingX0 boundingX1 boundingY0 boundingY1 \
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
0 0 0 3167.0 3276.0 2532.0 2662.0
cameraId filename Date Hour facing_center_x \
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
0 Z4301160003414164 A 2022-05-17 13 3221.5
facing_center_y barcode
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
0 2597.0 N4131465122613276
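I have not found a more efficient way to do this, so any insight would be greatly appreciated.
IIUC, I think you should first merge the facing and label dataframes on the Date, Hour and cameraId columns, then apply your BoundingBoxContains function.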
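If you have enough memory, you can use merge without any special care. The apply part is just a loop over every row, and it is the part that multiprocessing can really speed up. If the first part works, I can suggest an MP implementation with multiprocessing.Pool.
For now, the code: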
def BoundingBoxContains(rectangle, point):
    logic = rectangle[0] < point[0] < rectangle[0]+rectangle[2] and rectangle[1] < point[1] < rectangle[1]+rectangle[3]
    return logic

bbox_contains = lambda x: BoundingBoxContains((x.boundingX0, x.boundingX1, x.boundingY0, x.boundingY1),
                                              (x.facing_center_x, x.facing_center_y))

cols = ['Date', 'Hour', 'cameraId', 'barcode']
out = facing.merge(label[cols], on=cols[:-1])
out = out.loc[out.apply(bbox_contains, axis=1)]
Note: I had to modify Hour (13 -> 11) in the facing sample so that the keys match.
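When a merge on these keys comes back empty or smaller than expected, as happened with the Hour values here, one way to see which key combinations exist on only one side is an outer merge with indicator=True. This is a small diagnostic sketch on made-up rows, not part of the original answer; the frame contents and the names check and unmatched are hypothetical.

```python
import pandas as pd

keys = ["cameraId", "Date", "Hour"]

# Made-up key columns that reproduce the 13 vs 11 mismatch.
facing = pd.DataFrame({"cameraId": ["cam1"] * 3,
                       "Date": ["2022-05-17"] * 3,
                       "Hour": [13, 13, 13]})
label = pd.DataFrame({"cameraId": ["cam1"] * 2,
                      "Date": ["2022-05-17"] * 2,
                      "Hour": [11, 11]})

# Compare the distinct key combinations of both frames.
check = (facing[keys].drop_duplicates()
         .merge(label[keys].drop_duplicates(),
                on=keys, how="outer", indicator=True))

# Rows flagged left_only or right_only have no partner on the other side.
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

If unmatched is empty, every key combination is present in both frames and the merge itself is not what is dropping rows.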
Can you explain these two lines?
facing_index = list(set(facing.index))
label_index = list(set(label.index))
Output:
>>> out.drop_duplicates(cols) # if you want to keep only one instance per cols
index boundingX0 boundingX1 boundingY0 boundingY1 cameraId filename Date Hour facing_center_x facing_center_y barcode
10 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131466489013277
11 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131466408713275
12 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131466510613278
13 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131465123513276
14 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131466272113278
15 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131465122713277
16 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131465130213271
17 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131466226313279
18 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4141461925413272
19 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164 A 2022-05-17 11 1859.0 2068.5 N4131465932913278
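A sketch of the same merge-then-filter idea without apply, under two assumptions that the post does not confirm. First, the four bounding columns are taken to be min/max corners (in the label sample x equals boundingX0 and y equals boundingY1, which suggests this) rather than x, y, width and height. Second, the boxes to test against are the label boxes, so they are carried through the merge under a label_ prefix to avoid colliding with the facing columns of the same name. The toy values and the label_ prefix are made up for illustration.

```python
import pandas as pd

keys = ["cameraId", "Date", "Hour"]
box = ["boundingX0", "boundingX1", "boundingY0", "boundingY1"]

# Toy frames with made-up values; the real ones come from the post.
facing = pd.DataFrame({
    "cameraId": ["cam1"] * 3,
    "Date": ["2022-05-17"] * 3,
    "Hour": [11] * 3,
    "boundingX0": [140.0, 440.0, 890.0],
    "boundingX1": [160.0, 460.0, 910.0],
    "boundingY0": [140.0, 240.0, 890.0],
    "boundingY1": [160.0, 260.0, 910.0],
    "facing_center_x": [150.0, 450.0, 900.0],
    "facing_center_y": [150.0, 250.0, 900.0],
})
label = pd.DataFrame({
    "cameraId": ["cam1"] * 2,
    "Date": ["2022-05-17"] * 2,
    "Hour": [11] * 2,
    "barcode": ["B1", "B2"],
    "boundingX0": [100.0, 400.0],
    "boundingX1": [200.0, 500.0],
    "boundingY0": [100.0, 200.0],
    "boundingY1": [200.0, 300.0],
})

# Rename the label boxes so they survive the merge next to the facing boxes.
label_boxes = label[keys + ["barcode"] + box].rename(
    columns={c: "label_" + c for c in box})

# Every facing paired with every label of the same camera, date and hour.
pairs = facing.merge(label_boxes, on=keys)

# Containment test on whole columns at once (corner semantics assumed).
inside = (
    pairs["facing_center_x"].between(pairs["label_boundingX0"],
                                     pairs["label_boundingX1"])
    & pairs["facing_center_y"].between(pairs["label_boundingY0"],
                                       pairs["label_boundingY1"])
)
matched = pairs.loc[inside, list(facing.columns) + ["barcode"]]
print(matched)
```

Series.between is inclusive on both ends by default, so centers lying exactly on a box edge count as inside. The merged pairs frame still holds the full cross product per key group, so memory remains the limiting factor on the real data.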
Update 1
Do you have enough memory on GCP to create this dataframe?
# np.union1d also works when the two frames have different numbers of cameras
cam_cat = pd.CategoricalDtype(np.union1d(facing['cameraId'].unique(),
                                         label['cameraId'].unique()))

df1 = pd.DataFrame({
    'DateTime': pd.to_datetime(facing['Date'] + ' ' + facing['Hour'].astype(str)),
    'cameraId': facing['cameraId'].astype(cam_cat),
    'facing': facing['index']
})

df2 = pd.DataFrame({
    'DateTime': pd.to_datetime(label['Date'] + ' ' + label['Hour'].astype(str)),
    'cameraId': label['cameraId'].astype(cam_cat),
    'label': label['index']
})

# Lightweight merge to use with multiprocessing
dfm = df1.merge(df2, on=['DateTime', 'cameraId'])
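To get a feel for why cameraId is converted to a categorical before the merge, the sketch below compares the memory footprint of the same column stored as Python strings and as a category. The camera ids are hypothetical and the row count is arbitrary; exact byte counts depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical camera ids repeated many times, as in a large facing table.
rng = np.random.default_rng(0)
ids = pd.Series(rng.choice(
    ["CAM-0000000000001", "CAM-0000000000002", "CAM-0000000000003"],
    size=100_000))

as_object = ids.astype(object)        # one Python string per row
as_category = ids.astype("category")  # small integer code per row

obj_bytes = as_object.memory_usage(deep=True)
cat_bytes = as_category.memory_usage(deep=True)
print(obj_bytes, cat_bytes)
```

With only a handful of distinct cameras, each row of the categorical column stores a small integer code instead of a full string, which matters once the merge multiplies the number of rows.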
Update 2
Before using multiprocessing, can you check the output of dfm after this two-pass filtering:
import pandas as pd
import numpy as np

# Vectorized function
def BoundingBoxContains(df):
    m1 = df['facing_center_x'].between(df['boundingX0'], df['boundingX0'] + df['boundingY0'])
    m2 = df['facing_center_y'].between(df['boundingX1'], df['boundingX1'] + df['boundingY1'])
    return m1 & m2

# Your load routine
facing = pd.read_csv('facing.csv')
label = pd.read_csv('label.csv')

# Create a category dtype from cameraId to reduce memory footprint
# (np.union1d also works when the two frames have different numbers of cameras)
cam_cat = pd.CategoricalDtype(np.union1d(facing['cameraId'].unique(),
                                         label['cameraId'].unique()))

# Extract real index (not 'index' column) from each dataframe
df1 = pd.DataFrame({
    'DateTime': pd.to_datetime(facing['Date'] + ' ' + facing['Hour'].astype(str)),
    'cameraId': facing['cameraId'].astype(cam_cat),
    'facing': facing.index
})

df2 = pd.DataFrame({
    'DateTime': pd.to_datetime(label['Date'] + ' ' + label['Hour'].astype(str)),
    'cameraId': label['cameraId'].astype(cam_cat),
    'label': label.index
})

# 1st pass: lookup on DateTime and cameraId to keep only possible matches
# Cross product of facing / label with valid DateTime / cameraId
dfm = df1.merge(df2, on=['DateTime', 'cameraId'])

CHUNKSIZE = 10  # Chunk size
facing_cols = ['facing_center_x', 'facing_center_y']
label_cols = ['boundingX0', 'boundingX1', 'boundingY0', 'boundingY1', 'barcode']

# 2nd pass: match facing coords on bounding box
# Filter out the dataframe
mask = []
for i in range(0, len(dfm), CHUNKSIZE):
    F = facing.loc[dfm.iloc[i:i+CHUNKSIZE]['facing'], facing_cols].reset_index(drop=True)
    L = label.loc[dfm.iloc[i:i+CHUNKSIZE]['label'], label_cols].reset_index(drop=True)
    mask.append(BoundingBoxContains(pd.concat([F, L], axis=1)))

dfm = dfm.loc[pd.concat(mask, ignore_index=True)]
Output:
>>> dfm
DateTime cameraId facing label
19 2022-05-17 11:00:00 Z4301160003414164 1 9
49 2022-05-17 11:00:00 Z4301160003414164 4 9
59 2022-05-17 11:00:00 Z4301160003414164 5 9
79 2022-05-17 11:00:00 Z4301160003414164 7 9
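The answer mentions multiprocessing.Pool but does not show it. A minimal sketch of how the second pass could be spread over worker processes is given below. It assumes the candidate pairs have already been assembled into one frame holding the facing center and the label box columns, and it assumes min/max corner semantics for the box. The function names contains_chunk, split_chunks and parallel_mask are made up for this example.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def contains_chunk(chunk: pd.DataFrame) -> pd.Series:
    """Boolean mask for one chunk of candidate (facing, label) pairs.

    Assumes boundingX0/X1 and boundingY0/Y1 are min/max corners of the
    label box; swap in whichever containment test applies to your data.
    """
    in_x = chunk["facing_center_x"].between(chunk["boundingX0"], chunk["boundingX1"])
    in_y = chunk["facing_center_y"].between(chunk["boundingY0"], chunk["boundingY1"])
    return in_x & in_y


def split_chunks(df: pd.DataFrame, n_chunks: int) -> list:
    """Split df into at most n_chunks contiguous, non-empty row blocks."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


def parallel_mask(df: pd.DataFrame, workers: int = 2) -> pd.Series:
    """Evaluate contains_chunk on each chunk in a separate worker process."""
    with mp.Pool(workers) as pool:
        parts = pool.map(contains_chunk, split_chunks(df, workers))
    # pool.map returns results in input order, so the mask lines up with df.
    return pd.concat(parts)
```

On the real data this would be called as parallel_mask(pairs, workers=4) from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn new interpreters instead of forking. Each chunk is pickled and sent to a worker, so whether this beats a single vectorized pass depends on how that transfer cost compares with the comparison work itself.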
Update 3
The final step is to rebuild the dataframe from the facing and label columns of dfm:
out = facing.loc[dfm['facing']].assign(barcode=label.loc[dfm['label'], 'barcode'].values)
print(out)
# Output
index boundingX0 boundingX1 boundingY0 boundingY1 cameraId \
1 1 1812.0 1906.0 1985.0 2152.0 Z4301160003414164
4 4 1909.0 2002.0 1983.0 2151.0 Z4301160003414164
5 5 1722.0 1808.0 1982.0 2150.0 Z4301160003414164
7 7 2359.0 2469.0 2512.0 2629.0 Z4301160003414164
filename Date Hour facing_center_x facing_center_y \
1 A 2022-05-17 11 1859.0 2068.5
4 A 2022-05-17 11 1955.5 2067.0
5 A 2022-05-17 11 1765.0 2066.0
7 A 2022-05-17 11 2414.0 2570.5
barcode
1 N4131465932913278
4 N4131465932913278
5 N4131465932913278
7 N4131465932913278
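The reconstruction above keeps only the facings that matched a box. If unmatched facings should stay in the result with an empty barcode, or if a facing can fall inside several boxes, one possible variant is to reduce the pairs to one barcode per facing and align that back on the facing index. This is an assumption about the desired result, since the post does not say how such cases should be handled; the toy frames below use made-up values.

```python
import pandas as pd

# Toy stand-ins for the frames in the post (made-up values).
facing = pd.DataFrame({"facing_center_x": [10.0, 20.0, 30.0, 40.0],
                       "facing_center_y": [1.0, 2.0, 3.0, 4.0]})
label = pd.DataFrame({"barcode": ["B0", "B1"]})

# Surviving (facing, label) index pairs; facing 1 falls inside two boxes.
dfm = pd.DataFrame({"facing": [1, 1, 3], "label": [0, 1, 0]})

# Look up the barcode of each surviving pair.
matches = dfm.assign(barcode=label.loc[dfm["label"], "barcode"].to_numpy())

# Keep the first barcode per facing, then align on the facing index so
# that facings without any match get a missing value.
first = matches.drop_duplicates("facing").set_index("facing")["barcode"]
out = facing.assign(barcode=first.reindex(facing.index))
print(out)
```

Keeping the first match is an arbitrary tie-break; a different rule, such as the smallest enclosing box, would need the box sizes carried along in matches.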