Re-assign column values in a pandas df
This question relates to rostering or staffing. I'm trying to assign various jobs to individuals (employees). Using the df below,
`[Person]` = Individuals (employees)
`[Area]` and `[Place]` = unique jobs
`[On]` = How many unique jobs are occurring at each point in time
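So `[Area]` and `[Place]` together make up the unique values that represent different jobs. These values are assigned to individuals, with the overall aim of using as few individuals as possible. The maximum number of unique values assigned to any one individual is 3. `[On]` shows how many unique values of `[Place]` and `[Area]` are currently occurring, so it provides a concrete guide to how many individuals I need. For example,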
1-3 unique values occurring = 1 individual
4-6 unique values occurring = 2 individuals
7-9 unique values occurring = 3 individuals etc
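The mapping above amounts to a ceiling division by the per-person limit of 3. A minimal sketch, where the function name `people_needed` and its `per_person` parameter are made up for illustration and are not part of the original post:

```python
import math

def people_needed(n_unique, per_person=3):
    """Smallest number of individuals that can cover n_unique jobs."""
    return math.ceil(n_unique / per_person)

# Reproduce the 1-3 / 4-6 / 7-9 bands listed above
for n in (1, 3, 4, 6, 7, 9):
    print(n, "unique values ->", people_needed(n), "individual(s)")
```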
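Question:
The cases where the number of unique values in `[Area]` and `[Place]` is greater than 3 are causing me trouble. I can't simply do a `groupby` where I assign the first 3 unique values to individual 1, the next 3 unique values to individual 2, and so on. I want to group the unique values in `[Area]` and `[Place]` by `[Area]`. So I look to assign the same values in `[Area]` to one individual (up to 3). Then, if there are leftover values (<3), they should be combined to make up a group of 3 where possible.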
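The way I envisage this working is to look an hour into the future. For each new row, the script should see how many values will be `[On]` (this indicates how many individuals are needed). Where there are more than 3 unique values, they should be assigned by grouping the same values in `[Area]`. If there are leftover values, they should be combined in any way possible to make up a group of 3.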
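Putting this into a step-by-step process:
1) Use the `[On]` column to determine how many individuals are required by looking an hour into the future
2) Where more than 3 unique values are occurring, assign the same values in `[Area]` first.
3) If there are any leftover values, combine them in any way possible.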
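For the df below, there are 9 unique values of `[Place]` and `[Area]` occurring within an hour, so we should have 3 individuals assigned. When there are more than 3 unique values, they should be assigned by `[Area]`, checking whether the same value occurs again. The leftover values should be combined with other individuals that have fewer than 3 unique values.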
import pandas as pd
import numpy as np
d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
})
df = pd.DataFrame(data=d)
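As a quick check of the "9 unique values, so 3 individuals" claim, the distinct (`Area`, `Place`) pairs can be counted directly. This is only a sketch: it rebuilds the two relevant columns from the data above, and the names `n_unique` and `n_people` are illustrative.

```python
import math
import pandas as pd

df = pd.DataFrame({
    'Place': ['House 1', 'House 2', 'House 3', 'House 4', 'House 5',
              'House 1', 'House 2', 'House 3', 'House 2'],
    'Area': ['A', 'B', 'C', 'D', 'E', 'D', 'E', 'F', 'G'],
})

# A job is a distinct (Area, Place) pair
n_unique = len(df[['Area', 'Place']].drop_duplicates())

# At most 3 jobs per individual
n_people = math.ceil(n_unique / 3)

print(n_unique, n_people)
```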
Here is my attempt:
def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'
    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] +
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
                          leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df
df1 = (reduce_df(reduce_df(df)))
Here is the output:
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 4
4 8:35:00 House 5 E 5 Person 5
5 8:40:00 House 1 D 6 Person 4
6 8:42:00 House 2 E 7 Person 5
7 8:45:00 House 3 F 8 Person 5
8 8:50:00 House 2 G 9 Person 7
Here is my expected output:
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 8:40:00 House 6 D 6 Person 2
6 8:42:00 House 2 E 7 Person 3
7 8:45:00 House 3 F 8 Person 2
8 8:50:00 House 2 G 9 Person 3
Explanation of how this output is obtained:
Index 0: One `unique` value occurring. So `assign` to individual 1
Index 1: Two `unique` values occurring. So `assign` to individual 1
Index 2: Three `unique` values occurring. So `assign` to individual 1
Index 3: Four `unique` values on. So `assign` to individual 2
Index 4: Five `unique` values on. This one is a bit tricky and hard to conceptualise. But there is another `E` within an `hour`. So `assign` to a new individual so it can be combined with the other `E`
Index 5: Six `unique` values on. Should be `assigned` with the other `D`. So individual 2
Index 6: Seven `unique` values on. Should be `assigned` with other `E`. So individual 3
Index 7: Eight `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Index 8: Nine `unique` values on. New value in `[Area]`, which is a _leftover_. `Assign` to either individual 2 or 3
Example 2:
d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
})
df = pd.DataFrame(data=d)
I'm getting an error:
IndexError: index 1 is out of bounds for axis 1 with size 1
on this line:
df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
However, if I change Person to a repeating 1, 2, 3, it returns the following:
'Person' : ['Person 1','Person 2','Person 3','Person 1','Person 2','Person 3','Person 1','Person 2','Person 3'],
Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 2
4 8:35:00 House 2 X 3 Person 2
5 8:40:00 House 3 X 3 Person 2
6 8:42:00 House 1 X 3 Person 3
7 8:45:00 House 2 X 3 Person 3
8 8:50:00 House 3 X 3 Person 3
Expected output:
Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1
The main takeaway from Example 2 is:
1) There are <3 unique values on so assign to individual 1
Update
There's a live version of this answer online that you can try for yourself.
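Here is an answer in the form of the `allocatePeople` function. It is based on precomputing all of the indices at which an area repeats within an hour: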
from collections import Counter
import numpy as np
import pandas as pd

def getAssignedPeople(df, areasPerPerson):
    areas = df['Area'].values
    places = df['Place'].values
    times = pd.to_datetime(df['Time']).values
    maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
    assignmentCount = Counter()
    assignedPeople = []
    assignedPlaces = {}
    heldPeople = {}
    heldAreas = {}
    holdAvailable = True
    person = 0

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    for area,place,hold in zip(areas, places, holds):
        if (area, place) in assignedPlaces:
            # this unique (area, place) has already been assigned to someone
            assignedPeople.append(assignedPlaces[(area, place)])
            continue

        if assignmentCount[person] >= areasPerPerson:
            # the current person is already assigned to enough areas, move on to the next
            a = heldPeople.pop(person, None)
            heldAreas.pop(a, None)
            person += 1

        if area in heldAreas:
            # assign to the person held in this area
            p = heldAreas.pop(area)
            heldPeople.pop(p)
        else:
            # get the first non-held person. If we need to hold in this area,
            # also make sure the person has at least 2 free assignment slots,
            # though if it's the last person assign to them anyway
            p = person
            while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
                p += 1

        assignmentCount.update([p])
        assignedPlaces[(area, place)] = p
        assignedPeople.append(p)

        if hold:
            if p==maxPerson:
                # mark that there are no more people available to perform holds
                holdAvailable = False

            # this area recurs in an hour, mark that the person should be held here
            heldPeople[p] = area
            heldAreas[area] = p

    return assignedPeople

def allocatePeople(df, areasPerPerson=3):
    assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
    df = df.copy()
    df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
    return df
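The `ixrep`/`holds` precomputation is the densest part of the code above, so here it is run in isolation on the `Area` and `Time` columns of the first example. This is a sketch for illustration only: the intermediate names `same_area`, `later_repeat` and `has_repeat` are mine, and the explicit `format='%H:%M:%S'` is an assumption added to keep the parsing unambiguous.

```python
import numpy as np
import pandas as pd

areas = np.array(['A', 'B', 'C', 'D', 'E', 'D', 'E', 'F', 'G'])
times = pd.to_datetime(
    ['8:03:00', '8:17:00', '8:20:00', '8:28:00', '8:35:00',
     '08:40:00', '08:42:00', '08:45:00', '08:50:00'],
    format='%H:%M:%S',
).values

# same_area[i, j] is True when rows i and j share an area
same_area = areas.reshape(-1, 1) == areas

# keep only pairs with j > i, i.e. repeats that come later than row i
later_repeat = np.triu(same_area, k=1)

# argmax returns the first True in each row: the index of the next repeat.
# A row with no later repeat is all False and yields 0; column 0 can never
# be True above the diagonal, so 0 unambiguously means "no repeat".
ixrep = np.argmax(later_repeat, axis=1)

# flag rows whose next repeat happens less than an hour later
has_repeat = ixrep.nonzero()
holds = np.zeros(areas.size, dtype=bool)
holds[has_repeat] = (times[ixrep[has_repeat]] - times[has_repeat]) < np.timedelta64(1, 'h')

print(ixrep.tolist())
print(holds.tolist())
```

Only rows 3 (`D`) and 4 (`E`) are flagged here, because they are the only areas that come up again later, at rows 5 and 6.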
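Note the use of `df['Person'].unique()` in `allocatePeople`. This handles the case where people are repeated in the input. It is assumed that the order of the people in the input is the order in which they should be assigned.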
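To make the relabelling step concrete, here is a small sketch of how indexing the unique names with a list of integer positions works, and why it raises an `IndexError` when an assigned position is at or beyond the number of distinct names. The `assigned` list is simply the expected assignment for the first example written zero-based; `np.asarray` is an extra precaution of mine so the result is a plain NumPy array regardless of the pandas string dtype.

```python
import numpy as np
import pandas as pd

people = pd.Series(['Person 1', 'Person 2', 'Person 3', 'Person 4', 'Person 5',
                    'Person 4', 'Person 5', 'Person 6', 'Person 7'])

# unique() keeps the order of first appearance
uniq = np.asarray(people.unique())

# zero-based person index for each row
assigned = [0, 0, 0, 1, 2, 1, 2, 1, 2]
relabelled = uniq[assigned]
print(relabelled.tolist())

# With a single distinct name, any position other than 0 is out of bounds
single = np.asarray(pd.Series(['Person 1'] * 9).unique())
try:
    single[[0, 1]]
except IndexError as err:
    print('IndexError:', err)
```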
I tested `allocatePeople` against the OP's example inputs (`example1` and `example2`), and also against a few edge cases that I think(?) match the OP's desired algorithm:
ds = dict(
example1 = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 5','House 1','House 2','House 3','House 2'],
'Area' : ['A','B','C','D','E','D','E','F','G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 2','Person 3','Person 4','Person 5','Person 4','Person 5','Person 6','Person 7'],
}),
example2 = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','8:40:00','8:42:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 1','House 2','House 3'],
'Area' : ['X','X','X','X','X','X','X','X','X'],
'On' : ['1','2','3','3','3','3','3','3','3'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1'],
}),
long_repeats = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:25:00','8:30:00','8:31:00','8:35:00','8:45:00','8:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 3','House 2'],
'Area' : ['A','A','A','A','B','C','C','C','B'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 4','Person 4','Person 3'],
'On' : ['1','2','3','4','5','6','7','8','9'],
}),
many_repeats = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 2'],
'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'F'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
}),
large_gap = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','08:42:00','08:45:00','08:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 3'],
'Area' : ['A', 'B', 'C', 'D', 'E', 'F', 'D', 'D', 'D'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
}),
different_times = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:28:00','8:35:00','08:40:00','09:42:00','09:45:00','09:50:00'],
'Place' : ['House 1','House 2','House 3','House 4','House 1','House 1','House 2','House 1','House 1'],
'Area' : ['A', 'B', 'C', 'D', 'D', 'E', 'E', 'F', 'G'],
'On' : ['1','2','3','4','5','6','7','8','9'],
'Person' : ['Person 1','Person 1','Person 1','Person 2','Person 3','Person 4','Person 3','Person 5','Person 6'],
})
)
expectedPeoples = dict(
example1 = [1,1,1,2,3,2,3,2,3],
example2 = [1,1,1,1,1,1,1,1,1],
long_repeats = [1,1,1,2,2,3,3,3,2],
many_repeats = [1,1,1,2,2,3,3,2,3],
large_gap = [1,1,1,2,3,3,2,2,3],
different_times = [1,1,1,2,2,2,3,3,3],
)
for name,d in ds.items():
    df = pd.DataFrame(d)
    expected = ['Person %d' % i for i in expectedPeoples[name]]
    ap = allocatePeople(df)
    print(name, ap, sep='\n', end='\n\n')
    np.testing.assert_array_equal(ap['Person'], expected)
The `assert_array_equal` statements pass, and the output matches the OP's expected output:
example1
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 5 E 5 Person 3
5 08:40:00 House 1 D 6 Person 2
6 08:42:00 House 2 E 7 Person 3
7 08:45:00 House 3 F 8 Person 2
8 08:50:00 House 2 G 9 Person 3
example2
Time Place Area On Person
0 8:03:00 House 1 X 1 Person 1
1 8:17:00 House 2 X 2 Person 1
2 8:20:00 House 3 X 3 Person 1
3 8:28:00 House 1 X 3 Person 1
4 8:35:00 House 2 X 3 Person 1
5 8:40:00 House 3 X 3 Person 1
6 8:42:00 House 1 X 3 Person 1
7 8:45:00 House 2 X 3 Person 1
8 8:50:00 House 3 X 3 Person 1
The output for my own test cases also matches my expectations:
long_repeats
Time Place Area Person On
0 8:03:00 House 1 A Person 1 1
1 8:17:00 House 2 A Person 1 2
2 8:20:00 House 3 A Person 1 3
3 8:25:00 House 4 A Person 2 4
4 8:30:00 House 1 B Person 2 5
5 8:31:00 House 1 C Person 3 6
6 8:35:00 House 2 C Person 3 7
7 8:45:00 House 3 C Person 3 8
8 8:50:00 House 2 B Person 2 9
many_repeats
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 D 5 Person 2
5 08:40:00 House 1 E 6 Person 3
6 08:42:00 House 2 E 7 Person 3
7 08:45:00 House 1 F 8 Person 2
8 08:50:00 House 2 F 9 Person 3
large_gap
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 E 5 Person 3
5 08:40:00 House 1 F 6 Person 3
6 08:42:00 House 2 D 7 Person 2
7 08:45:00 House 1 D 8 Person 2
8 08:50:00 House 3 D 9 Person 3
different_times
Time Place Area On Person
0 8:03:00 House 1 A 1 Person 1
1 8:17:00 House 2 B 2 Person 1
2 8:20:00 House 3 C 3 Person 1
3 8:28:00 House 4 D 4 Person 2
4 8:35:00 House 1 D 5 Person 2
5 08:40:00 House 1 E 6 Person 2
6 09:42:00 House 2 E 7 Person 3
7 09:45:00 House 1 F 8 Person 3
8 09:50:00 House 1 G 9 Person 3
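Let me know if it does everything you need, or if it still needs some tweaks. I think everyone is eager to see you realize your vision.
OK, before we dig into the logic of the problem, it's worth doing some housekeeping to tidy up the data and get it into a more useful format: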
#Create table of unique people
unique_people = df[['Person']].drop_duplicates().sort_values(['Person']).reset_index(drop=True)
#Reformat time column
df['Time'] = pd.to_datetime(df['Time'])
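Now, to understand the logic of the problem, it helps to break it into stages. First, we want to create individual jobs (with job numbers) based on 'Area' and the time between rows. That is, rows in the same area within an hour of each other can share the same job number.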
#Assign jobs
df= df.sort_values(['Area','Time']).reset_index(drop=True)
df['Job no'] = 0
current_job = 1
df.loc[0,'Job no'] = current_job
rows = len(df)  # `rows` was not defined in the original snippet; assumed to be the row count
for i in range(rows-1):
    prev_row = df.loc[i]
    row = df.loc[i+1]
    # whole hours between consecutive rows (0 means less than an hour apart)
    time_diff = (row['Time'] - prev_row['Time']).seconds //3600
    if (row['Area'] == prev_row['Area']) & (time_diff == 0):
        pass
    else:
        current_job +=1
    df.loc[i+1,'Job no'] = current_job
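For comparison, the same job numbering can be sketched without an explicit loop. This is not the answer's code, just an assumed-equivalent vectorized variant: a row starts a new job when its area differs from the previous row's area, or when an hour or more separates it from the previous row. The helper names `new_area` and `long_gap` are illustrative, and the data is the first example from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['8:03:00', '8:17:00', '8:20:00', '8:28:00', '8:35:00',
             '08:40:00', '08:42:00', '08:45:00', '08:50:00'],
    'Area': ['A', 'B', 'C', 'D', 'E', 'D', 'E', 'F', 'G'],
})
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
df = df.sort_values(['Area', 'Time']).reset_index(drop=True)

# True where the area changes from the previous row (always True for row 0)
new_area = df['Area'] != df['Area'].shift()

# True where at least an hour separates a row from the previous one
long_gap = df['Time'].diff() >= pd.Timedelta(hours=1)

# every True starts a new job; the running total is the job number
df['Job no'] = (new_area | long_gap).cumsum()
print(df)
```

After sorting, the two `D` rows share job 4 and the two `E` rows share job 5, while every other row gets a job of its own.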
With this step done, assigning 'Persons' to the individual jobs is straightforward:
df= df.sort_values(['Job no']).reset_index(drop=True)
df['Person'] = ""
df_groups = df.groupby('Job no')
for group in df_groups:
    group_size = group[1].count()['Time']
    for person_idx in range(len(unique_people)):
        person = unique_people.loc[person_idx]['Person']
        person_count = df[df['Person']==person]['Person'].count()
        if group_size <= (3-person_count):
            idx = group[1].index.values
            df.loc[idx,'Person'] = person
            break
Finally,
df= df.sort_values(['Time']).reset_index(drop=True)
print(df)
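I've tried to code this in a way that is easy to follow, so there may well be efficiencies to be gained here. The aim, however, was to set out the logic used.
This code gives the expected results for both data sets, so I hope it answers your question.
While writing my other answer, I slowly came around to the idea that the OP's algorithm might be easier to implement with an approach that focuses on the jobs (which can differ) rather than on the people (who are all the same). Here is a solution that uses the job-centric approach: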
from collections import Counter
import numpy as np
import pandas as pd

def assignJob(job, assignedix, areasPerPerson):
    for i in range(len(assignedix)):
        if (areasPerPerson - len(assignedix[i])) >= len(job):
            assignedix[i].extend(job)
            return True
    else:
        return False

def allocatePeople(df, areasPerPerson=3):
    areas = df['Area'].values
    times = pd.to_datetime(df['Time']).values
    peopleUniq = df['Person'].unique()
    npeople = int(np.ceil(areas.size / float(areasPerPerson)))

    # search for repeated areas. Mark them if the next repeat occurs within an hour
    ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
    holds = np.zeros(areas.size, dtype=bool)
    holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')

    jobs =[]
    _jobdict = {}
    for i,(area,hold) in enumerate(zip(areas, holds)):
        if hold:
            _jobdict[area] = job = _jobdict.get(area, []) + [i]
            if len(job)==areasPerPerson:
                jobs.append(_jobdict.pop(area))
        elif area in _jobdict:
            jobs.append(_jobdict.pop(area) + [i])
        else:
            jobs.append([i])

    jobs.sort()
    assignedix = [[] for i in range(npeople)]
    for job in jobs:
        if not assignJob(job, assignedix, areasPerPerson):
            # break the job up and try again
            for subjob in ([sj] for sj in job):
                assignJob(subjob, assignedix, areasPerPerson)

    df = df.copy()
    for i,aix in enumerate(assignedix):
        df.loc[aix, 'Person'] = peopleUniq[i]
    return df
This version of `allocatePeople` has also been tested extensively and passes all of the same checks that I described in my other answer.
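It does have more looping than my other solution, so it is probably slightly less efficient (though that will only matter if your dataframe is very large, say 1e6 rows and up). On the other hand, it is somewhat shorter and, I think, more straightforward and easier to understand.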
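The packing step at the heart of the job-centric approach is essentially a first-fit placement of jobs into people with a fixed capacity. Here is a stripped-down sketch in plain Python; the function name `first_fit` is mine, and the `jobs` list is what I get by tracing the job-building loop by hand on the first example, so treat it as an assumption rather than verified output.

```python
def first_fit(jobs, n_people, capacity=3):
    """Place each job (a list of row indices) with the first person who has room."""
    slots = [[] for _ in range(n_people)]
    for job in jobs:
        for rows in slots:
            if capacity - len(rows) >= len(job):
                rows.extend(job)
                break
        else:
            # nobody can take the whole job: split it and place rows one by one
            for row in job:
                for rows in slots:
                    if len(rows) < capacity:
                        rows.append(row)
                        break
    return slots

# Rows 3 and 5 share area D, rows 4 and 6 share area E
jobs = [[0], [1], [2], [3, 5], [4, 6], [7], [8]]
slots = first_fit(jobs, n_people=3)
print(slots)
```

The `for ... else` construct mirrors `assignJob`: the `else` branch runs only when the loop finishes without hitting `break`, that is, when no single person has enough free slots for the whole job.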