Excel/Python: How to group names alphabetically, section-wise, such that each group has an equal number of names per letter?

I have a dummy Excel sheet of 879 students at a university that I am using for some data analysis in Python. The sheet has several columns, for example:

  1. Student Name
  2. Section
  3. Attendance
  4. Enrollment Number

However, the student names are not properly arranged alphabetically.

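I want to distribute the students evenly section-wise, so that every section has an equal number of students and every section has students whose names start with each letter.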

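I tried sorting the data alphabetically in Excel, but that sorts the whole column alphabetically, which is not what I want. Instead, I want the data arranged section-wise, so that each section has students whose names start with each letter, or none if all the names with that particular letter have already been used up in earlier sections. All sections should have an equal (or almost equal) number of students.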

For example:
The dataset has 12 sections with 879 students:
Section A has 74 students
Section B has 74 students
Section C has 74 students
Section D has 73 students
Section E has 73 students
Section F has 73 students
Section G has 73 students
Section H has 73 students
Section I has 73 students
Section J has 73 students
Section K has 73 students
Section L has 73 students

Number of students having first character A is 89
Number of students having first character B is 47
Number of students having first character C is 7
    :         :            :              :
    :         :            :              :  
Number of students having first character Y is 1
Number of students having first character Z is 2

My goal:

Section A will have: 
 - 7 students whose names start with A (89/12 ≈ 7.4, rounded down to 7)
 - 3 students whose names start with B (47/12 ≈ 3.9, rounded down to 3)
 - 1 student whose name starts with C (since 7 < 12, C cannot appear in every section, so 1)
    :         :            :             
    :         :            :                

 - 1 student whose name start with Y   
 - 1 student whose name start with Z

Section B will have: 
- 7 students whose name start with A
  :    :      :     :     :    :
  :    :      :     :     :    :
- 0 students whose names start with Y  (as no student with initial Y is left)
- 1 student whose name start with Z

Similarly, the other sections would have the same kind of distribution.

Is there a way to achieve this using Excel or a query with the Python pandas library?
Here is my Excel Sheet

This can be achieved using the linspace function from numpy. It is not the most elegant solution, but it works.

import numpy as np
import pandas as pd

# Read the CSV file
df = pd.read_csv('Student_data.csv', header=2)[1:]

# Keep only first name initial
df['Name_initial'] = df['Student Name'].str[:1]

# Get a list of all sections
section_names = df['Section'].unique()
# Get a list of all initials
alphabets = df['Name_initial'].unique()
# Create a dictionary with initials as keys and their total counts as values
# (value_counts().to_dict() avoids relying on the column names produced by reset_index(),
# which differ between pandas versions)
alphabet_counts = df['Name_initial'].value_counts().to_dict()

# Create a dictionary that contains 0 students for each alphabet for each section
final_sections = {sec: {alpha:0 for alpha in alphabets} for sec in section_names}

# A function that takes the total count and the number of sections to be cut into
def split(total, sections, alphabet):
    # Here is where all the magic happens: linspace gives evenly spaced integer
    # cut points from 0 to total, e.g. for total=89 and 12 sections
    # [0, 7, 14, 22, 29, 37, 44, 51, 59, 66, 74, 81, 89]
    split_arr = np.linspace(0, total, len(sections) + 1, dtype='int')
    # The differences between consecutive cut points are the partition sizes
    final_arr = [{alphabet: int(n)} for n in np.diff(split_arr)]
    # Finally create a dictionary containing all sections and the number of students for the given alphabet
    return dict(zip(sections, final_arr))


for section_n in final_sections:
    for alphabet, counts in alphabet_counts.items():
        temp = split(counts, section_names, alphabet) # Get the partition for each alphabet and for each section
        final_sections[section_n][alphabet] = temp[section_n][alphabet] # From the obtained partitions, assign these numbers that were previously initialised to zero
        
for section, student_initials in final_sections.items(): # Print the dictionary as needed
    print(f"Section {section} will have: ")
    for init, count in student_initials.items():
        print(f"- {count} students whose name start with {init}")
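
To make the arithmetic behind the split easier to see, here is a minimal standalone sketch that applies the same linspace idea to the three initial counts quoted in the question (A = 89, B = 47, C = 7). The helper name `per_section_counts` is made up for this illustration and is not part of the code above.

```python
import numpy as np

def per_section_counts(total, n_sections=12):
    """Hypothetical helper: split `total` into n_sections near-equal integers."""
    # Evenly spaced integer cut points from 0 to total
    cuts = np.linspace(0, total, n_sections + 1, dtype=int)
    # Differences between consecutive cut points are the per-section counts
    return np.diff(cuts)

# Initial counts quoted in the question (only three letters, for brevity)
letter_totals = {"A": 89, "B": 47, "C": 7}
table = {letter: per_section_counts(n) for letter, n in letter_totals.items()}

for letter, counts in table.items():
    print(letter, counts.tolist(), "sum =", int(counts.sum()))

# Column-wise sum: how many students each section receives from these letters
section_totals = np.sum(list(table.values()), axis=0)
print("per-section totals:", section_totals.tolist())
```

Each row adds up to the letter's total, and no section gets more than one student above or below another for a given letter. The per-section totals, however, are not forced to be equal, because the larger shares of different letters can land in the same sections.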

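Here is a preliminary solution. It assigns a section number based on the occurrence of a particular name among all the names starting with the same letter, compared with the number of names starting with that letter that should be in each section. Note that `occurrence` is computed as the offset from the first name with the same initial, so the names are expected to be sorted alphabetically already: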

=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
section)

Then sort by the assigned section number:

=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
SORTBY(names,section,1,names,1))
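
For readers working in pandas rather than Excel, the same arithmetic can be sketched roughly as follows. This is an illustrative translation, not part of the original answer: the function name `assign_sections_like_formula` and the toy names are assumptions, and the `QUOTIENT(occurrence+0.5, counts/sections)` step is rewritten in integer arithmetic to avoid floating-point edge cases. Because it uses `groupby(...).cumcount()`, it does not need the names to be pre-sorted.

```python
import pandas as pd

def assign_sections_like_formula(names, n_sections=12):
    """Hypothetical helper: mirrors the LET formula's arithmetic in pandas."""
    df = pd.DataFrame({"Student Name": list(names)})
    initial = df["Student Name"].str[:1].str.upper().rename("initial")
    # occurrence: 0-based position of a name among names sharing its initial
    occurrence = df.groupby(initial).cumcount()
    # counts: how many names share that initial (COUNTIF in the formula)
    counts = initial.map(initial.value_counts())
    # QUOTIENT(occurrence + 0.5, counts / sections) in integer form:
    # floor((2 * occurrence + 1) * sections / (2 * counts)), giving 0..sections-1
    df["section"] = ((2 * occurrence + 1) * n_sections) // (2 * counts)
    # Equivalent of SORTBY(names, section, 1, names, 1)
    return df.sort_values(["section", "Student Name"]).reset_index(drop=True)

# Toy data with the A/B/C counts quoted in the question (not the real sheet)
toy_names = (
    [f"A{i:02d}" for i in range(89)]
    + [f"B{i:02d}" for i in range(47)]
    + [f"C{i}" for i in range(7)]
)
demo = assign_sections_like_formula(toy_names)
print(demo.groupby("section").size())
```

The section numbers here run from 0 to 11, as in the formula; mapping them to the letters A to L is left out to keep the sketch short.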

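Because the number of students can only be a whole number, the size of each section will vary, depending on how many of the allocated counts come out below and how many above the theoretical value (for example, each section should have 7.4 As, but in practice it can only have 7 or 8). I did some analysis on this, and the group sizes fall out like this: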

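It is hard to get both the group sizes and the number of each letter exactly right. I think you would need an iterative approach to go any further.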
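
As one possible alternative (not the iterative approach mentioned above, and not taken from either answer), a simple round-robin deal over the alphabetically sorted names keeps both quantities within one of each other: section sizes differ by at most one, and because each letter's names sit in consecutive positions after sorting, each section gets either the floor or the ceiling of count/12 for every letter. The helper name `deal_round_robin`, the `New Section` column and the toy data are assumptions made for this sketch.

```python
import pandas as pd

def deal_round_robin(df, name_col="Student Name", n_sections=12):
    """Hypothetical helper: sort by name, then deal rows out like cards."""
    out = df.sort_values(name_col, key=lambda s: s.str.upper(), kind="stable")
    out = out.reset_index(drop=True)
    # Row i goes to section i mod n_sections, labelled A, B, C, ...
    out["New Section"] = [chr(ord("A") + i % n_sections) for i in range(len(out))]
    return out

# Toy data using initial counts quoted in the question (not the real sheet)
toy = pd.DataFrame({
    "Student Name": [f"A{i:02d}" for i in range(89)]
    + [f"B{i:02d}" for i in range(47)]
    + [f"C{i}" for i in range(7)]
    + ["Y0", "Z0", "Z1"]
})
dealt = deal_round_robin(toy)

# Section sizes, then a letter-by-section table of counts
print(dealt["New Section"].value_counts().sort_index())
print(pd.crosstab(dealt["Student Name"].str[:1], dealt["New Section"]))
```

One difference from the layout described in the question: the sections that receive the extra student for a given letter depend on where that letter starts in the sorted list, so a rare initial such as Y does not necessarily end up in Section A.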