Excel/Python: How to group names alphabetically, section-wise, such that each group has an equal number of names per letter?
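I have a dummy Excel sheet of 879 students at a university, which I am working with to do some data analysis in Python. The Excel sheet has multiple columns, such as:
- Student Name
- Section
- Attendance
- Enrollment Number
But the students' names are not properly arranged in alphabetical order.
I want to distribute the students evenly section-wise, so that each section has an equal number of students and includes students whose names start with each letter of the alphabet.
I tried sorting the data alphabetically in Excel, but that sorts the whole column alphabetically, which is not what I want. Instead, I want the data arranged section-wise, so that each section has students whose names start with each letter, or none if all the names for that particular letter have already been used up in the preceding sections. All sections should have an equal (or almost equal) number of students.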
For example:
The dataset has 12 sections with 879 students:
Section A has 74 students
Section B has 74 students
Section C has 74 students
Section D has 73 students
Section E has 73 students
Section F has 73 students
Section G has 73 students
Section H has 73 students
Section I has 73 students
Section J has 73 students
Section K has 73 students
Section L has 73 students
Number of students having first character A is 89
Number of students having first character B is 47
Number of students having first character C is 7
: : : :
: : : :
Number of students having first character Y is 1
Number of students having first character Z is 2
My goal:
Section A will have:
- 7 students whose names start with A (89/12 = 7 students)
- 3 students whose names start with B (47/12 = 3 students)
- 1 student whose name starts with C (as 7 < 12, there are not enough students for every section, so 1)
: : :
: : :
- 1 student whose name starts with Y
- 1 student whose name starts with Z
Section B will have:
- 7 students whose names start with A
: : : : : :
: : : : : :
- 0 students whose names start with Y (as no student with a name starting with Y is left)
- 1 student whose name starts with Z
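Similarly, the other sections will follow the same kind of distribution.
Is there a way to achieve this using Excel or a query with the Python pandas library?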
Here is my Excel Sheet
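For reference, the section sizes listed in the question (three sections of 74 and nine of 73) follow from plain integer division with a remainder, and the same arithmetic applies to each initial. The sketch below is only an illustration; `even_split` is a hypothetical helper name, not something from the question. Note that 89/12 leaves a remainder of 5, so five sections would get 8 A's rather than 7.

```python
def even_split(total, parts):
    """Split `total` into `parts` integer sizes that differ by at most one."""
    base, extra = divmod(total, parts)
    # The first `extra` parts each receive one additional item.
    return [base + 1 if i < extra else base for i in range(parts)]


section_sizes = even_split(879, 12)  # three sections of 74, then nine of 73
a_per_section = even_split(89, 12)   # five sections with 8 A's, then seven with 7
c_per_section = even_split(7, 12)    # seven sections with 1 C, then five with 0
```

This variant puts the remainder in the earliest sections, which matches the "none if already used up" wording of the question.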
This can be achieved using the linspace function from numpy.
It is not the most elegant solution, but it works.
import numpy as np
import pandas as pd

# Read the CSV file
df = pd.read_csv('Student_data.csv', header=2)[1:]
# Keep only the first letter of each name
df['Name_initial'] = df['Student Name'].str[:1]
# Get a list of all sections
section_names = df['Section'].unique()
# Get a list of all initials
alphabets = df['Name_initial'].unique()
# Create a dictionary with initials as keys and the total counts as values
# (value_counts().to_dict() avoids relying on the column names produced by reset_index())
alphabet_counts = df['Name_initial'].value_counts().to_dict()
# Create a dictionary that holds 0 students for each initial in each section
final_sections = {sec: {alpha: 0 for alpha in alphabets} for sec in section_names}

# Takes the total count for one initial and the sections to split it across
def split(total, sections, alphabet):
    # Here is where all the magic happens: integer cut points from 0 to total
    split_arr = list(np.linspace(0, total, len(sections) + 1, dtype='int'))
    final_arr = []
    # split_arr looks something like [0, 7, 14, 22, 29, 37, 44, 51, 59, 66, 74, 81, 89];
    # the difference between consecutive values gives the partition sizes
    for ix, i in enumerate(split_arr):
        if ix + 1 < len(split_arr):
            final_arr.append({alphabet: split_arr[ix + 1] - i})
    # Map each section to the number of students with the given initial
    return dict(zip(sections, final_arr))

# Compute the partition once per initial and fill in the counts initialised to zero
for alphabet, counts in alphabet_counts.items():
    temp = split(counts, section_names, alphabet)
    for section_n in final_sections:
        final_sections[section_n][alphabet] = temp[section_n][alphabet]

# Print the distribution
for section, student_initials in final_sections.items():
    print(f"Section {section} will have: ")
    for init, count in student_initials.items():
        print(f"- {count} students whose names start with {init}")
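The core of the `split` function above can be seen in isolation. The following is a minimal sketch, assuming 89 names starting with A and 12 sections as in the question; `per_section_counts` is a hypothetical helper name. It takes the integer cut points from `np.linspace` and uses `np.diff` to turn them into per-section counts.

```python
import numpy as np


def per_section_counts(total, n_sections):
    # Integer cut points from 0 to total, one more than the number of sections.
    cuts = np.linspace(0, total, n_sections + 1, dtype=int)
    # Differences between consecutive cut points are the per-section counts.
    return np.diff(cuts)


counts_a = per_section_counts(89, 12)
print(counts_a.tolist())  # [7, 7, 8, 7, 8, 7, 7, 8, 7, 8, 7, 8]
print(counts_a.sum())     # 89
```

Unlike a plain quotient-and-remainder split, this spreads the sections that get an extra student across the range instead of putting them all first.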
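Here is a preliminary solution. It assigns a section number based on where a particular name falls among all the names starting with the same letter, compared with how many names starting with that letter each section should have: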
=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
section)
Then sort by the assigned section number:
=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
SORTBY(names,section,1,names,1))
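For readers working in pandas instead, the same idea can be sketched roughly as follows. This is an illustrative translation, not the formula's author's code: `assign_sections` is a hypothetical name, section numbers are 0-based, and the names are sorted first so that each letter forms one contiguous run (the `seq-startrow` step in the formula appears to depend on that ordering).

```python
import pandas as pd


def assign_sections(names, n_sections):
    s = pd.Series(sorted(names), name="name")
    initial = s.str[0]
    # 0-based position of each name among the names sharing its initial.
    occurrence = s.groupby(initial).cumcount()
    # Number of names sharing the initial, repeated on every row.
    counts = initial.map(initial.value_counts())
    per_section = counts / n_sections
    # Same rule as QUOTIENT(occurrence + 0.5, countspersection).
    section = ((occurrence + 0.5) / per_section).astype(int)
    return pd.DataFrame({"name": s, "section": section})


demo = [f"A{i:02d}" for i in range(24)] + [f"B{i:02d}" for i in range(6)]
result = assign_sections(demo, 3)
print(result.groupby("section").size())
```

Sorting `result` by `section` and then `name` would mirror the `SORTBY` step.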
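Since the number of students can only be a whole number, the size of each section will vary depending on how many of the allocated counts are rounded down and how many are rounded up relative to the theoretical value (for example, each section should have 7.4 A's, but in practice can only have 7 or 8). I did some analysis on this, and the group sizes break down like this:
It is hard to get both the group sizes and the count per letter exactly right; I think you would need an iterative approach to get any further.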
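One possible way to get close on both constraints without iterating, offered only as a sketch and not checked against the actual sheet: sort the names and deal them out to the sections in turn, like cards. Because each letter occupies a contiguous run of the sorted list, every section should receive either the floor or the ceiling of that letter's share, and section sizes should differ by at most one. `deal_round_robin` is a hypothetical name, sections are numbered from 0, and the leftover names of a letter do not necessarily land in the earliest sections.

```python
import numpy as np
import pandas as pd


def deal_round_robin(names, n_sections):
    s = pd.Series(sorted(names), name="name")
    # Position in the sorted list, modulo the number of sections.
    section = np.arange(len(s)) % n_sections
    return pd.DataFrame({"name": s, "section": section})


demo = (
    [f"A{i:03d}" for i in range(89)]
    + [f"B{i:03d}" for i in range(47)]
    + [f"C{i:03d}" for i in range(7)]
)
dealt = deal_round_robin(demo, 12)
# Rows are initials, columns are sections, cells are student counts.
print(pd.crosstab(dealt["name"].str[0], dealt["section"]))
```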