How do I group survey responses that were typed out differently into common categories so the data is easier to process?

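I am preprocessing survey data that I received as a .csv file. One column contains the name of the course each student is taking. Because the students typed this in themselves, the same course name is spelled in many different ways. For example, the course name 'B.A. L.L.B.' has been entered as 'Ballb', 'bal.l.b.', and so on. I have tried the most basic brute-force approach I could think of, where I list all the variants in an if statement and replace them with a common spelling, but I still get a large number of values that the program fails to place under any of the statements. Is there a faster way to group them together?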

def get_course_name(x):
    if 'B.E' in x or 'B.E.' in x or 'BE' in x or 'B.E(cse)' in x or 'Bachelor Of Engineering' in x or 'BECSE' in x or 'Be' in x:
        return 'B.E.'
    if 'L.L.B.' in x or 'Ballb(h)' in x or 'Ballb' in x:
        return 'B.A. LLB'
    if 'B.Tech' in x or 'B.TECH' in x or 'B.tech' in x or 'B .Tech' in x or 'Btech' in x or 'BTech' in x or 'B-tech' in x or 'B.Tech.' in x or 'CSE' in x or 'Biotechnology' in x or 'Biotech' in x:
        return 'B. Tech'
    if 'B.pharmacy' in x or 'B. Pharmacy' in x or 'B pharma' in x or 'pharmacy' in x or 'B.Pharmacy' in x or 'M.pharmacy' in x or 'B.Pharm' in x or 'Pharma' in x or 'pharm' in x or 'Pharmacy' in x or 'B.pharma' in x or 'B-pharmacy' in x:
        return 'B. Pharma'
    if 'BBA' in x or 'bba' in x:
        return 'BBA'
    if 'MBA' in x or 'mba' in x or 'Mba' in x or 'MBA ' in x:
        return 'MBA'
    if 'M.Tech' in x or 'M. Tech' in x or 'mtech' in x or 'm.tech' in x or 'M-tech' in x or 'Mtec-EE' in x:
        return 'M. Tech'
    if 'MBBS' in x or 'mbbs' in x:
        return 'MBBS'
    if 'B.Sc' in x or 'B. Sc' in x or 'Bsc.' in x or 'B.S.c' in x:
        return 'B. Science'
    if 'msc' in x or 'M.Sc' in x or 'M. Sc' in x or 'Msc' in x or 'MSc' in x or 'm.sc' in x:
        return 'M. Science'
    return 'misc'

This is where I call the function to get the value counts for each course:

df1['Course Name'] = df1['Course Name'].apply(get_course_name)
df1['Course Name'].value_counts()

This is what the dataframe looks like.

The column I want to group is named 'Course Name'.

How about something like this?

course_key_to_id = {
    'msc': 'M. Science',
    'bba': 'BBA',
    # + the rest lower case without punctuation: normalized name
}

def get_course_name(course_name):
    course_name = course_name.replace('.', '').replace(' ', '').lower()
    return course_key_to_id.get(course_name)


if __name__ == '__main__':
    for t in ['M. Sc', 'Msc', 'MSc']:
        print(get_course_name(t))

Output:

M. Science
M. Science
M. Science

You can also use a regular expression to strip all non-alphanumeric characters (this requires import re), like this:

  course_name = re.sub("[^0-9a-zA-Z]+", "", course_name).lower()
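Putting the two ideas together, a minimal sketch might look like the following. The names normalize, COURSE_KEYS, and canonical_course, the sample keys, and the 'misc' fallback (borrowed from the question) are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical lookup table: normalized key -> canonical course name.
COURSE_KEYS = {
    'msc': 'M. Science',
    'bba': 'BBA',
    'ballb': 'B.A. LLB',
}

def normalize(course_name):
    # Drop everything except letters and digits, then lower-case.
    return re.sub(r'[^0-9a-zA-Z]+', '', course_name).lower()

def canonical_course(course_name, default='misc'):
    # Exact match on the normalized key; unknown spellings get the default.
    return COURSE_KEYS.get(normalize(course_name), default)

for raw in ['M. Sc', 'm.sc', 'B.A. L.L.B.', 'bal.l.b.', 'Physics']:
    print(raw, '->', canonical_course(raw))
```

With a dataframe like the one in the question, the same function could be passed to df1['Course Name'].apply(canonical_course). Because the lookup is an exact match on the normalized string, entries with extra words (for example 'Bsc. Economics Honrs.') would still fall through to 'misc'.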

Let's try building a mapper:

import pandas as pd

src_df = pd.DataFrame({'Course Name': ['B.E.', 'ME CSE', 'English Literature',
                                       'Bsc. Economics Honrs.', 'BSC nursing'],
                       'Course Year': ['Fourth', 'Second', 'First',
                                       'Second', "Fourth"]})

# Define Aliases Here (Desired Format on Left, Options on Right)
aliases = {
    'B.E.': ['B.E', 'B.E.', 'BE', 'B.E(cse)',
             'Bachelor Of Engineering', 'BECSE', 'Be'],
    'B. Science': ['B.Sc', 'B. Sc', 'Bsc.', 'B.S.c']
}

# Generate Mapper from aliases
mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}

# Apply Mapper to every Course Name
src_df['Course Name'] = src_df['Course Name'] \
    .apply(lambda x: pd.Series(map(mapper.get,
                                   filter(lambda v: v in x, mapper)))) \
    .fillna('misc')

# For Display
print(src_df.to_string())

Output:

  Course Name Course Year
0        B.E.      Fourth
1        misc      Second
2        misc       First
3  B. Science      Second
4        misc      Fourth

mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}

This builds a dictionary from the aliases defined above. The aliases dict exists only because it is more readable than the mapper.

Mapper:

{'B.E': 'B.E.', 'B.E.': 'B.E.',
 'BE': 'B.E.', 'B.E(cse)': 'B.E.',
 'Bachelor Of Engineering': 'B.E.',
 'BECSE': 'B.E.', 'Be': 'B.E.',
 'B.Sc': 'B. Science',
 'B. Sc': 'B. Science', 'Bsc.': 'B. Science',
 'B.S.c': 'B. Science'}

*Note: the mapper assumes that all aliases are unique and that no alias is used for more than one course.
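One way to guard that assumption is to build the mapper with an explicit loop and fail loudly when an alias is listed under two different codes. This is only a sketch: build_mapper is a hypothetical helper name, and the aliases below are a shortened version of the ones above.

```python
def build_mapper(aliases):
    # Same result as the dict comprehension, but a conflicting alias
    # raises an error instead of silently overwriting the earlier entry.
    mapper = {}
    for new_code, lst in aliases.items():
        for alias in lst:
            if alias in mapper and mapper[alias] != new_code:
                raise ValueError(
                    f"alias {alias!r} maps to both "
                    f"{mapper[alias]!r} and {new_code!r}")
            mapper[alias] = new_code
    return mapper

aliases = {
    'B.E.': ['B.E', 'B.E.', 'BE'],
    'B. Science': ['B.Sc', 'B. Sc', 'Bsc.'],
}
mapper = build_mapper(aliases)
print(mapper)
```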


After this, each course name is tested to see whether it contains any of the alias strings in the dictionary:

src_df['Course Name'] = src_df['Course Name'] \
    .apply(lambda x: pd.Series(map(mapper.get,
                                   filter(lambda v: v in x, mapper))))
print(src_df)

src_df:

  Course Name Course Year
0        B.E.      Fourth
1         NaN      Second
2         NaN       First
3  B. Science      Second
4         NaN      Fourth

Then go back and fill in the default case:

.fillna('misc')

This replaces the unmapped rows with the default value 'misc'.
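Note that a course name can contain more than one alias: 'B.E.' contains both 'B.E' and 'B.E.', so the lambda above yields two matches for that row. If exactly one label per row is wanted, a possible variant (a sketch, not the answer's code) returns only the first matching alias as a plain string. first_match is a hypothetical helper name, and the first match is decided by the order in which the aliases are listed.

```python
import pandas as pd

aliases = {
    'B.E.': ['B.E', 'B.E.', 'BE', 'B.E(cse)',
             'Bachelor Of Engineering', 'BECSE', 'Be'],
    'B. Science': ['B.Sc', 'B. Sc', 'Bsc.', 'B.S.c'],
}
mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}

def first_match(course_name, default='misc'):
    # Return the code of the first alias found inside the course name,
    # or the default when no alias occurs in it.
    for alias, new_code in mapper.items():
        if alias in course_name:
            return new_code
    return default

src_df = pd.DataFrame({'Course Name': ['B.E.', 'ME CSE', 'English Literature',
                                       'Bsc. Economics Honrs.', 'BSC nursing']})
src_df['Course Name'] = src_df['Course Name'].apply(first_match)
print(src_df['Course Name'].tolist())
```

Since the function already returns the default, no separate fillna step is needed in this variant.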