New column based on an existing string column in Python

My dataframe looks like:

School  Term            Students
A       summer 2020     324
B       spring 21       101
A       summer/spring   201
F       wintersem       44
C       fall trimester  98
E                       23

I need to add a new Termcode column that takes one of 6 values (summer, spring, fall, winter, multiple, none) based on the corresponding value in the Term column, i.e.:

School  Term            Students  Termcode
A       summer 2020     324       summer
B       spring 21       101       spring
A       summer/spring   201       multiple
F       wintersem       44        winter
C       fall trimester  98        fall
E                       23        none

You can use a regex with str.extractall and fill in the values depending on the number of matches:

terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'

# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)

# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')

# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')

Output:

  School            Term  Students  Termcode
0      A     summer 2020       324    summer
1      B       spring 21       101    spring
2      A   summer/spring       201  multiple
3      F       wintersem        44    winter
4      C  fall trimester        98      fall
5      E             NaN        23      none
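
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question, and it assumes the empty Term for school E is stored as NaN; the intermediate name `matches` is only there for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; the missing Term is assumed to be NaN
df = pd.DataFrame({
    'School': ['A', 'B', 'A', 'F', 'C', 'E'],
    'Term': ['summer 2020', 'spring 21', 'summer/spring',
             'wintersem', 'fall trimester', np.nan],
    'Students': [324, 101, 201, 44, 98, 23],
})

terms = ['summer', 'spring', 'fall', 'winter']
regex = '(' + '|'.join(terms) + ')'

# extractall returns one row per match, indexed by (original row, match number)
matches = df['Term'].str.extractall(regex)

# group the matches back by original row
g = matches[0].groupby(level=0)

# first match per row, replaced by "multiple" when a row has several distinct terms
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')

# rows without any match are absent from the grouped result and come back as NaN
df['Termcode'] = df['Termcode'].fillna('none')

print(df)
```

Because `nunique` counts distinct terms, a value such as "summer/summer" would still be coded as summer rather than multiple; whether that is the desired behaviour depends on the data.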

Using Series.str.findall:

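import numpy as np

l = ['summer', 'spring', 'fall', 'winter']
# list of every term found in each row (NaN when Term is missing)
s = df['Term'].str.findall(fr"{'|'.join(l)}")
# more than one match -> 'multiple', otherwise the first match (NaN if none)
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])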

  School            Term  Students  Termcode
0      A     summer 2020       324    summer
1      B       spring 21       101    spring
2      A   summer/spring       201  multiple
3      F       wintersem        44    winter
4      C  fall trimester        98      fall
5      E             NaN        23       NaN
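
The last row above is NaN rather than the "none" the question asks for. One possible way to close that gap, sketched here on a small standalone Series rather than the original dataframe (the names `term_series`, `found` and `termcode`, and the extra "year-round" value, are made up for illustration), is to take the first match, overwrite it where several matches were found, and fill what remains:

```python
import numpy as np
import pandas as pd

terms = ['summer', 'spring', 'fall', 'winter']

# Hypothetical sample values, including a missing Term and a Term with no match
term_series = pd.Series(
    ['summer 2020', 'summer/spring', 'wintersem', np.nan, 'year-round']
)

# list of matches per row; NaN where the Term itself is missing
found = term_series.str.findall('|'.join(terms))

# first match per row; NaN for an empty list or a missing Term
first = found.str[0]

# more than one match -> "multiple"; anything still missing -> "none"
termcode = first.mask(found.str.len().gt(1), 'multiple').fillna('none')

print(termcode.tolist())
```

Unlike the extractall version, this counts every match, so a repeated term such as "summer/summer" would be coded as multiple.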