New column based on existing string column in Python
My dataframe looks like:
School | Term | Students |
---|---|---|
A | summer 2020 | 324 |
B | spring 21 | 101 |
A | summer/spring | 201 |
F | wintersem | 44 |
C | fall trimester | 98 |
E | | 23 |
I need to add a new Termcode column that takes any of 6 values: summer, spring, fall, winter, multiple, none, based on the corresponding value in the Term column, i.e.:
School | Term | Students | Termcode |
---|---|---|---|
A | summer 2020 | 324 | summer |
B | spring 21 | 101 | spring |
A | summer/spring | 201 | multiple |
F | wintersem | 44 | winter |
C | fall trimester | 98 | fall |
E | | 23 | none |
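For anyone who wants to reproduce this, the input frame can be rebuilt roughly as follows. This is only a sketch from the table above; it assumes the empty Term cell for school E is a missing value (NaN) rather than an empty string.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table.
# Assumption: the blank Term for school E is NaN, not "".
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})

print(df)
```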
You can use a regular expression with str.extractall and fill in the value depending on the number of matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one distinct term matched
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
Output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
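To make the intermediate step easier to follow, here is a self-contained sketch of the same extractall approach. The sample frame is rebuilt from the question (assuming the blank Term is NaN), and the names `matches` and `termcode` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; the blank Term is assumed to be NaN.
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})

terms = ["summer", "spring", "fall", "winter"]
regex = "(" + "|".join(terms) + ")"  # '(summer|spring|fall|winter)'

# extractall returns one row per match, indexed by (original row, match number);
# rows with no match (including NaN) simply do not appear.
matches = df["Term"].str.extractall(regex)[0]
g = matches.groupby(level=0)

# First match per row, overwritten with "multiple" when several distinct terms matched.
termcode = g.first().mask(g.nunique().gt(1), "multiple")

# Realign to the full index so unmatched rows become NaN, then label them "none".
df["Termcode"] = termcode.reindex(df.index).fillna("none")

print(df)
```

Because the check uses `nunique`, a value that repeats the same term (for example a hypothetical "summer 2020, summer 2021") would be coded as summer rather than multiple; switch to `g.size()` if repeated terms should count as multiple.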
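Alternatively, use Series.str.findall and check how many matches each row has:
import numpy as np
l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall(fr"{'|'.join(l)}")
# more than one match -> "multiple", otherwise the first match (NaN if there is none)
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])
Output (the row with a missing Term stays NaN):
School Term Students Termcode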
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
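The findall version leaves unmatched rows as NaN, whereas the question asks for "none". One possible way to close that gap is sketched below; it swaps `np.where` for `Series.mask` plus `fillna`, which is a variation on the answer rather than the answerer's own code. The sample frame again assumes the blank Term is NaN.

```python
import numpy as np
import pandas as pd

# Sample frame from the question; the blank Term is assumed to be NaN.
df = pd.DataFrame({
    "School": ["A", "B", "A", "F", "C", "E"],
    "Term": ["summer 2020", "spring 21", "summer/spring",
             "wintersem", "fall trimester", np.nan],
    "Students": [324, 101, 201, 44, 98, 23],
})

terms = ["summer", "spring", "fall", "winter"]

# One list of matches per row; NaN where Term itself is missing.
found = df["Term"].str.findall("|".join(terms))

# First match per row (NaN for an empty list or a missing Term).
first = found.str[0]

# More than one match -> "multiple"; anything still missing -> "none".
df["Termcode"] = first.mask(found.str.len() > 1, "multiple").fillna("none")

print(df)
```

Note that `found.str.len()` counts every match, so a value that repeats the same term would be coded as multiple here, unlike the `nunique`-based extractall approach.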