Converting years and months to months from a string column in Python
Sample dataset:
experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year
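I have a column holding applicants' experience. It is very messy, so I went through it and put together a sample. Each entry is a number followed by either month/months or year/years.
There are many nan entries, and they should be ignored.
The goal is to create a column with the experience in months:
if nan
    copy nan to the corresponding column
if the row has month or months
    copy the number to the corresponding column
if year or years in the row and the number < 55
    the number shall be multiplied by 12 and copied to the corresponding column
else
    copy nan to the corresponding column
How can this be done?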
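A simple solution using regular expressions, keeping the intermediate columns for transparency.
import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""experience
5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year"""))

df = df.assign(
    # unit: the first run of lowercase letters (month, months, year, years)
    unit=lambda dfa: dfa["experience"].str.extract(r"([a-z]+)", expand=False),
    # val: the number at the start of the entry, as a float
    val=lambda dfa: dfa["experience"].str.extract(r"([0-9.]+)", expand=False).astype(float),
    # months: copy month values, convert year values below 55, NaN otherwise
    months=lambda dfa: np.where(
        dfa["unit"].isin(["month", "months"]),
        dfa["val"],
        np.where(
            dfa["unit"].isin(["year", "years"]) & dfa["val"].lt(55),
            dfa["val"] * 12,
            np.nan,
        ),
    ),
)
print(df.to_string(index=False))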
Output:
  experience    unit     val  months
     5 month   month     5.0     5.0
         NaN     NaN     NaN     NaN
    1 months  months     1.0     1.0
     8 month   month     8.0     8.0
   17 months  months    17.0    17.0
      8 year    year     8.0    96.0
    11 years   years    11.0   132.0
    1.7 year    year     1.7    20.4
   3.1 years   years     3.1    37.2
 15.7 months  months    15.7    15.7
     18 year    year    18.0   216.0
2017.2 years   years  2017.2     NaN
  98.3 years   years    98.3     NaN
     68 year    year    68.0     NaN
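A variant of the same idea, as a rough sketch only: named regex groups pull out the number and the unit in one pass, and a small lookup table supplies the months-per-unit factor. The column name `months`, the shortened sample data, and the reading of "number < 55" as a cutoff that applies to year values only are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Shortened sample, assumed to follow the "<number> <unit>" layout from the question.
df = pd.DataFrame({"experience": ["5 month", np.nan, "1.7 year", "3.1 years", "68 year"]})

# Named groups become the column names of the extracted frame.
parts = df["experience"].str.extract(r"(?P<val>\d+(?:\.\d+)?)\s*(?P<unit>month|year)s?")
val = parts["val"].astype(float)

# Months per unit; rows without a recognised unit map to NaN.
factor = parts["unit"].map({"month": 1, "year": 12})

# Year values of 55 or more are treated as invalid and blanked out.
too_long = parts["unit"].eq("year") & val.ge(55)
df["months"] = (val * factor).mask(too_long)

print(df)
```

Keeping the unit-to-factor mapping in a dictionary means another unit could be added later without touching the rest of the logic.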
There may be a better way to do this with a pandas DataFrame, but is this what you are trying to achieve? If nothing else, you can use a regular expression. I did not add the < 55 years condition, but I'm sure you can work that out.
import re

applicants = [
    {'name': 'Lisa', 'experience': 'nan'},
    {'name': 'Bill', 'experience': '3.1 months'},
    {'name': 'Mandy', 'experience': '1 month'},
    {'name': 'Geoff', 'experience': '6.7 years'},
    {'name': 'Patricia', 'experience': '1 year'},
    {'name': 'Kirsty', 'experience': '2017.2 years'},
]
print(applicants)

# Group 1 captures the number; the trailing "s" is optional.
month_pattern = r'^(\d+(?:\.\d+)?) months?'
year_pattern = r'^(\d+(?:\.\d+)?) years?'

applicant_output = []
for applicant in applicants:
    if applicant['experience'] == 'nan':
        # Missing entries are passed through unchanged.
        applicant_output.append(applicant)
    else:
        month = re.search(month_pattern, applicant['experience'])
        if month is not None:
            applicant_output.append(
                {
                    'name': applicant['name'],
                    "exprience_months": month.group(1)
                })
        else:
            year = re.search(year_pattern, applicant['experience'])
            if year is not None:
                # Convert years to months.
                months = str(float(year.group(1)) * 12)
                applicant_output.append(
                    {
                        'name': applicant['name'],
                        "exprience_months": months
                    })
print(applicant_output)
The first print shows the input:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'experience': '3.1 months'}, {'name': 'Mandy', 'experience': '1 month'}, {'name': 'Geoff', 'experience': '6.7 years'}, {'name': 'Patricia', 'experience': '1 year'}, {'name': 'Kirsty', 'experience': '2017.2 years'}]
The second print shows the result:
[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'exprience_months': '3.1'}, {'name': 'Mandy', 'exprience_months': '1'}, {'name': 'Geoff', 'exprience_months': '80.4'}, {'name': 'Patricia', 'exprience_months': '12.0'}, {'name': 'Kirsty', 'exprience_months': '24206.4'}]
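For completeness, here is one way the missing < 55 years condition might be added to the regular-expression approach. This is only a sketch: the function name `experience_in_months`, the constant `MAX_YEARS`, and the choice to return `None` for missing or invalid entries are assumptions, not part of the answer above.

```python
import re

# Assumed cutoff from the question: year values of 55 or more count as invalid.
MAX_YEARS = 55

# One pattern for both units; group 1 is the number, group 2 the unit without "s".
PATTERN = re.compile(r"^(\d+(?:\.\d+)?) (month|year)s?$")


def experience_in_months(text):
    """Return the experience in months as a float, or None if missing or invalid."""
    match = PATTERN.match(text)
    if match is None:  # covers 'nan' and anything that does not fit the layout
        return None
    value = float(match.group(1))
    if match.group(2) == "month":
        return value
    # Unit is "year": convert only when the value is below the cutoff.
    return value * 12 if value < MAX_YEARS else None


for text in ["nan", "3.1 months", "1 month", "1 year", "2017.2 years"]:
    print(text, "->", experience_in_months(text))
```

Returning a float rather than a string keeps the result ready for arithmetic, which the string values in the result above are not.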
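This assumes a consistent format (value, space, time period). You can use split to get the two parts.
import numpy as np
import pandas as pd

df = pd.DataFrame({'experience': ['5 month', np.nan, '1 months', '8 month', '17 months', '8 year', '11 years']})

def get_values(x):
    if pd.notnull(x):
        # float rather than int, so that values such as '1.7 year' also parse
        val = float(x.split(' ')[0])
        prd = x.split(' ')[1]
        if prd in ['month', 'months']:
            return val
        elif prd in ['year', 'years'] and val < 55:
            return val * 12
        # anything else (for example 55 years or more) returns None implicitly
    else:
        # missing entries are passed through unchanged
        return x

df['months'] = df.apply(lambda x: get_values(x.experience), axis=1)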
Output:
   experience  months
0     5 month     5.0
1         NaN     NaN
2    1 months     1.0
3     8 month     8.0
4   17 months    17.0
5      8 year    96.0
6    11 years   132.0
If the proportion of NaN values is high, you can filter them out before running the lambda function:
df[df.experience.notnull()].apply(lambda x: get_values(x.experience), axis=1)
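The filtered call returns a result only for the non-null rows, so it helps to see how that result lands back in the full frame. The sketch below assumes a simplified converter named `to_months` (a hypothetical stand-in for `get_values`) and a shortened sample; it relies on pandas aligning the assigned Series on the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"experience": ["5 month", np.nan, "8 year", np.nan, "11 years"]})


def to_months(text):
    # Hypothetical stand-in for get_values; expects a non-null "<number> <unit>" string.
    number, unit = text.split(" ")
    value = float(number)
    if unit in ("month", "months"):
        return value
    if unit in ("year", "years") and value < 55:
        return value * 12
    return np.nan


present = df["experience"].notna()

# Only the non-null rows are converted. Assignment aligns on the index,
# so the rows that were skipped end up as NaN in the new column.
df["months"] = df.loc[present, "experience"].apply(to_months)

print(df)
```

Because the converter never sees a missing value, it no longer needs its own null check.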
Separate out the month/year part into temp_df:
temp_df = df['experience'].str.split('([A-Za-z]+)', expand=True)
temp_df = temp_df.loc[:, ~(temp_df == "").any(axis=0)] # deleting the extra column coming upon split
temp_df[0] = temp_df[0].astype(float)
temp_df
Get the multiplier for each experience value:
multiplier = pd.Series([1] * len(temp_df), index=temp_df.index)
year_rows = temp_df[1].str.contains('year', case=False).fillna(False) # getting the rows which has year
temp_df.loc[(year_rows) & (temp_df[0]>=55), 0] = np.nan # converting exp value to nan where value is >= 55 and unit is year
multiplier[year_rows] = 12
df['experience_in_months'] = temp_df[0] * multiplier
df
import numpy as np
import pandas as pd

my_dict = {'Experience': ['5 month', 'nan', '1 months', '8 month', '17 months', '8 year',
                          '11 years', '1.7 year', '3.1 years', '15.7 months', '18 year',
                          '2017.2 years', '98.3 years', '68 year']}
df = pd.DataFrame(my_dict)
# Create filter for month/months
month_filt = df['Experience'].str.contains('month')
# Filter DataFrame for rows that contain month/months
df['Months'] = df.loc[month_filt, 'Experience'].str.strip('month|months')
# Create filter for year/years
year_filt = df['Experience'].str.contains('year')
# Filter DataFrame for rows that contain year/years
df['Years'] = df.loc[year_filt, 'Experience'].str.strip('year|years')
# Fill NaN in Years column
df.loc[df['Years'].isna(),'Years'] = np.nan
# Convert Years to months
df.loc[df['Months'].isna(),'Months'] = df['Years'].astype('float') * 12
# Set years of 55 or more to NaN (the question keeps only values < 55)
df.loc[df['Years'].astype('float') >= 55, 'Months'] = np.nan
Output:
      Experience  Months   Years
0        5 month       5     NaN
1            nan     NaN     NaN
2       1 months       1     NaN
3        8 month       8     NaN
4      17 months      17     NaN
5         8 year      96       8
6       11 years     132      11
7       1.7 year    20.4     1.7
8      3.1 years    37.2     3.1
9    15.7 months    15.7     NaN
10       18 year     216      18
11  2017.2 years     NaN  2017.2
12    98.3 years     NaN    98.3
13       68 year     NaN      68
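One caveat about the approach above: `strip` treats its argument as a set of characters to remove from both ends, not as a pattern, so `'month|months'` does not mean "month or months". It happens to work here because digits, dots and spaces are not in that set. The sketch below shows the behaviour with plain Python strings, which pandas' `.str.strip` mirrors, and a regex-based alternative; the helper name `drop_unit` is made up for illustration.

```python
import re

# strip() removes any of the characters m, o, n, t, h, |, s from both ends.
print(repr("5 months".strip("month|months")))   # the trailing space is left behind
print(repr("smooth".strip("month|months")))     # every character is in the set


def drop_unit(text):
    """Remove a trailing 'month(s)' or 'year(s)' and any whitespace before it."""
    return re.sub(r"\s*(?:months?|years?)$", "", text)


for text in ["5 month", "15.7 months", "8 year", "2017.2 years"]:
    print(text, "->", repr(drop_unit(text)))
```

With a pandas column, the same pattern could be passed to `.str.replace(..., regex=True)` in place of `.str.strip`.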