Converting years and months to months from a string column in Python

Sample dataset:

experience

5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year

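I have a column containing the applicants' experience. It is very messy, so I went through it and created a sample. The values are numbers followed by a unit (month or months) or (year or years).

There are many nan entries, and they should be ignored.

The goal is to create a column with the experience in months: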

if nan
  copy nan to the corresponding column
if the row has month or months 
  copy the number to the corresponding column
if year or years in the row and the number <55 
  the number shall be multiplied by 12 and copied to the corresponding column
else copy nan to the corresponding column

How can this be implemented?

A simple solution using regular expressions that keeps each step transparent.

import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""experience

5 month
nan
1 months
8 month
17 months
8 year
11 years
1.7 year
3.1 years
15.7 months
18 year
2017.2 years
98.3 years
68 year"""))

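# unit: the alphabetic part; val: the numeric part as a float
# months: month(s) copied as-is, year(s) below 55 multiplied by 12, everything else NaN
df = df.assign(unit=lambda dfa: dfa["experience"].str.extract(r"([a-z]+)", expand=False),
         val=lambda dfa: dfa["experience"].str.extract(r"([0-9,\.]+)", expand=False).astype(float),
         months=lambda dfa: np.where(dfa["unit"].isin(["month","months"]), dfa["val"],
                                    np.where(dfa["unit"].isin(["year","years"])
                                             &dfa["val"].lt(55), dfa["val"]*12, np.nan)))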

print(df.to_string(index=False))

Output

   experience    unit     val  months
      5 month   month     5.0     5.0
          NaN     NaN     NaN     NaN
     1 months  months     1.0     1.0
      8 month   month     8.0     8.0
    17 months  months    17.0    17.0
       8 year    year     8.0    96.0
     11 years   years    11.0   132.0
     1.7 year    year     1.7    20.4
    3.1 years   years     3.1    37.2
  15.7 months  months    15.7    15.7
      18 year    year    18.0   216.0
 2017.2 years   years  2017.2     NaN
   98.3 years   years    98.3     NaN
      68 year    year    68.0     NaN
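As a variation on the same idea, the number and the unit can be captured by one regular expression with named groups, and the unit can be mapped to a multiplier. This is only a sketch under the assumption that every valid entry looks like "<number> <month|months|year|years>"; the column name `months` and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"experience": ["5 month", np.nan, "1.7 year", "15.7 months", "68 year"]})

# One regex with two named groups: the number and the unit (optional trailing "s").
parts = df["experience"].str.extract(
    r"^\s*(?P<val>\d+(?:\.\d+)?)\s*(?P<unit>month|year)s?\s*$"
)
val = pd.to_numeric(parts["val"])

# Months per unit; rows with a missing or unmatched unit map to NaN.
factor = parts["unit"].map({"month": 1, "year": 12})

# Year values of 55 or more are treated as invalid, as in the question.
too_many_years = parts["unit"].eq("year") & val.ge(55)
df["months"] = (val * factor).mask(too_many_years)

print(df)
```

Anchoring the pattern with `^` and `$` means that entries in an unexpected format end up as NaN instead of being partially parsed.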

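There may be a better way to do this with a pandas DataFrame, but is this what you are trying to achieve? If nothing else, you can probably reuse the regular expressions. I have not added the < 55 years condition, but I'm sure you can work that out.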

import re
applicants = []

applicant1 = {'name': 'Lisa', 'experience': 'nan'}
applicant2 = {'name': 'Bill', 'experience': '3.1 months'}
applicant3 = {'name': 'Mandy', 'experience': '1 month'}
applicant4 = {'name': 'Geoff', 'experience': '6.7 years'}
applicant5 = {'name': 'Patricia', 'experience': '1 year'}
applicant6 = {'name': 'Kirsty', 'experience': '2017.2 years'}

applicants.append(applicant1)
applicants.append(applicant2)
applicants.append(applicant3)
applicants.append(applicant4)
applicants.append(applicant5)
applicants.append(applicant6)

print(applicants)

# Raw strings avoid invalid-escape warnings; group(1) is the number
month_pattern = r'^(\d+(?:\.\d+)?) months?'
year_pattern = r'^(\d+(?:\.\d+)?) years?'

applicant_output = []

for applicant in applicants:
    if applicant['experience'] == 'nan':
        applicant_output.append(applicant)
    else:
        month = re.search(month_pattern, applicant['experience'])
        if month is not None:
            applicant_output.append(
                {
                    'name': applicant['name'],
                    "exprience_months": month.group(1)
                })
        else:
            year = re.search(year_pattern, applicant['experience'])
            if year is not None:
                months = str(float(year.group(1)) * 12)
                applicant_output.append(
                    {
                        'name': applicant['name'],
                        "exprience_months": months
                    })

print(applicant_output)

The first print shows the input list:

[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'experience': '3.1 months'}, {'name': 'Mandy', 'experience': '1 month'}, {'name': 'Geoff', 'experience': '6.7 years'}, {'name': 'Patricia', 'experience': '1 year'}, {'name': 'Kirsty', 'experience': '2017.2 years'}]

Result:

[{'name': 'Lisa', 'experience': 'nan'}, {'name': 'Bill', 'exprience_months': '3.1'}, {'name': 'Mandy', 'exprience_months': '1'}, {'name': 'Geoff', 'exprience_months': '80.4'}, {'name': 'Patricia', 'exprience_months': '12.0'}, {'name': 'Kirsty', 'exprience_months': '24206.4'}]
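The missing < 55 years condition could be added along the following lines. This is a sketch, not part of the answer above: the helper `to_months` and its `max_years` parameter are hypothetical names, and it returns a float (or None) rather than a string.

```python
import re

# Number, one space, then month(s) or year(s); anything else is rejected.
PATTERN = re.compile(r"^(\d+(?:\.\d+)?) (month|year)s?$")

def to_months(text, max_years=55):
    """Return the experience in months, or None if it cannot be used."""
    match = PATTERN.match(text)
    if match is None:  # covers 'nan' and any unexpected format
        return None
    value = float(match.group(1))
    if match.group(2) == "month":
        return value
    # Unit is year: convert only when the value is below the threshold.
    return value * 12 if value < max_years else None

samples = ["nan", "3.1 months", "1 month", "6.7 years", "1 year", "2017.2 years"]
results = {s: to_months(s) for s in samples}
print(results)
```

With this version '2017.2 years' yields None instead of a month count, while '1 year' still yields 12.0.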

This assumes a consistent format (value, space, time period). You can use split to get the two parts.

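import numpy as np
import pandas as pd

df = pd.DataFrame({'experience': ['5 month', np.nan, '1 months', '8 month', '17 months', '8 year', '11 years']})

def get_values(x):
    # Missing values are passed through unchanged
    if pd.isnull(x):
        return x
    parts = x.split(' ')
    val = float(parts[0])  # float rather than int, so values such as '1.7' also parse
    prd = parts[1]
    if prd in ['month', 'months']:
        return val
    if prd in ['year', 'years'] and val < 55:
        return val * 12
    return np.nan  # 55 years or more, or an unknown unit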

df['months'] = df.apply(lambda x: get_values(x.experience), axis=1)  

Output:

  experience  months
0    5 month     5.0
1        NaN     NaN
2   1 months     1.0
3    8 month     8.0
4  17 months    17.0
5     8 year    96.0
6   11 years   132.0

If the proportion of NaN values is high, you can filter them out before running the lambda function:

df[df.experience.notnull()].apply(lambda x: get_values(x.experience), axis=1)
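The filtered result keeps the original index, so assigning it back to the frame leaves the skipped rows as NaN. A minimal sketch of that behaviour, using a simplified stand-in for get_values (the name `to_months` is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"experience": ["5 month", np.nan, "8 year", np.nan, "68 year"]})

def to_months(text):
    # Simplified stand-in for get_values; assumes "<number> <unit>".
    number, unit = text.split(" ")
    value = float(number)
    if unit in ("month", "months"):
        return value
    if unit in ("year", "years") and value < 55:
        return value * 12
    return np.nan

# Only the non-null rows are passed to the function ...
converted = df.loc[df["experience"].notna(), "experience"].apply(to_months)

# ... and index alignment fills the skipped rows with NaN on assignment.
df["months"] = converted
print(df)
```

Here `converted` only has entries for index 0, 2 and 4; rows 1 and 3 become NaN when the column is assigned.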

temp_df separates the number from the month/year part:

temp_df = df['experience'].str.split('([A-Za-z]+)', expand=True)
temp_df = temp_df.loc[:, ~(temp_df == "").any(axis=0)]  # deleting the extra column coming upon split
temp_df[0] = temp_df[0].astype(float)
temp_df
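The extra empty column comes from splitting on a pattern with a capturing group. The same effect can be seen with Python's own re.split, assuming pandas treats the multi-character pattern as a regular expression (which, as far as I know, is its default):

```python
import re

# A capturing group makes re.split keep the matched delimiter in the result,
# followed by whatever comes after it (here an empty string).
pieces = re.split(r"([A-Za-z]+)", "17 months")
print(pieces)  # ['17 ', 'months', '']

# float() ignores surrounding whitespace, so the trailing space is harmless.
value = float(pieces[0])
print(value)  # 17.0
```

This is why the third column has to be dropped and why the first column converts to float without an explicit strip.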

Get the multiplier for each experience value:

multiplier = pd.Series([1] * len(temp_df), index=temp_df.index)
year_rows = temp_df[1].str.contains('year', case=False, na=False)  # rows whose unit is year(s); missing values count as False
temp_df.loc[year_rows & (temp_df[0] >= 55), 0] = np.nan  # set the value to NaN where the unit is year and the value is >= 55
multiplier[year_rows] = 12
df['experience_in_months'] = temp_df[0] * multiplier
df

import numpy as np
import pandas as pd

my_dict = {'Experience': ['5 month', 'nan', '1 months', '8 month','17 months','8 year',
                          '11 years','1.7 year', '3.1 years', '15.7 months','18 year',
                          '2017.2 years', '98.3 years', '68 year']}

df = pd.DataFrame(my_dict)

# Boolean mask for rows that contain month/months
month_filt = df['Experience'].str.contains('month')

# Extract the number from the month rows (all other rows become NaN).
# str.extract is used because str.strip removes a set of characters, not a pattern.
df['Months'] = df.loc[month_filt, 'Experience'].str.extract(r'([\d.]+)', expand=False).astype(float)

# Boolean mask for rows that contain year/years
year_filt = df['Experience'].str.contains('year')

# Extract the number from the year rows (all other rows become NaN)
df['Years'] = df.loc[year_filt, 'Experience'].str.extract(r'([\d.]+)', expand=False).astype(float)

# Convert Years to months where Months is still missing
df.loc[df['Months'].isna(), 'Months'] = df['Years'] * 12

# Set Months to NaN where Years is 55 or more
df.loc[df['Years'] >= 55, 'Months'] = np.nan

Output:

    Experience    Months  Years
0   5 month       5       NaN
1   nan           NaN     NaN
2   1 months      1       NaN
3   8 month       8       NaN
4   17 months     17      NaN
5   8 year        96      8
6   11 years      132     11
7   1.7 year      20.4    1.7
8   3.1 years     37.2    3.1
9   15.7 months   15.7    NaN
10  18 year       216     18
11  2017.2 years  NaN     2017.2
12  98.3 years    NaN     98.3
13  68 year       NaN     68