Why does converting from string to integer, then back to string raise an exception?

Question

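I'm cleaning year data that comes in different formats. The year field of my DataFrame has seven possible values: ['2013-14','2014-15','2015-16','2016-17','22017','22018','22019']. I've solved the problem by handling each case manually, as shown below: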

matchups_df.loc[matchups_df['SEASON_ID'] == '22017', 'SEASON_ID'] = '2017-18'
matchups_df.loc[matchups_df['SEASON_ID'] == '22018', 'SEASON_ID'] = '2018-19'
matchups_df.loc[matchups_df['SEASON_ID'] == '22019', 'SEASON_ID'] = '2019-20'

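My question is: why does the code below raise ValueError: invalid literal for int() with base 10: '2016-17'? I've pulled the relevant part out of np.where and applied it to a filtered version of the DataFrame so that it only handles the necessary values, but it raises the same exception. Clearly I'm making some kind of syntax mistake when converting the string to int, but I haven't been able to diagnose where.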

matchups_df.insert(loc = 1, column = 'Season', value = (
    np.where(
    (len(matchups_df.SEASON_ID) == 5),
        (
            (matchups_df.SEASON_ID[1:]) +
            "-" +
            (str((matchups_df.SEASON_ID[3:].astype(int))+1))
        ),

        matchups_df.SEASON_ID
    )
                                                        )
                                                            )

Answer 1

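Here you need the .str methods: Series.str.len to check the length of each value, and str[1:] to take everything after the first character of each value. Because both Series passed to np.where are computed for every row, to_numeric with errors='coerce' is used for the numeric conversion, so values that don't match the expected format don't raise an error: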

matchups_df = pd.DataFrame({'SEASON_ID':['2013-14','2014-15','2015-16','2016-17',
                                         '22017','22018','22019'],
                            'col':range(7)})
    
print (matchups_df)
  SEASON_ID  col
0   2013-14    0
1   2014-15    1
2   2015-16    2
3   2016-17    3
4     22017    4
5     22018    5
6     22019    6

s = matchups_df.SEASON_ID.astype(str)

s1 = np.where(s.str.len() == 5, 
              s.str[1:] + "-" + pd.to_numeric(s.str[3:], errors='coerce')
                                  .fillna(0).astype(int).add(1).astype(str), 
              matchups_df.SEASON_ID)
matchups_df.insert(loc = 1, column = 'Season', value = s1)

print (matchups_df)
  SEASON_ID   Season  col
0   2013-14  2013-14    0
1   2014-15  2014-15    1
2   2015-16  2015-16    2
3   2016-17  2016-17    3
4     22017  2017-18    4
5     22018  2018-19    5
6     22019  2019-20    6
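To make the role of errors='coerce' concrete, here is a minimal sketch on a hypothetical two-value Series (the names s, tail and coerced are illustrative only, not from the question). It shows that a plain astype(int) fails on the tail of an already-formatted season, while to_numeric turns it into NaN instead:

```python
import pandas as pd

# Hypothetical two-value Series: one already-formatted season, one raw ID.
s = pd.Series(['2016-17', '22017'])

# Element-wise slice: everything from the fourth character on.
tail = s.str[3:]                                  # '6-17' and '17'

# errors='coerce' replaces unparseable strings with NaN instead of raising.
coerced = pd.to_numeric(tail, errors='coerce')    # NaN and 17.0

# A plain integer cast raises on '6-17'.
try:
    tail.astype(int)
    raised = False
except ValueError:
    raised = True

print(coerced.isna().tolist())
print(raised)
```

The NaN is then filled with 0 in the answer's code only so the integer cast succeeds; np.where discards those rows anyway because their length is not 5.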

Another solution, using a custom function:

def f(x):
    if len(x) == 5:
        return x[1:] + "-" + str(int(x[3:]) + 1)
    else:
        return x
s1 = matchups_df.SEASON_ID.astype(str).apply(f)

matchups_df.insert(loc = 1, column = 'Season', value = s1)
print (matchups_df)

  SEASON_ID   Season  col
0   2013-14  2013-14    0
1   2014-15  2014-15    1
2   2015-16  2015-16    2
3   2016-17  2016-17    3
4     22017  2017-18    4
5     22018  2018-19    5
6     22019  2019-20    6

Answer 2

The underlying problem here:

matchups_df.SEASON_ID[3:]

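matchups_df.SEASON_ID is the entire column (a Series). Slicing it with [3:] just drops the first three rows, whereas you want to drop the first three characters of each value. Similarly, len(matchups_df.SEASON_ID) == 5 does not depend on the cell values; it is the length of the column. So every cell from the fourth row onward ends up being processed, including strings like 2016-17.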

To get the behavior you want, pandas provides the .str accessor, as shown in @jezrael's answer.
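The difference between slicing the Series and slicing each value can be seen in a small sketch. The three-row Series below is hypothetical and exists only for illustration:

```python
import pandas as pd

# Hypothetical three-row Series, used only to illustrate the difference.
s = pd.Series(['2016-17', '22017', '22018'])

rows = s[1:]           # slices the Series itself: drops the first ROW
chars = s.str[1:]      # slices each value: drops the first CHARACTER

n_rows = len(s)        # number of rows in the Series
lengths = s.str.len()  # length of each individual string

print(rows.tolist())     # ['22017', '22018']
print(chars.tolist())    # ['016-17', '2017', '2018']
print(n_rows)            # 3
print(lengths.tolist())  # [7, 5, 5]
```

Only the .str versions look inside the individual strings, which is what the length check and the year arithmetic in the question need.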
