Error when converting columns of a pandas DataFrame in Python 3

I have a problem with pandas. I have a large DataFrame that contains:
Ref_id   PRICE    YEAR  MONTH BRAND
100000   '5000'  '2012' '4'   'FORD'
100001   '10000' '2015' '5'   'MERCEDES'
...

I want to convert my PRICE, YEAR, and MONTH columns to integers, but when I use .astype(int) or .apply(lambda x: int(x)) on a column, I get a ValueError. My DataFrame has 1.8 million rows.

ValueError: invalid literal for int() with base 10: 'PRICE'

I don't understand why pandas is trying to convert the column name.

Can you explain why?

Best,

C.

Try this:

In [59]: cols = 'PRICE  YEAR  MONTH'.split()

In [60]: cols
Out[60]: ['PRICE', 'YEAR', 'MONTH']

In [61]: for c in cols:
    ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
    ...:

In [62]: df
Out[62]:
   Ref_id    PRICE  YEAR  MONTH     BRAND
0  100000   5000.0  2012      4      FORD
1  100001  10000.0  2015      5  MERCEDES
2  100002      NaN  2016      6      AUDI
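
As a minimal sketch of what errors='coerce' does, the snippet below converts a small made-up Series (the sample values are assumptions, not the asker's data). Any value that cannot be parsed becomes NaN, which is also why the converted column ends up as float64 rather than int64.

```python
import pandas as pd

# Hypothetical sample: two numeric strings and one stray header value
s = pd.Series(['5000', '10000', 'PRICE'])

# Unparseable values are replaced with NaN instead of raising ValueError
converted = pd.to_numeric(s, errors='coerce')

# NaN cannot be stored in an int64 column, so the result is float64
print(converted)
print(converted.dtype)
```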

Reproducing your error:

In [65]: df
Out[65]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
2  100002  PRICE  2016      6      AUDI  # note the `PRICE` value in this row

In [66]: df['PRICE'].astype(int)
...
skipped
...
ValueError: invalid literal for int() with base 10: 'PRICE'

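Most likely your dataset contains "bad" (unexpected) values. pandas is not converting the column name: some cell in the PRICE column holds the literal string 'PRICE', as in row 2 above.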
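
With 1.8 million rows it may help to locate the offending values before cleaning them. The following is one possible way to do that, shown on a small made-up frame (the data and the names bad_mask and bad_rows are illustrative assumptions).

```python
import pandas as pd

# Hypothetical sample frame with one bad PRICE value
df = pd.DataFrame({
    'Ref_id': ['100000', '100001', '100002'],
    'PRICE': ['5000', '10000', 'PRICE'],
    'YEAR': ['2012', '2015', '2016'],
})

# True where the value cannot be parsed as a number
# (values that are already missing would be flagged as well)
bad_mask = pd.to_numeric(df['PRICE'], errors='coerce').isna()

# Rows that would make .astype(int) fail
bad_rows = df[bad_mask]
print(bad_rows)

# The distinct offending values often reveal the cause at a glance
print(df.loc[bad_mask, 'PRICE'].unique())
```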

You can use one of the following techniques to clean it up:

In [155]: df
Out[155]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
2  Ref_id  PRICE  YEAR  MONTH     BRAND
3  100002  15000  2016      5      AUDI

In [156]: df.dtypes
Out[156]:
Ref_id    object
PRICE     object
YEAR      object
MONTH     object
BRAND     object
dtype: object

In [157]: df = df.drop(df.loc[df.PRICE == 'PRICE'].index)

In [158]: df
Out[158]:
   Ref_id  PRICE  YEAR MONTH     BRAND
0  100000   5000  2012     4      FORD
1  100001  10000  2015     5  MERCEDES
3  100002  15000  2016     5      AUDI

In [159]: for c in cols:
     ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
     ...:

In [160]: df
Out[160]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
3  100002  15000  2016      5      AUDI

In [161]: df.dtypes
Out[161]:
Ref_id    object
PRICE      int64
YEAR       int64
MONTH      int64
BRAND     object
dtype: object
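
The same idea can be written as a standalone script. This is only a sketch on made-up data: it filters with a boolean mask instead of .drop(), and once the stray header row is gone it uses a strict .astype(int), which would raise again if any other bad value remained.

```python
import pandas as pd

# Hypothetical sample frame with a header row repeated inside the data
df = pd.DataFrame(
    [['100000', '5000', '2012', '4', 'FORD'],
     ['100001', '10000', '2015', '5', 'MERCEDES'],
     ['Ref_id', 'PRICE', 'YEAR', 'MONTH', 'BRAND'],  # stray header row
     ['100002', '15000', '2016', '5', 'AUDI']],
    columns=['Ref_id', 'PRICE', 'YEAR', 'MONTH', 'BRAND'],
)
cols = ['PRICE', 'YEAR', 'MONTH']

# Keep only the rows whose PRICE cell is not the header text
clean = df[df['PRICE'] != 'PRICE'].copy()

# With the bad row removed, a strict integer conversion succeeds
clean = clean.astype({c: int for c in cols})
print(clean.dtypes)
```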

Or simply:

In [164]: for c in cols:
     ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
     ...:

In [165]: df
Out[165]:
   Ref_id    PRICE    YEAR  MONTH     BRAND
0  100000   5000.0  2012.0    4.0      FORD
1  100001  10000.0  2015.0    5.0  MERCEDES
2  Ref_id      NaN     NaN    NaN     BRAND
3  100002  15000.0  2016.0    5.0      AUDI

Then call .dropna(how='any'), provided you know the original dataset contained no NaN values:

In [166]: df = df.dropna(how='any')

In [167]: df
Out[167]:
   Ref_id    PRICE    YEAR  MONTH     BRAND
0  100000   5000.0  2012.0    4.0      FORD
1  100001  10000.0  2015.0    5.0  MERCEDES
3  100002  15000.0  2016.0    5.0      AUDI
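
Note that the columns are still float64 after dropping the NaN rows. A possible follow-up, sketched here on made-up data, casts them back to integers. It also differs from the session above in two ways that are my own choices: it coerces all three columns in one call with DataFrame.apply, and it restricts dropna to those columns with subset=.

```python
import pandas as pd

# Hypothetical sample frame with a header row repeated inside the data
df = pd.DataFrame({
    'Ref_id': ['100000', '100001', 'Ref_id', '100002'],
    'PRICE': ['5000', '10000', 'PRICE', '15000'],
    'YEAR': ['2012', '2015', 'YEAR', '2016'],
    'MONTH': ['4', '5', 'MONTH', '5'],
    'BRAND': ['FORD', 'MERCEDES', 'BRAND', 'AUDI'],
})
cols = ['PRICE', 'YEAR', 'MONTH']

# Coerce all three columns at once; unparseable cells become NaN
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# Drop rows with NaN in any converted column, then cast floats to int64
result = df.dropna(subset=cols).astype({c: 'int64' for c in cols})
print(result.dtypes)
```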