pyspark toPandas() IndexError: index is out of bounds
pyspark toPandas() IndexError: index is out of bounds
我遇到了来自 Jupyt 的 pyspark 的 .toPandas() 方法 运行 的奇怪行为。例如,如果我尝试这样做:
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
{"Category": 'Category B', "ID": 2, "Value": 30.10},
{"Category": 'Category C', "ID": 3, "Value": 100.01}
]
# Create data frame (where spark is a SparkSession)
df = spark.createDataFrame(data)
df.show()
我能够成功创建 pyspark 数据框。但是,当转换为 pandas 我得到 IndexError: index is out of bounds:
IndexError Traceback (most recent call last)
<path_to_python>/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
400 if cls is not object \
401 and callable(cls.__dict__.get('__repr__')):
--> 402 return _repr_pprint(obj, self, cycle)
403
404 return _default_pprint(obj, self, cycle)
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
695 """A pprint that just redirects to the normal repr function."""
696 # Find newlines and replace them with p.break_()
--> 697 output = repr(obj)
698 for idx,output_line in enumerate(output.splitlines()):
699 if idx:
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __repr__(self)
76 Yields Bytestring in Py2, Unicode String in py3.
77 """
---> 78 return str(self)
79
80
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __str__(self)
55
56 if compat.PY3:
---> 57 return self.__unicode__()
58 return self.__bytes__()
59
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in __unicode__(self)
632 width = None
633 self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
--> 634 line_width=width, show_dimensions=show_dimensions)
635
636 return buf.getvalue()
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width)
719 decimal=decimal,
720 line_width=line_width)
--> 721 formatter.to_string()
722
723 if buf is None:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in to_string(self)
596 else:
597
--> 598 strcols = self._to_str_columns()
599 if self.line_width is None: # no need to wrap around just print
600 # the whole frame
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _to_str_columns(self)
527 str_columns = [[label] for label in self.header]
528 else:
--> 529 str_columns = self._get_formatted_column_labels(frame)
530
531 stringified = []
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatted_column_labels(self, frame)
770 need_leadsp[x] else x]
771 for i, (col, x) in enumerate(zip(columns,
--> 772 fmt_columns))]
773
774 if self.show_row_idx_names:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in <listcomp>(.0)
769 str_columns = [[' ' + x if not self._get_formatter(i) and
770 need_leadsp[x] else x]
--> 771 for i, (col, x) in enumerate(zip(columns,
772 fmt_columns))]
773
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatter(self, i)
362 else:
363 if is_integer(i) and i not in self.columns:
--> 364 i = self.columns[i]
365 return self.formatters.get(i, None)
366
<path_to_python>/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3956 if is_scalar(key):
3957 key = com.cast_scalar_indexer(key)
-> 3958 return getitem(key)
3959
3960 if isinstance(key, slice):
IndexError: index 3 is out of bounds for axis 0 with size 3
我不确定问题出在哪里,我已经使用了很多次都没有问题,但是这次我尝试了一个新环境并且遇到了这个问题。如果它可以帮助我的配置是:
Python: 3.7.6;
Pandas: 0.24.2;
PySpark: 2.4.5
有什么想法吗?
谢谢:)
我发现了问题。试图最小化代码以重现错误我忽略了我正在添加 pandas 设置:
pd.set_option('display.max_columns', -1)
这导致了与正在转换的数据帧无关的错误。要修复它,我只是指定了正数的列或 None.
我遇到了来自 Jupyt 的 pyspark 的 .toPandas() 方法 运行 的奇怪行为。例如,如果我尝试这样做:
data = [{"Category": 'Category A', "ID": 1, "Value": 12.40},
{"Category": 'Category B', "ID": 2, "Value": 30.10},
{"Category": 'Category C', "ID": 3, "Value": 100.01}
]
# Create data frame (where spark is a SparkSession)
df = spark.createDataFrame(data)
df.show()
我能够成功创建 pyspark 数据框。但是,当转换为 pandas 我得到 IndexError: index is out of bounds:
IndexError Traceback (most recent call last)
<path_to_python>/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
400 if cls is not object \
401 and callable(cls.__dict__.get('__repr__')):
--> 402 return _repr_pprint(obj, self, cycle)
403
404 return _default_pprint(obj, self, cycle)
<path_to_python>/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
695 """A pprint that just redirects to the normal repr function."""
696 # Find newlines and replace them with p.break_()
--> 697 output = repr(obj)
698 for idx,output_line in enumerate(output.splitlines()):
699 if idx:
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __repr__(self)
76 Yields Bytestring in Py2, Unicode String in py3.
77 """
---> 78 return str(self)
79
80
<path_to_python>/lib/python3.7/site-packages/pandas/core/base.py in __str__(self)
55
56 if compat.PY3:
---> 57 return self.__unicode__()
58 return self.__bytes__()
59
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in __unicode__(self)
632 width = None
633 self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
--> 634 line_width=width, show_dimensions=show_dimensions)
635
636 return buf.getvalue()
<path_to_python>/lib/python3.7/site-packages/pandas/core/frame.py in to_string(self, buf, columns, col_space, header, index, na_rep, formatters, float_format, sparsify, index_names, justify, max_rows, max_cols, show_dimensions, decimal, line_width)
719 decimal=decimal,
720 line_width=line_width)
--> 721 formatter.to_string()
722
723 if buf is None:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in to_string(self)
596 else:
597
--> 598 strcols = self._to_str_columns()
599 if self.line_width is None: # no need to wrap around just print
600 # the whole frame
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _to_str_columns(self)
527 str_columns = [[label] for label in self.header]
528 else:
--> 529 str_columns = self._get_formatted_column_labels(frame)
530
531 stringified = []
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatted_column_labels(self, frame)
770 need_leadsp[x] else x]
771 for i, (col, x) in enumerate(zip(columns,
--> 772 fmt_columns))]
773
774 if self.show_row_idx_names:
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in <listcomp>(.0)
769 str_columns = [[' ' + x if not self._get_formatter(i) and
770 need_leadsp[x] else x]
--> 771 for i, (col, x) in enumerate(zip(columns,
772 fmt_columns))]
773
<path_to_python>/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatter(self, i)
362 else:
363 if is_integer(i) and i not in self.columns:
--> 364 i = self.columns[i]
365 return self.formatters.get(i, None)
366
<path_to_python>/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3956 if is_scalar(key):
3957 key = com.cast_scalar_indexer(key)
-> 3958 return getitem(key)
3959
3960 if isinstance(key, slice):
IndexError: index 3 is out of bounds for axis 0 with size 3
我不确定问题出在哪里,我已经使用了很多次都没有问题,但是这次我尝试了一个新环境并且遇到了这个问题。如果它可以帮助我的配置是:
Python: 3.7.6; Pandas: 0.24.2; PySpark: 2.4.5
有什么想法吗? 谢谢:)
我发现了问题。试图最小化代码以重现错误我忽略了我正在添加 pandas 设置:
pd.set_option('display.max_columns', -1)
这导致了与正在转换的数据帧无关的错误。要修复它,我只是指定了正数的列或 None.