Hive CSV cannot be imported in Pandas
I am dumping some Hive output to a CSV file using the fairly standard command:
beeline -f my_script.hql --output_format=csv2 > data.csv
But this file does not seem to be a proper CSV:
- Unix tools do not recognize it as text:
$ file data.csv
data.csv: data
- pandas in Python cannot read it:
>>> import pandas as pd
>>> pd.read_csv("data.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/estergiadis/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/estergiadis/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 463, in _read
data = parser.read(nrows)
File "/Users/estergiadis/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
ret = self._engine.read(nrows)
File "/Users/estergiadis/opt/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 896, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 7
The problem seems to be related to the quote character (it is not the " used in standard CSV).
How can I fix this?
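Hive uses the \0 (NUL) character as the quote character; you can replace it with '"'. Any double quotes already in the data should be doubled first so they stay escaped inside the newly quoted fields.
For example: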
cat data.csv | sed 's/"/""/g' | tr '\0' '"' > fixed_data.csv
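The same two-step substitution can also be sketched in Python, which may be convenient if the file is loaded from a script anyway. This is only a minimal sketch: the helper name fix_hive_csv and the sample bytes are hypothetical, and it assumes the file is UTF-8 and small enough to hold in memory.

```python
import csv
import io


def fix_hive_csv(raw: bytes) -> bytes:
    """Mirror the sed/tr pipeline on raw bytes (hypothetical helper)."""
    # Step 1: double any literal double quotes so they stay escaped.
    escaped = raw.replace(b'"', b'""')
    # Step 2: turn the NUL quote characters into standard double quotes.
    return escaped.replace(b"\x00", b'"')


# Made-up sample: one field is wrapped in NUL bytes and contains a comma.
sample = b'id,comment\n1,\x00hello, "world"\x00\n'
fixed = fix_hive_csv(sample)

# Parse the repaired text with the standard csv module to check the result.
rows = list(csv.reader(io.StringIO(fixed.decode("utf-8"))))
print(rows)
```

For a real file, the bytes would come from open("data.csv", "rb").read(), and the repaired bytes can be wrapped in io.BytesIO and passed to pd.read_csv.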