Broken Pipe error when uploading data to Apache HBase
I am currently trying to load a large CSV file into Apache HBase. The CSV is 50,000 columns wide and 15,000 rows long; its values are just integers.
The HBase cluster runs on AWS EMR, with ample memory (244 GB) and compute (4 nodes, 32 cores each).
I am trying to load the data into the database with this Python script:
import happybase
import pandas as pd

connection = happybase.Connection('localhost')
familes = {
    's': dict(in_memory=True)
}
#connection.delete_table('exon', disable=True)
connection.create_table('exon', familes)
table = connection.table('exon')

df = pd.read_csv('exon.csv', nrows=1000)
col = list(df)
col = col[1:]

for index, row in df.iterrows():
    to_put = {}
    for col_name in col:
        to_put[('s:' + col_name).encode('utf-8')] = str(row[col_name]).encode('utf-8')
    print('putting: ' + str(row[0]))
    table.put(row[0].encode('utf-8'), to_put)
When the script reads only the first few rows, it runs without issue:
df = pd.read_csv('exon.csv', nrows=20)
However, reading more rows causes an error:
df = pd.read_csv('exon.csv', nrows=1000)
putting: F1S4_160106_001_B01
Traceback (most recent call last):
File "load.py", line 25, in <module>
table.put(row[0].encode('utf-8'), to_put)
File "/usr/local/lib/python3.6/site-packages/happybase/table.py", line 464, in put
batch.put(row, data)
File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 137, in __exit__
self.send()
File "/usr/local/lib/python3.6/site-packages/happybase/batch.py", line 60, in send
self._table.connection.client.mutateRows(self._table.name, bms, {})
File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 200, in _req
self._send(_api, **kwargs)
File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 210, in _send
args.write(self._oprot)
File "/usr/local/lib64/python3.6/site-packages/thriftpy2/thrift.py", line 153, in write
oprot.write_struct(self)
File "thriftpy2/protocol/cybin/cybin.pyx", line 477, in cybin.TCyBinaryProtocol.write_struct
File "thriftpy2/protocol/cybin/cybin.pyx", line 474, in cybin.TCyBinaryProtocol.write_struct
File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
File "thriftpy2/protocol/cybin/cybin.pyx", line 212, in cybin.write_struct
File "thriftpy2/protocol/cybin/cybin.pyx", line 356, in cybin.c_write_val
File "thriftpy2/protocol/cybin/cybin.pyx", line 115, in cybin.write_list
File "thriftpy2/protocol/cybin/cybin.pyx", line 362, in cybin.c_write_val
File "thriftpy2/protocol/cybin/cybin.pyx", line 209, in cybin.write_struct
File "thriftpy2/protocol/cybin/cybin.pyx", line 71, in cybin.write_i08
File "thriftpy2/transport/buffered/cybuffered.pyx", line 55, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_write
File "thriftpy2/transport/buffered/cybuffered.pyx", line 80, in thriftpy2.transport.buffered.cybuffered.TCyBufferedTransport.c_flush
File "/usr/local/lib64/python3.6/site-packages/thriftpy2/transport/socket.py", line 136, in write
self.sock.sendall(buff)
BrokenPipeError: [Errno 32] Broken pipe
Am I inserting too much data at once? I have also tried batched puts, and the same problem occurred.
Found my mistake: because I called pandas.read_csv after opening the HappyBase connection, the connection timed out while the CSV was being parsed. Calling read_csv before opening the connection solved the problem.
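The reordered script can be sketched as below. `row_to_mutation` and `load_csv` are hypothetical helper names I introduce for illustration, and `batch_size=100` is an arbitrary choice; the point is the ordering: parse the CSV first, open the Thrift connection second, and send puts in batches rather than one table.put per row.

```python
import pandas as pd

def row_to_mutation(row, cols, family='s'):
    """Build the {b'family:qualifier': b'value'} dict HappyBase expects for one row."""
    return {(family + ':' + c).encode('utf-8'): str(row[c]).encode('utf-8')
            for c in cols}

def load_csv(path, host='localhost', table_name='exon', batch_size=100):
    # Read and parse the CSV *before* opening the connection; parsing a
    # 50,000-column file can outlast the Thrift server's idle timeout,
    # which is what produced the Broken Pipe error here.
    df = pd.read_csv(path)
    cols = list(df)[1:]  # skip the row-key column

    import happybase  # imported here; this part needs a running Thrift server
    connection = happybase.Connection(host)
    table = connection.table(table_name)
    # table.batch() flushes every batch_size puts, reducing round trips.
    with table.batch(batch_size=batch_size) as batch:
        for _, row in df.iterrows():
            batch.put(str(row.iloc[0]).encode('utf-8'),
                      row_to_mutation(row, cols))
```

Even with the reordering, batching is worth keeping for a table this wide, since each row already carries ~50,000 cells.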