How to keep null values from being stored in HBase with pandas in Python?
I have some sample data as follows:
   test_a  test_b  test_c  test_d  test_date
-------------------------------------------------
1  a       500     0.1     111     20191101
2  a       NaN     0.2     NaN     20191102
3  a       200     0.1     111     20191103
4  a       400     NaN     222     20191104
5  a       NaN     0.2     333     20191105
I want to store this data in HBase, and I use the code below to do it.
from test.db import impala, hbasecon, HiveClient
import pandas as pd

sql = """
SELECT test_a
      ,test_b
      ,test_c
      ,test_d
      ,test_date
FROM table_test
"""
conn_impa = HiveClient().getcon()
all_df = pd.read_sql(sql=sql, con=conn_impa, chunksize=50000)
num = 0
for df in all_df:
    # Every NaN becomes '' here, so empty strings are what get written below.
    df = df.fillna('')
    df["s"] = df["test_d"] + df["test_date"]
    tmp_num = len(df)
    if len(df) > 0:
        # hintltable is the HBase table handle (its creation is not shown).
        with hintltable.batch(batch_size=1000) as b:
            df.apply(lambda row: b.put(row["k"], {
                'test:test_a': str(row["test_a"]),
                'test:test_b': str(row["test_b"]),
                'test:test_c': str(row["test_c"]),
            }), axis=1)
    num += len(df)
When I query HBase with get 'test', 'a201911012', I get the following result:
COLUMN CELL
test:test_a timestamp=1578389750838, value=a
test:test_b timestamp=1578389788675, value=
test:test_c timestamp=1578389775471, value=0.2
test:test_d timestamp=1578449081388, value=
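How can I make sure that null values are not stored in HBase when writing from pandas in Python? We do not want null or empty-string values. Our expected result is: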
COLUMN CELL
test:test_a timestamp=1578389750838, value=a
test:test_c timestamp=1578389775471, value=0.2
You should be able to do this by creating a custom function and calling it from your lambda. For example, you could have a function like this:
def makeEntry(a, b, c):
    entrydict = {}
    # NaN == NaN evaluates to False and empty strings are falsy,
    # so each check below skips both NaN and ''.
    # Note: a numeric 0 is falsy too and would also be skipped.
    if a == a and a:
        entrydict["test:test_a"] = str(a)
    if b == b and b:
        entrydict["test:test_b"] = str(b)
    if c == c and c:
        entrydict["test:test_c"] = str(c)
    return entrydict
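To make the behavior of that check visible, here is a small self-contained sketch that applies the same `value == value and value` test to a handful of sample values. It is only an illustration; the helper name `keeps` is made up for this example and is not part of the code above.

```python
def keeps(value):
    """Return True when the `value == value and value` check would keep the value."""
    # NaN is not equal to itself, so the first half rejects it.
    # '', None and 0 are falsy, so the second half rejects them.
    return bool(value == value and value)

samples = ["a", 500, 0.2, float("nan"), "", None, 0]
results = [keeps(v) for v in samples]
print(results)  # [True, True, True, False, False, False, False]
```

The last sample is the one to watch: if a legitimate 0 can appear in your columns, this check drops it along with the missing values.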
Then you can change the apply call to:
df.apply(lambda row: b.put(row["k"],
         makeEntry(row["test_a"], row["test_b"], row["test_c"])), axis=1)
This way you only write the values that are not NaN, rather than all values.
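If you have more than a few columns, or zeros that must be kept, a variant along the following lines may be easier to maintain. This is a sketch under assumptions, not a drop-in replacement: `make_entry` and `FakeBatch` are hypothetical names, `FakeBatch` only imitates the `put(row_key, data)` call used in the question so the example runs without HBase, and the sample frame is invented. It uses `pd.isna` to detect None and NaN and an explicit string test for the `''` left behind by `fillna('')`.

```python
import pandas as pd

def make_entry(row, columns, family="test"):
    """Build the dict for one put, skipping missing and empty values."""
    entry = {}
    for col in columns:
        value = row[col]
        # pd.isna covers None and NaN; the string test covers ''.
        if pd.isna(value) or (isinstance(value, str) and value == ""):
            continue
        entry[f"{family}:{col}"] = str(value)
    return entry

class FakeBatch:
    """Hypothetical stand-in for an HBase batch; it only records the puts."""
    def __init__(self):
        self.puts = []

    def put(self, row_key, data):
        self.puts.append((row_key, data))

# Invented sample data; "k" plays the role of the row key column.
df = pd.DataFrame({
    "k": ["row1", "row2", "row3"],
    "test_a": ["a", "a", "a"],
    "test_b": [500, None, 0],
    "test_c": [0.1, 0.2, None],
})

batch = FakeBatch()
for _, row in df.iterrows():
    entry = make_entry(row, ["test_a", "test_b", "test_c"])
    if entry:  # skip rows where every value is missing
        batch.put(row["k"], entry)

for row_key, data in batch.puts:
    print(row_key, data)
```

With this version you would drop the `df.fillna('')` step or keep it, since both NaN and `''` are skipped either way, and a 0 in `test_b` is still written.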