Changing the fill_values in a SparseDataFrame - replace throws TypeError

Question

Current pandas version: 0.22

I have a SparseDataFrame.

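import numpy as np
import pandas as pd

# Note: pd.SparseDataFrame only exists in pandas versions before 1.0.
A = pd.SparseDataFrame(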
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])

A

   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a

Right now, the fill value is 0. However, I would like to change the fill_values to np.nan. My first instinct was to call replace:

A.replace(0, np.nan)

But this gives

TypeError: cannot convert int to an sparseblock

This does not really help me understand what I am doing wrong.

I know I can do

A.to_dense().replace(0, np.nan).to_sparse()

But is there a better way? Or is my fundamental understanding of sparse dataframes flawed?

Answer 1

Here is what I tried:

pd.SparseDataFrame(np.where(A==0, np.nan, A))

     0    1    2    3
0    a  NaN  NaN    b
1  NaN  NaN  NaN    c
2  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN    a

Answer 2

tl;dr: This is definitely a bug.
But please keep reading, there is more...

All of the following work with pandas 0.20.3, but not with any newer version:

A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])

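And so on... (you get the idea).

(All code from here on was run with pandas 0.20.3.)

However, those (along with most of the workarounds I tried) only work because we accidentally did something wrong. You will guess what right away if we do this: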

A.density

1.0

This SparseDataFrame is actually dense!
We can fix this by passing default_fill_value=0:

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']], default_fill_value=0)

Now A.density outputs 0.25, as expected.

This happened because the initializer could not infer the dtypes of the columns. Quoting the pandas docs:

Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:

float64: np.nan

int64: 0

bool: False

But the dtypes of our SparseDataFrame are:

A.dtypes

0    object
1    object
2    object
3    object
dtype: object

That is why the SparseDataFrame could not decide which fill value to use, and so it fell back to the default, np.nan.
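To make the two density figures concrete, here is a minimal pure-Python sketch. It does not use pandas at all; `is_fill` and `density` are hypothetical helper names, and the sketch assumes density simply means the fraction of cells that differ from the fill value and therefore have to be stored.

```python
import math

def is_fill(value, fill_value):
    """Return True if value counts as the fill value (NaN matches NaN)."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    # Compare the type as well, so that e.g. False is not mistaken for 0.
    return type(value) is type(fill_value) and value == fill_value

def density(rows, fill_value):
    """Fraction of cells that must be stored explicitly."""
    cells = [v for row in rows for v in row]
    stored = sum(1 for v in cells if not is_fill(v, fill_value))
    return stored / len(cells)

rows = [['a', 0, 0, 'b'],
        [0, 0, 0, 'c'],
        [0, 0, 0, 0],
        [0, 0, 0, 'a']]

print(density(rows, float('nan')))  # 1.0: no cell is NaN, so every cell is stored
print(density(rows, 0))             # 0.25: only the 4 strings are stored
```

With NaN as the fill value, none of the 16 cells can be skipped, which matches the density of 1.0 seen earlier.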

OK, now we have a SparseDataFrame that really is sparse. Let's try replacing some of its entries:

A.replace('a','z')

   0  1  2  3
0  z  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  z

And strangely:

A.replace(0,np.nan)

   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a

As you can see, this is not correct!
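From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() only works on non-fill values. To change the fill value, you have the following options:

According to the pandas docs, the fill value changes automatically if you change the dtype. (This did not work for me.)
Convert to a dense DataFrame, do the replacement, then convert back to a SparseDataFrame.
Manually rebuild a new SparseDataFrame, as Answer 1 does, or by passing default_fill_value set to the new fill value.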
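The difference between replacing stored values and replacing the fill value can be illustrated with a toy model. This is only a sketch under my own assumptions, not how pandas implements sparse storage: `ToySparseColumn`, `replace_stored` and `replace_via_dense` are hypothetical names invented for this illustration.

```python
import math

def same(a, b):
    """Equality that also treats NaN as equal to NaN."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    return a == b

class ToySparseColumn:
    """Toy model: keeps non-fill values by position, plus a single fill value."""

    def __init__(self, values, fill_value):
        self.length = len(values)
        self.fill_value = fill_value
        self.stored = {i: v for i, v in enumerate(values) if not same(v, fill_value)}

    def to_dense(self):
        return [self.stored.get(i, self.fill_value) for i in range(self.length)]

    def replace_stored(self, old, new):
        """Naive replace that only visits explicitly stored values."""
        out = ToySparseColumn([], self.fill_value)
        out.length = self.length
        out.stored = {i: (new if same(v, old) else v) for i, v in self.stored.items()}
        return out

    def replace_via_dense(self, old, new, new_fill_value):
        """Densify, replace everywhere, then rebuild with a new fill value."""
        dense = [new if same(v, old) else v for v in self.to_dense()]
        return ToySparseColumn(dense, new_fill_value)

nan = float('nan')
col = ToySparseColumn(['a', 0, 0, 0], fill_value=0)

print(col.replace_stored('a', 'z').to_dense())  # ['z', 0, 0, 0]
print(col.replace_stored(0, nan).to_dense())    # ['a', 0, 0, 0]: the zeros are never visited
fixed = col.replace_via_dense(0, nan, new_fill_value=nan)
print(fixed.to_dense())                         # ['a', nan, nan, nan]
print(len(fixed.stored))                        # 1
```

In this toy model, a replace that only walks the stored values behaves like the output shown earlier: 'a' becomes 'z', but replacing 0 changes nothing. The dense round trip corresponds to the second option in the list.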

When I tried the last option, something even stranger happened:

B = pd.SparseDataFrame(A,default_fill_value=np.nan)

B.density
0.25

B.default_fill_value
nan

So far, so good. But...:

B
    0   1   2   3
0   a   0   0   b
1   0   0   0   c
2   0   0   0   0
3   0   0   0   a

At first I was really shocked. Is this even possible!?
Moving on, I tried to see what was happening in the columns:

B[0]

0    a
1    0
2    0
3    0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)

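The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
There are many more "strange" things going on, but I will stop here.
To sum up, I would say you should avoid using SparseDataFrame until it is completely rewritten :).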
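For readers unfamiliar with the BlockIndex printed for B[0], my understanding is that its block locations and block lengths describe where runs of explicitly stored values start and how long they are. The sketch below is a hypothetical helper (`block_index` is not a pandas function) that derives such runs from a plain list, assuming a fill value of 0.

```python
def block_index(values, fill_value):
    """Return (locations, lengths) of consecutive runs of non-fill values."""
    locations, lengths = [], []
    run_start = None
    for i, v in enumerate(values):
        is_fill = (v == fill_value)
        if not is_fill and run_start is None:
            run_start = i                      # a new run of stored values begins
        elif is_fill and run_start is not None:
            locations.append(run_start)        # the current run ends here
            lengths.append(i - run_start)
            run_start = None
    if run_start is not None:                  # close a run that reaches the end
        locations.append(run_start)
        lengths.append(len(values) - run_start)
    return locations, lengths

print(block_index(['a', 0, 0, 0], fill_value=0))      # ([0], [1])
print(block_index([0, 0, 0, 0], fill_value=0))        # ([], [])
print(block_index(['a', 0, 'b', 'c'], fill_value=0))  # ([0, 2], [1, 2])
```

The first call reproduces the locations [0] and lengths [1] shown for B[0]. That would be consistent with the blocks still reflecting the old fill value of 0 even though B.default_fill_value reports nan, though this is an inference from the output rather than something confirmed in the pandas source.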
