Blank rows occurring despite use of NOT NULL and <> ''

Question

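I am trying to remove all empty/blank cells from my table. However, after filtering I still have some blank cells left, even though I tried to remove them with the methods mentioned in the title.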

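I have tried NOT NULL and <> '', and I have also tried > 0. None of them seem to remove the blank cells. I am not sure what other kind of value it could be. The columns are varchar, so it is hard to tell what they actually contain.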

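It seems nobody has run into this, because I have not been able to find a similar article or question. The table is an incredible mess, with obvious inconsistencies all over the place.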

The statement I am using:

SELECT * FROM table WHERE column is NOT NULL AND column <> ''

Ideally, all blank cells would be gone, so that I can be sure my Pandas df is accurate.

My Python code found about 2,000 "null" entries in the table:

import numpy as np


def enumerate_null_data(df):
    # isnull/isna does not flag empty strings as missing, so convert ''
    # (and None) to np.nan, which pandas handles consistently
    df['rfid_sent'] = df['rfid_sent'].replace(['', None], np.nan)
    df['rfid_received'] = df['rfid_received'].replace(['', None], np.nan)

    # dataframes that no longer contain the null values
    sent_null_removed = df.dropna(subset=['rfid_sent'])
    received_null_removed = df.dropna(subset=['rfid_received'])

    # count the rows that were dropped from sent_null_removed/received_null_removed
    num_sent_null_removed = len(df[~df.index.isin(sent_null_removed.index)].index)
    num_received_null_removed = len(df[~df.index.isin(received_null_removed.index)].index)

    # dataframe containing only the rows where either value was null/NA
    na_only = df[~df.index.isin(sent_null_removed.index) | ~df.index.isin(received_null_removed.index)]

    return (na_only, num_sent_null_removed, num_received_null_removed)

Honestly, I don't know what else to try. Am I missing some "Empty" format here? Pandas identifies the blank cells as: '', Empty, None, and np.nan. Yes, quite the variety. :S

Answer 1

The column may contain spaces or other whitespace characters. Try it like this:

WHERE col IS NOT NULL AND NOT col ~ '^\s*$'
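To see why the original filter lets these rows through, here is a minimal Python sketch of the same idea. It uses Python's re module as a stand-in for the PostgreSQL ~ operator, and the two engines may not agree on exactly which characters \s covers. The sample values are made up for illustration and are not taken from the asker's table.

```python
import re

# Same pattern as in the WHERE clause: empty or whitespace-only strings.
BLANK = re.compile(r"^\s*$")


def is_blank(value):
    """Return True for None, '' and strings made up only of whitespace."""
    return value is None or BLANK.match(value) is not None


# Hypothetical sample values (assumption, not real data from the table).
samples = [None, "", " ", "\t", "  \n", "A1B2"]

# What "IS NOT NULL AND <> ''" keeps: whitespace-only strings survive.
survives_simple_filter = [v for v in samples if v is not None and v != ""]

# What the regex-based filter keeps: only the value with real content.
kept = [v for v in samples if not is_blank(v)]

print(survives_simple_filter)
print(kept)
```

Printing a suspicious value with repr() shows any hidden whitespace characters, which makes it easier to tell what a "blank" cell really contains.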

Answer 2

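The solution to this problem came down to pattern matching with regular expressions. In particular, I tried a large number of approaches to filter out every form of erroneous data that was dirtying the table. Unfortunately, none of them turned out to be that simple.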

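However, the data (that I am anticipating/want to keep) is highly consistent in structure, so I simply used RegEx to filter out any entry that does not match what I expect.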
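A minimal sketch of that allowlist approach, in Python. The answer does not state the actual format of the values, so the pattern below (exactly 8 uppercase hexadecimal characters) is purely an assumption for illustration, as are the sample rows. Only the column names rfid_sent and rfid_received come from the question.

```python
import re

# Hypothetical expected format (assumption): 8 uppercase hex characters.
EXPECTED = re.compile(r"[0-9A-F]{8}")


def is_valid(value):
    """Return True only if the whole value matches the expected format."""
    return isinstance(value, str) and EXPECTED.fullmatch(value) is not None


# Made-up rows for illustration.
rows = [
    {"rfid_sent": "0A1B2C3D", "rfid_received": "FFEE0011"},
    {"rfid_sent": "   ", "rfid_received": "FFEE0011"},
    {"rfid_sent": "0A1B2C3D", "rfid_received": None},
    {"rfid_sent": "", "rfid_received": ""},
    {"rfid_sent": "not-an-id", "rfid_received": "FFEE0011"},
]

# Keep a row only when both columns match the expected format.
clean = [
    row for row in rows
    if is_valid(row["rfid_sent"]) and is_valid(row["rfid_received"])
]

print(len(clean))
```

Matching on what valid data looks like, rather than listing every kind of blank, means NULLs, empty strings, whitespace and any other junk are rejected by the same check.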

postgresql

pandas

blank-line