Access PostgreSQL hstore keys and values in Python and create a new dataframe column for each key

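I manage a PostgreSQL database and am working on a tool that lets users access a subset of it. The database has many columns, and in addition we use a large number of hstore keys to store extra information that is specific to certain rows. A basic example: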

A          B        C        hstore   
"foo"      1        4        "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" 
"bar"      4        6        "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar"   2        8
"baz"      3        1        "Fruit"=>"apple", "Name"=>"David"

The data is routinely exported to a CSV file like this:

COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;

I read it into a Pandas dataframe in Python like this:

import pandas as pd
df = pd.read_csv('/filepath/file.csv')
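
To experiment without database access, the sample table can be approximated in memory. This is only a sketch: the CSV text below is hypothetical and assumes the export doubles the quotes inside the hstore field, which is the usual CSV escaping; the plain values in column A are likewise an assumption.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for /filepath/file.csv.
# Quotes inside the hstore field are doubled, as CSV escaping requires.
csv_text = '''A,B,C,hstore
foo,1,4,"""Fruit""=>""apple"", ""Pet""=>""dog"", ""Country""=>""Norway"""
bar,4,6,"""Pet""=>""cat"", ""Country""=>""Suriname"", ""Number""=>""5"""
foobar,2,8,
baz,3,1,"""Fruit""=>""apple"", ""Name""=>""David"""
'''

# An empty hstore field is read as NaN.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```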

I then access a subset of the data. Most rows in this subset should share a common set of hstore keys, but not necessarily all of them.

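I would like to create a separate column for each hstore key. If a key does not exist for a row, the cell should be left empty or filled with a NULL or NaN value, whichever is easiest. What is the most efficient way to do this?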

You can use .str.extractall() to extract the keys and values from the hstore column, then use .pivot() to turn the keys into column labels. Aggregate the entries for each row of the original dataframe with .groupby() and .agg(), replace the empty entries with NaN using .replace(), and finally join the result back to the original dataframe with .join():

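import numpy as np

# extractall: one row per "key"=>"value" pair; column 0 holds the key, column 1 the value
df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
             .reset_index()
             .pivot(index=['level_0', 'match'], columns=0, values=1)  # keys become column labels
             .groupby(level=0)                                        # back to one row per original row
             .agg(lambda x: ''.join(x.dropna()))
             .replace('', np.nan)                                     # keys absent from a row become NaN
       )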

Result:

          A  B  C                                               hstore   Country  Fruit   Name  Pet
0     "foo"  1  4  "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"    Norway  apple    NaN  dog
1     "bar"  4  6                  "Pet"=>"cat", "Country"=>"Suriname"  Suriname    NaN    NaN  cat
2  "foobar"  2  8                                                 None       NaN    NaN    NaN  NaN
3     "baz"  3  1                    "Fruit"=>"apple", "Name"=>"David"       NaN  apple  David  NaN
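
To see what each stage of the chain produces, the steps can be run one at a time. The following is a minimal sketch on a small hypothetical Series (the names `s`, `pairs`, `wide` and `collapsed` are illustrative only); it assumes a pandas version in which .pivot() accepts a list for `index`.

```python
import pandas as pd

# Hypothetical three-row hstore column; the middle row has no hstore data.
s = pd.Series(['"Fruit"=>"apple", "Pet"=>"dog"', None, '"Pet"=>"cat"'], name='hstore')

# Step 1: one row per key/value pair, indexed by (original row, match number).
# Column 0 holds the key and column 1 the value. Rows without data yield no matches.
pairs = s.str.extractall(r'"(.+?)"=>"(.+?)"')

# Step 2: the keys become column labels; there is still one row per match.
wide = pairs.reset_index().pivot(index=['level_0', 'match'], columns=0, values=1)

# Step 3: collapse the matches of each original row into a single row.
# A key that is absent from a row ends up as an empty string at this stage.
collapsed = wide.groupby(level=0).agg(lambda x: ''.join(x.dropna()))

print(pairs, wide, collapsed, sep='\n\n')
```

Row 1 does not appear in `collapsed` at all, because it produced no matches. The .join() in the answer aligns on the index, so such rows simply receive NaN in every new column.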

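If you want the extracted columns as a new dataframe instead of joining them back to the original, drop the .join() step and do a .reindex() instead. Rows with no hstore data produce no matches and are therefore missing from the extracted result; .reindex() restores them as all-NaN rows: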

import numpy as np

df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
             .reset_index()
             .pivot(index=['level_0', 'match'], columns=0, values=1)
             .groupby(level=0)
             .agg(lambda x: ''.join(x.dropna()))
             .replace('', np.nan)
         )
# Restore rows that had no hstore data (they produced no matches) as all-NaN rows
df_out = df_out.reindex(df.index)

Result:

print(df_out)

    Country  Fruit   Name  Pet
0    Norway  apple    NaN  dog
1  Suriname    NaN    NaN  cat
2       NaN    NaN    NaN  NaN
3       NaN  apple  David  NaN
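
As an alternative, each hstore string can be parsed into a dict and pandas left to align the keys into columns. This is a sketch rather than the method above: `hstore_to_dict` and `PAIR` are hypothetical names, and the pattern assumes that keys and values are non-empty, double-quoted and free of escaped quotes, so unquoted NULL values would be skipped.

```python
import re
import pandas as pd

# Matches "key"=>"value" pairs (same assumption as the extractall pattern).
PAIR = re.compile(r'"(.+?)"=>"(.+?)"')

def hstore_to_dict(text):
    """Parse one hstore string into a dict; missing values give an empty dict."""
    if not isinstance(text, str):
        return {}
    return dict(PAIR.findall(text))

# Hypothetical sample mirroring the table in the question.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foobar', 'baz'],
    'hstore': [
        '"Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"',
        '"Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"',
        None,
        '"Fruit"=>"apple", "Name"=>"David"',
    ],
})

# A list of dicts becomes one column per key; keys missing from a row become NaN.
expanded = pd.DataFrame(df['hstore'].map(hstore_to_dict).tolist(), index=df.index)
result = df.join(expanded)
print(result)
```

All extracted values are strings, so a key such as Number would still need an explicit conversion (for example with pd.to_numeric) if numeric values are wanted.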