Access PostgreSQL hstore keys and values in Python and create a new dataframe column for each key
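I manage a PostgreSQL database and am working on a tool that gives users access to a subset of it. The database has many columns, and in addition we use a large number of hstore keys to store extra information that is specific to certain rows. A basic example is below: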
A B C hstore
"foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"
"bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar" 2 8
"baz" 3 1 "Fruit"=>"apple", "Name"=>"David"
The data is routinely exported to a CSV file, like this:
COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;
I read it into a Pandas dataframe in Python, like this:
import pandas as pd
df = pd.read_csv('/filepath/file.csv')
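I then access a subset of the data. This subset should have a common set of hstore keys in most rows, though not necessarily in all of them.
I want to create a separate column for each hstore key. Where a key does not exist in a row, the cell should be left empty or filled with a NULL or NaN value, whichever is easiest. What is the most efficient way to do this?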
You can use .str.extractall() to extract the keys and values from the hstore column, then use .pivot() to turn the keys into column labels. Aggregate the entries for each row of the original dataframe with .groupby() and .agg(), and set empty entries to NaN with .replace(). Finally, join the result back to the original dataframe with .join():
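import numpy as np

# Extract key/value pairs, pivot keys into columns, then collapse to one row per original row
df.join(df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
                    .reset_index()
                    .pivot(index=['level_0', 'match'], columns=0, values=1)
                    .groupby(level=0)
                    .agg(lambda x: ''.join(x.dropna()))
                    .replace('', np.nan)
        )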
Result (for sample data without the "Number"=>"5" pair in the second row):
A B C hstore Country Fruit Name Pet
0 "foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" Norway apple NaN dog
1 "bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname" Suriname NaN NaN cat
2 "foobar" 2 8 None NaN NaN NaN NaN
3 "baz" 3 1 "Fruit"=>"apple", "Name"=>"David" NaN apple David NaN
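To see what each step of the chain produces, here is a sketch that splits it into named intermediates. The sample dataframe is built inline from the question's table (an assumption standing in for the CSV file), and the names pairs, long_form, wide and per_row are illustrative only. Passing a list to index= in .pivot() needs a reasonably recent pandas (1.1 or later, as far as I know).

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question's table (stands in for the CSV).
df = pd.DataFrame({
    "A": ["foo", "bar", "foobar", "baz"],
    "B": [1, 4, 2, 3],
    "C": [4, 6, 8, 1],
    "hstore": [
        '"Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"',
        '"Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"',
        None,
        '"Fruit"=>"apple", "Name"=>"David"',
    ],
})

# Step 1: one row per key/value pair, indexed by (original row, match number).
# Column 0 holds the key, column 1 the value. Rows with no hstore are skipped.
pairs = df["hstore"].str.extractall(r'"(.+?)"=>"(.+?)"')

# Step 2: move the index levels into ordinary columns named level_0 and match.
long_form = pairs.reset_index()

# Step 3: keys become column labels; still one row per (original row, match).
wide = long_form.pivot(index=["level_0", "match"], columns=0, values=1)

# Step 4: collapse to one row per original row, then turn '' into NaN.
per_row = (wide.groupby(level=0)
               .agg(lambda s: "".join(s.dropna()))
               .replace("", np.nan))

# Step 5: left join on the index, so rows without any keys get NaN everywhere.
result = df.join(per_row)
print(result)
```

With this sample, which keeps the "Number"=>"5" pair, the result also gains a Number column.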
If you want the extracted columns as a new dataframe rather than joined back to the original, drop the .join() step and call .reindex() instead, as follows:
import numpy as np

df_out = (df['hstore'].str.extractall(r'\"(.+?)\"=>\"(.+?)\"')
                      .reset_index()
                      .pivot(index=['level_0', 'match'], columns=0, values=1)
                      .groupby(level=0)
                      .agg(lambda x: ''.join(x.dropna()))
                      .replace('', np.nan)
         )
# Restore rows that had no hstore entries (they come back as all-NaN)
df_out = df_out.reindex(df.index)
Result:
print(df_out)
Country Fruit Name Pet
0 Norway apple NaN dog
1 Suriname NaN NaN cat
2 NaN NaN NaN NaN
3 NaN apple David NaN
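As a possible alternative, not part of the approach above, each hstore string can be parsed into a dict and the dicts expanded into columns. The sketch below assumes the text form "key"=>"value" with double-quoted keys and values; the PAIR pattern and the parse_hstore helper are hypothetical names, and backslash escapes are matched but not unescaped. Unlike (.+?), this pattern also accepts empty values and an unquoted NULL.

```python
import re
import pandas as pd

# Assumed pattern: quoted key, then => and either a quoted value or bare NULL.
PAIR = re.compile(r'"((?:[^"\\]|\\.)*)"\s*=>\s*(?:"((?:[^"\\]|\\.)*)"|NULL)')

def parse_hstore(text):
    """Return a dict for one hstore string; non-string input (NaN/None) gives {}."""
    if not isinstance(text, str):
        return {}
    # group(2) is None when the value was the bare word NULL.
    return {m.group(1): m.group(2) for m in PAIR.finditer(text)}

# Hypothetical sample mirroring the question's table.
df = pd.DataFrame({
    "A": ["foo", "bar", "foobar", "baz"],
    "hstore": [
        '"Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"',
        '"Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"',
        None,
        '"Fruit"=>"apple", "Name"=>"David"',
    ],
})

# One dict per row; keys missing from a row become NaN in the new dataframe.
keys_df = pd.DataFrame([parse_hstore(v) for v in df["hstore"]], index=df.index)
result = df.join(keys_df)
print(result)
```

Whether this is faster than the extractall/pivot chain depends on the data, so it is worth timing both on a realistic subset.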