PostgreSQL,R:乘以 table 的所有行来创建面板数据(时间序列)
PostgreSQL, R: Multiply all rows of table to create Panel-data (time-series)
我有一个 table buildings
有 320 万行。我需要将这个 table 扩展到 11 个不同的时期,以将其处理为 (balanced) Paneldata。这意味着对于每个物体都有 11 个不同的年份(从 2000 年到 2010 年)需要观察。句点应称为:
2000
2001
...
2009
2010
TABLE 定义
CREATE TABLE public.buildings
(
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
instdate date,
pvid integer,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.buildings
OWNER TO postgres;
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
我想使用 R 中的数据进行回归分析。
ibase_id
是gid
的数组。
ibase_dist
是一个相关数组,其中包含 gid
到对象的距离。两个数组的长度总是相同的。
数组中的gid
属于buildings
的记录,它们在centroid
周围500m半径内,对象的中心,并且有pv =TRUE(这意味着 dist
、instdate
、instp
、capac
&pvid
是 NOT NULL
)。
SELECT a.gid AS buildid, array_agg(b.gid) AS ibase_id, array_agg(round(ST_Distance(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467))::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN (SELECT * FROM buildings WHERE pv=TRUE) AS b ON ST_DWithin(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
GROUP BY a.gid
示例:
ibase_id: {3075528,409073,322311,226643,833798,322344,226609}
;
ibase_dist {290,293,398,494,411,381,384}
UPDATE buildings
SET ibase=SUM(1/s)
FROM unnest(SELECT ibasedist FROM buildings WHERE (SELECT instp
FROM buildings
WHERE gid IN unnest(ibase_id))<year) s
对于每个时期,只应考虑数组的条目,其年份早于面板数据的观察时期。 (上面的查询不起作用,因为我需要先连接数组)现在,这两个数组保存了所有年份的信息。这就是为什么我认为应该将它们添加到每个时间段,以便在扩展到面板数据之后,我计算每个记录的 ibase
(11x 320 万)。
我不需要所有的列来进行回归分析。如果它会显着提高乘法性能,我们可以坚持使用行(基本上忽略几何列):
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
解决方法
我的基本想法是创建第二个包含 11 个不同时期的 table periods
,并将此 table 与 table buildings
相乘。不确定如何实现这一点。不幸的是,我对 R 没有太多经验,也没有使用 Database Interface for R。
使用 PostgreSQL 9.5beta2,由 Visual C++ build 1800、64 位和 R x64 3.2.1 编译
面板数据集本质上是长格式的数据,每条记录重复年份作为时间列。您当前的结构采用 wide 格式。虽然 R 可以转换这个非常大的数据集,但 PostGreSQL 可以使用它的引擎在一个联合查询中将所有年份堆叠在一起,并将结果集传递给 R。请注意一些数据类型,例如几何对象和数组可能无法正确转换为R 数据类型,因此删除它们或将它们转换为 string/numeric 类型。
下面是这样一个 SQL UNION 查询,其中包含堆叠的年份。我不太确定您对 ibase_id
和 ibase_dist
或 "multiplying" 方面的意思,但是 Year
列添加了相应的 b
列。让 R 脚本通过 RPostGreSQL
模块调用它。
import("RPostgreSQL")
# CREATE CONNECTION
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = ####,
user = "username", password = "password")
strSQL <- "SELECT '2000' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500
popden, instp, b2000 As b, (1/ibase_dist) As ibase
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp = 2000
UNION
...other SELECT statements for years 2001-2010..."
# IMPORT QUERY RESULTSET INTO DATAFRAME
df <- dbGetQuery(con, strSQL)
# CLOSE CONNECTION
dbDisconnect(con)
但请确保您拥有操作大数据集所必需的 RAM。您可能需要相应地分配内存。或者,您可以迭代地将每年的 SELECT
语句附加到不断增长的数据框对象中,而不是一次加载所有内容。
# ...SAME CONNECTION SETUP AS ABOVE...
years = c('2000', '2001', '2002', '2003', '2004', '2005',
'2006', '2007', '2008', '2009', '2010')
# CREATES LIST OF YEAR DATA FRAME
dfList = lapply(years,
function(y) {
# NOTICE CONCATENATION OF Y IN SELECT STATEMENT
strSQL <- paste0("SELECT '", y, "' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500,
popden, instp, b", y, ", As b, (1/ibase_dist) As ibase,
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp =", y)
dbGetQuery(con, strSQL)
})
# APPEND LIST OF DATA FRAMES INTO ONE LARGE DATA FRAME
df <- do.call(rbind, dfList)
# REMOVE PREVIOUS LIST FOR MEMORY RESOURCES
rm(dfList)
# CLOSE CONNECTION
dbDisconnect(con)
我通过使用包含句点的临时 table t1 的交叉连接创建了 Paneldata table。
CREATE TABLE public.t1
(
period smallint
)
WITH (
OIDS=FALSE
);
CREATE TABLE paneldata AS
(SELECT *
FROM t1 CROSS JOIN
(SELECT gid,
gemname,
gembez,
krsname,
krsbez,
pv,
gr,
capac,
dist,
gemewz,
n500,
popden,
instp
FROM buildings) AS test
ORDER BY gid)
我有一个 table buildings
有 320 万行。我需要将这个 table 扩展到 11 个不同的时期,以将其处理为 (balanced) Paneldata。这意味着对于每个物体都有 11 个不同的年份(从 2000 年到 2010 年)需要观察。句点应称为:
2000
2001
...
2009
2010
TABLE 定义
CREATE TABLE public.buildings
(
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
instdate date,
pvid integer,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.buildings
OWNER TO postgres;
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
我想使用 R 中的数据进行回归分析。
ibase_id
是gid
的数组。
ibase_dist
是一个相关数组,其中包含 gid
到对象的距离。两个数组的长度总是相同的。
数组中的gid
属于buildings
的记录,它们在centroid
周围500m半径内,对象的中心,并且有pv =TRUE(这意味着 dist
、instdate
、instp
、capac
&pvid
是 NOT NULL
)。
SELECT a.gid AS buildid, array_agg(b.gid) AS ibase_id, array_agg(round(ST_Distance(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467))::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN (SELECT * FROM buildings WHERE pv=TRUE) AS b ON ST_DWithin(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
GROUP BY a.gid
示例:
ibase_id: {3075528,409073,322311,226643,833798,322344,226609}
;
ibase_dist {290,293,398,494,411,381,384}
UPDATE buildings
SET ibase=SUM(1/s)
FROM unnest(SELECT ibasedist FROM buildings WHERE (SELECT instp
FROM buildings
WHERE gid IN unnest(ibase_id))<year) s
对于每个时期,只应考虑数组的条目,其年份早于面板数据的观察时期。 (上面的查询不起作用,因为我需要先连接数组)现在,这两个数组保存了所有年份的信息。这就是为什么我认为应该将它们添加到每个时间段,以便在扩展到面板数据之后,我计算每个记录的 ibase
(11x 320 万)。
我不需要所有的列来进行回归分析。如果它会显着提高乘法性能,我们可以坚持使用行(基本上忽略几何列):
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
解决方法
我的基本想法是创建第二个包含 11 个不同时期的 table periods
,并将此 table 与 table buildings
相乘。不确定如何实现这一点。不幸的是,我对 R 没有太多经验,也没有使用 Database Interface for R。
使用 PostgreSQL 9.5beta2,由 Visual C++ build 1800、64 位和 R x64 3.2.1 编译
面板数据集本质上是长格式的数据,每条记录重复年份作为时间列。您当前的结构采用 wide 格式。虽然 R 可以转换这个非常大的数据集,但 PostGreSQL 可以使用它的引擎在一个联合查询中将所有年份堆叠在一起,并将结果集传递给 R。请注意一些数据类型,例如几何对象和数组可能无法正确转换为R 数据类型,因此删除它们或将它们转换为 string/numeric 类型。
下面是这样一个 SQL UNION 查询,其中包含堆叠的年份。我不太确定您对 ibase_id
和 ibase_dist
或 "multiplying" 方面的意思,但是 Year
列添加了相应的 b
列。让 R 脚本通过 RPostGreSQL
模块调用它。
import("RPostgreSQL")
# CREATE CONNECTION
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = ####,
user = "username", password = "password")
strSQL <- "SELECT '2000' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500
popden, instp, b2000 As b, (1/ibase_dist) As ibase
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp = 2000
UNION
...other SELECT statements for years 2001-2010..."
# IMPORT QUERY RESULTSET INTO DATAFRAME
df <- dbGetQuery(con, strSQL)
# CLOSE CONNECTION
dbDisconnect(con)
但请确保您拥有操作大数据集所必需的 RAM。您可能需要相应地分配内存。或者,您可以迭代地将每年的 SELECT
语句附加到不断增长的数据框对象中,而不是一次加载所有内容。
# ...SAME CONNECTION SETUP AS ABOVE...
years = c('2000', '2001', '2002', '2003', '2004', '2005',
'2006', '2007', '2008', '2009', '2010')
# CREATES LIST OF YEAR DATA FRAME
dfList = lapply(years,
function(y) {
# NOTICE CONCATENATION OF Y IN SELECT STATEMENT
strSQL <- paste0("SELECT '", y, "' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500,
popden, instp, b", y, ", As b, (1/ibase_dist) As ibase,
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp =", y)
dbGetQuery(con, strSQL)
})
# APPEND LIST OF DATA FRAMES INTO ONE LARGE DATA FRAME
df <- do.call(rbind, dfList)
# REMOVE PREVIOUS LIST FOR MEMORY RESOURCES
rm(dfList)
# CLOSE CONNECTION
dbDisconnect(con)
我通过使用包含句点的临时 table t1 的交叉连接创建了 Paneldata table。
CREATE TABLE public.t1
(
period smallint
)
WITH (
OIDS=FALSE
);
CREATE TABLE paneldata AS
(SELECT *
FROM t1 CROSS JOIN
(SELECT gid,
gemname,
gembez,
krsname,
krsbez,
pv,
gr,
capac,
dist,
gemewz,
n500,
popden,
instp
FROM buildings) AS test
ORDER BY gid)