Use RSQLite to manipulate a data frame in R directly using SQL

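I have a dataset of the following form.

I want to change it to the form below using SQL in R.

I know I could do this simply with dplyr, but the point here is to learn to use SQL to create and manipulate a small relational database.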

Minimal working example

# Data to copy into sheet

       Price                            coordinates floor.size surburb       date
 R 1 750 000 -33.93082074573843, 18.857342125467635      68 m²     Jhb 2021-06-24
 R 1 250 000 -33.930077157927855, 18.85420954236195      56 m²     Jhb 2021-06-17
 R 2 520 000 -33.92954929205658, 18.857504799977896      62 m²     Jhb 2021-06-24

Code to run in R Markdown

```{r}
#install.packages("RSQLite", repos = "http://cran.us.r-project.org")

library(readxl)
library(dplyr)
library(RSQLite)
library(DBI)
library(knitr)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(connection = "db")


# Import data
dataH <- read_excel("C:/Users/Dell/Desktop/exampledata.xlsx")

``` 

```{sql, connection = db}
-- SQL code passed directly
```

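Edit 1:

@Onyambu's answer is almost there, but it produces wrong coordinates. For example, when `coordinates` is "-33.930989501123, 18.857270308516927", the last two coordinates should have a Long starting with "18.85" rather than ".85". How can I fix this?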

Using basic SQL functions, you can do:

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,connection = "db")
```

```{r}
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

txt <- "Price coordinates floor.size surburb date\n
     'R 1 750 000' '-33.93082074573843, 18.857342125467635' '68 m²' Jhb 2021-06-24\n
     'R 1 250 000' '-33.930077157927855, 18.85420954236195' '56 m²' Jhb 2021-06-17\n
     'R 2 520 000' '-33.92954929205658, 18.857504799977896' '62 m²' Jhb 2021-06-24"

dataH <- read.table(text = txt, header = TRUE) 
DBI::dbWriteTable(db, 'dataH', dataH)
```


```{sql}
SELECT REPLACE(SUBSTRING(price, 3, 100), ' ', '') price,
       replace(SUBSTRING(coordinates, 1, 20), ',', '') Lat,
       SUBSTRING(coordinates, 21, 255) Long,
       SUBSTRING(`floor.size`, 1, 2) floor_size,
       surburb,
       date
FROM dataH
```
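
The query above relies on fixed character offsets (20 and 21), which assumes the comma always sits at the same position in `coordinates`. A quick check in plain R, using the coordinate string quoted in the question's edit, suggests why Long can come out starting with `.85`. This is only a sketch of the likely cause; the variable names are illustrative.

```r
# Coordinate string quoted in the question's edit
coord <- "-33.930989501123, 18.857270308516927"

# Position of the comma; it shifts with the number of latitude digits
comma_pos <- as.integer(regexpr(",", coord, fixed = TRUE))
comma_pos

# What a fixed start position of 21 picks up for this row
fixed_long <- substr(coord, 21, 255)
fixed_long

# What a comma-relative start position picks up
# (trimws() drops the space that follows the comma)
relative_long <- trimws(substr(coord, comma_pos + 1, nchar(coord)))
relative_long
```

Because the latitude here has fewer digits than in the three sample rows, position 21 falls after the `18`, so a split relative to the comma is the more robust choice.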

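You can use `charindex` and `substr` to do what you need. I'll demonstrate using sqldf, which uses SQLite's engine under the hood. (This query is very similar to Onyambu's, but fixes a text-selection issue.)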

dat <- structure(list(Price = c("R 1 750 000", "R 1 250 000", "R 2 520 000"), coordinates = c("-33.93082074573843, 18.857342125467635", "-33.930077157927855, 18.85420954236195", "-33.92954929205658, 18.857504799977896"), floor.size = c("68 m²", "56 m²", "62 m²"), surburb = c("Jhb", "Jhb", "Jhb"), date = c("2021-06-24", "2021-06-17", "2021-06-24")), class = "data.frame", row.names = c(NA, -3L))

out <- sqldf::sqldf(
  "select cast(replace(substr(price,2,99),' ','') as real) as price,
          cast(substr(coordinates,1,charindex(',',coordinates)-1) as real) as lat,
          cast(substr(coordinates,charindex(',',coordinates)+1,99) as real) as long,
          cast(substr([floor.size],1,charindex('m',[floor.size])-1) as real) as [floor.size]
   from dat", method = "raw")

out
#     price       lat     long floor.size
# 1 1750000 -33.93082 18.85734         68
# 2 1250000 -33.93008 18.85421         56
# 3 2520000 -33.92955 18.85750         62

str(out)
# 'data.frame': 3 obs. of  4 variables:
#  $ price     : num  1750000 1250000 2520000
#  $ lat       : num  -33.9 -33.9 -33.9
#  $ long      : num  18.9 18.9 18.9
#  $ floor.size: num  68 56 62

(The number of digits shown when printing `out` is due to R's `"digits"` option; the columns are class numeric, as the `str` output shows.)
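
As a small illustration of that point, the sketch below prints one longitude value from the sample data with the default number of digits and then with more. The value `x` is just a stand-in for one cell of `out`; nothing about the stored number changes, only how many significant digits are displayed.

```r
# One longitude value from the sample data, stored as a double
x <- 18.857342125467635

# Default printing uses getOption("digits"), which is 7 unless changed
print(x)

# Same number, with more significant digits displayed
print(x, digits = 15)
```

Setting `options(digits = 15)` would have the same effect for all subsequent printing in the session.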

If you switch to `sqldf(.., method = "numeric")`, you can shorten the query and drop all the `cast(.. as ..)` calls:

out <- sqldf::sqldf(
  "select replace(substr(price,2,99),' ','') as price,
          substr(coordinates,1,charindex(',',coordinates)-1) as lat,
          substr(coordinates,charindex(',',coordinates)+1,99) as long,
          substr([floor.size],1,charindex('m',[floor.size])-1) as [floor.size]
   from dat", method = "numeric")
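
For the plain DBI/RSQLite connection used in the question, the same comma-relative idea can be sketched with SQLite's built-in `instr()` in place of `charindex`. This is a minimal sketch, not a drop-in replacement: the connection name `con`, the result name `res`, and the reduced two-column copy of the sample data are assumptions made for illustration.

```r
library(DBI)

# In-memory SQLite database, as in the question
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Reduced copy of the sample data (only the columns being split)
dat <- data.frame(
  Price       = c("R 1 750 000", "R 1 250 000", "R 2 520 000"),
  coordinates = c("-33.93082074573843, 18.857342125467635",
                  "-33.930077157927855, 18.85420954236195",
                  "-33.92954929205658, 18.857504799977896"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "dat", dat)

# Split on the comma position found by instr() rather than a fixed offset
res <- dbGetQuery(con, "
  SELECT CAST(REPLACE(SUBSTR(Price, 2), ' ', '') AS REAL)                  AS price,
         CAST(SUBSTR(coordinates, 1, INSTR(coordinates, ',') - 1) AS REAL) AS lat,
         CAST(SUBSTR(coordinates, INSTR(coordinates, ',') + 1) AS REAL)    AS long
  FROM dat
")

dbDisconnect(con)
res
```

The same `SELECT` could also go into a `{sql, connection = db}` chunk, provided the table has been written to that connection first.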