Use RSQLite to manipulate a data frame in R directly using SQL
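I have a dataset in the form of a table, and I want to use SQL in R to change it into the form described below. I know I could do this very simply with `dplyr`, but the point here is to learn to use SQL to create and manipulate a small relational database.
`Price` needs to be converted to a numeric value: remove the "R" and the spaces in between.
`coordinates` needs to be split into two coordinates, `Long` and `Lat`.
`floor.size` needs to be converted from a string to a number: remove the trailing space and "m²".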
Minimal working example
# Data to copy into sheet
Price coordinates floor.size surburb date
R 1 750 000 -33.93082074573843, 18.857342125467635 68 m² Jhb 2021-06-24
R 1 250 000 -33.930077157927855, 18.85420954236195 56 m² Jhb 2021-06-17
R 2 520 000 -33.92954929205658, 18.857504799977896 62 m² Jhb 2021-06-24
Code to run in R Markdown
```{r}
#install.packages("RSQLite", repos = "http://cran.us.r-project.org")
library(readxl)
library(dplyr)
library(RSQLite)
library(DBI)
library(knitr)
db <- dbConnect(RSQLite::SQLite(), ":memory:")
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(connection = "db")
# Import data
dataH <- read_excel("C:/Users/Dell/Desktop/exampledata.xlsx")
```
```{sql, connection = db}
-- SQL code passed directly
```
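Edit 1:
@Onyambu's answer almost works, but it produces errors in the coordinates. For example, in the image below, when `coordinates` is "-33.930989501123, 18.857270308516927", the last two coordinates should have a `Long` starting with "18.85" rather than ".85". How do I fix this?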
Using basic SQL functions, you could do:
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,connection = "db")
```
```{r}
db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
txt <- "Price coordinates floor.size surburb date\n
'R 1 750 000' '-33.93082074573843, 18.857342125467635' '68 m²' Jhb 2021-06-24\n
'R 1 250 000' '-33.930077157927855, 18.85420954236195' '56 m²' Jhb 2021-06-17\n
'R 2 520 000' '-33.92954929205658, 18.857504799977896' '62 m²' Jhb 2021-06-24"
dataH <- read.table(text = txt, header = TRUE)
DBI::dbWriteTable(db, 'dataH', dataH)
```
```{sql}
SELECT REPLACE(SUBSTRING(price, 3, 100), ' ', '') price,
replace(SUBSTRING(coordinates, 1, 20), ',', '') Lat,
SUBSTRING(coordinates, 21, 255) Long,
SUBSTRING(`floor.size`, 1, 2) floor_size,
surburb,
date
FROM dataH
```
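
The fixed positions in the query above (20 and 21) fit the sample rows, but they assume every latitude has about the same number of characters. The short sketch below uses base R only to illustrate the positions (R's `substr` is 1-based, like SQLite's), with the coordinate quoted in Edit 1 as the second value. The variable names are just for illustration.

```r
coords <- c("-33.93082074573843, 18.857342125467635",  # row from the sample data
            "-33.930989501123, 18.857270308516927")    # coordinate quoted in Edit 1

# Position of the comma differs between the two strings
comma <- as.integer(regexpr(",", coords, fixed = TRUE))
print(comma)

# Fixed start position 21, as in SUBSTRING(coordinates, 21, 255)
long_fixed <- substr(coords, 21, 255)
print(long_fixed)

# Start right after the comma instead, then trim the leading space
long_by_comma <- trimws(substr(coords, comma + 1, nchar(coords)))
print(long_by_comma)
```

With the shorter latitude, position 21 lands on the decimal point of the longitude, which is where the ".85" comes from. Splitting at the comma avoids depending on the string length.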
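You can use `charindex` and `substr` to do what you need. I'll demonstrate with `sqldf`, which uses SQLite's engine under the hood. (This query is very similar to Onyambu's, but fixes a text-selection issue.)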
dat <- structure(list(Price = c("R 1 750 000", "R 1 250 000", "R 2 520 000"), coordinates = c("-33.93082074573843, 18.857342125467635", "-33.930077157927855, 18.85420954236195", "-33.92954929205658, 18.857504799977896"), floor.size = c("68 m²", "56 m²", "62 m²"), surburb = c("Jhb", "Jhb", "Jhb"), date = c("2021-06-24", "2021-06-17", "2021-06-24")), class = "data.frame", row.names = c(NA, -3L))
out <- sqldf::sqldf(
"select cast(replace(substr(price,2,99),' ','') as real) as price,
cast(substr(coordinates,1,charindex(',',coordinates)-1) as real) as lat,
cast(substr(coordinates,charindex(',',coordinates)+1,99) as real) as long,
cast(substr([floor.size],1,charindex('m',[floor.size])-1) as real) as [floor.size]
from dat", method = "raw")
out
# price lat long floor.size
# 1 1750000 -33.93082 18.85734 68
# 2 1250000 -33.93008 18.85421 56
# 3 2520000 -33.92955 18.85750 62
str(out)
# 'data.frame': 3 obs. of 4 variables:
# $ price : num 1750000 1250000 2520000
# $ lat : num -33.9 -33.9 -33.9
# $ long : num 18.9 18.9 18.9
# $ floor.size: num 68 56 62
(The number of digits shown when printing `out` is due to R's "digits" option; the columns are of class `numeric`, as the `str` output shows.)
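
To check that only the display is rounded, you can format a stored value with more significant digits. A small base R sketch, using one latitude from the sample data (the variable names are just for illustration):

```r
lat <- as.numeric("-33.93082074573843")

# 7 significant digits is R's default for getOption("digits")
shown_default <- format(lat, digits = 7)
# Ask for more significant digits to see the stored precision
shown_full <- format(lat, digits = 15)

print(shown_default)
print(shown_full)
```

The same effect is available when printing a whole data frame, e.g. `print(out, digits = 15)`.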
If you switch to `sqldf(.., method = "numeric")`, you can shorten the query by dropping all the `cast(.. as ..)` calls.
out <- sqldf::sqldf(
"select replace(substr(price,2,99),' ','') as price,
substr(coordinates,1,charindex(',',coordinates)-1) as lat,
substr(coordinates,charindex(',',coordinates)+1,99) as long,
substr([floor.size],1,charindex('m',[floor.size])-1) as [floor.size]
from dat", method = "numeric")
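
Since the question uses a plain DBI connection rather than `sqldf`, here is a sketch of the same comma-based split run through `DBI::dbGetQuery()`. It swaps in SQLite's built-in `instr()` for `charindex`; as far as I know, `charindex` is not a core SQLite function but comes from the extension functions that `sqldf` loads, so it may not be available on a bare `dbConnect()` connection. The table name `dataH` follows the question, and the second row uses the coordinate from Edit 1. The output name `floor_size` and the trimmed-down data frame are my own choices for illustration.

```r
library(DBI)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two rows only: one from the sample data, one with the Edit 1 coordinate
dat <- data.frame(
  Price = c("R 1 750 000", "R 1 250 000"),
  coordinates = c("-33.93082074573843, 18.857342125467635",
                  "-33.930989501123, 18.857270308516927"),
  floor.size = c("68 m\u00b2", "56 m\u00b2"),  # \u00b2 is the superscript 2
  stringsAsFactors = FALSE
)
dbWriteTable(db, "dataH", dat)

res <- dbGetQuery(db, "
  SELECT CAST(REPLACE(SUBSTR(Price, 2), ' ', '') AS REAL) AS price,
         CAST(SUBSTR(coordinates, 1, INSTR(coordinates, ',') - 1) AS REAL) AS lat,
         CAST(SUBSTR(coordinates, INSTR(coordinates, ',') + 1) AS REAL) AS long,
         CAST(SUBSTR([floor.size], 1, INSTR([floor.size], 'm') - 1) AS REAL) AS floor_size
  FROM dataH
")

dbDisconnect(db)
print(res)
```

In an R Markdown `{sql}` chunk with `connection = db`, the `SELECT` statement alone should work the same way.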