How to deal with nested empty JSON arrays with tidyjson in R
I am reading a JSON object that comes from Salesforce. The object is irregular because some of the nested arrays are empty and some are not. How can I handle this in tidyjson?
I am setting up an API to Salesforce in R. The objective is to get meaningful data out of Salesforce for processing in R.
json <- '
{
"totalSize": [
355710
],
"done": [
false
],
"nextRecordsUrl": [
"/services/data/v45.0/query/01gc000001L8zdkAAB-749"
],
"records": [
{
"attributes": {
"type": "Order_Line__c",
"url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9lUAE"
},
"Id": "a0T1N000009aZ9lUAE",
"Name": "OrderLine-1099369",
"SO_Number_Formula__c": "548402-2.3",
"Ship_From_Inventory__c": "XXX",
"RMA_Number__c": "548402",
"Part_Number__c": "01t1N00000JNeAQQA1",
"Marketing_Part__c": "XXXXXXXXXXX",
"Family__c": "XXXXXXXX",
"Serial_Numbers__r": {
"records": {}
}
},
{
"attributes": {
"type": "Order_Line__c",
"url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9mUAE"
},
"Id": "a0T1N000009aZ9mUAE",
"Name": "OrderLine-1099370",
"SO_Number_Formula__c": "962816-1.1",
"Ship_From_Inventory__c": "XXX",
"RMA_Number__c": "962816",
"Part_Number__c": "01t1N00000JNc3qQAD",
"Marketing_Part__c": "XXXXXXXXXX",
"Family__c": "XXXXXXX",
"RMA_Received_Date__c": "2019-02-18",
"Serial_Numbers__r": {
"totalSize": 1,
"done": true,
"records": [
{
"attributes": {
"type": "Serial_Number__c",
"url": "/services/data/v45.0/sobjects/Serial_Number__c/a0X1N00000NoyAjUAJ"
},
"Id": "a0X1N00000NoyAjUAJ",
"Name": "SN217426",
"Legacy_Line_Id__c": "962816SN217426",
"Customer_Name__c": "XXXXXX",
"Original_Shipment_Date__c": "2018-06-26",
"Disposition__c": "Pending",
"Status__c": "FailureVerification"
}
]
}
}
]
}
'
library(dplyr)
library(tidyjson)

mydata <- json %>%
  as.tbl_json %>%
  enter_object("records") %>%
  gather_array() %>%
  spread_values(
    Id = jstring("Id"),
    Name = jstring("Name"),
    SO_Number_Formula = jstring("SO_Number_Formula__c"),
    Ship_From_Inventory = jstring("Ship_From_Inventory__c"),
    RMA_Number = jstring("RMA_Number__c"),
    Part_Number = jstring("Part_Number__c"),
    Marketing_Part = jstring("Marketing_Part__c"),
    Family = jstring("Family__c")) %>%
  enter_object("Serial_Numbers__r") %>%
  enter_object("records") %>%
  # the error is raised at this step (with gather_keys or gather_array)
  gather_array() %>%
  spread_values(
    Id = jstring("Id"))
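The irregularity is in [records][Serial_Numbers__r][records]. In this example, the first occurrence is empty ({}) and the second is not.
The code produces the following errors when executing gather_keys or gather_array:
Error in gather_keys(.) : 1 records are values not objects
Error in gather_array(.) : 1 records are values not arrays
I think this is caused by the empty array [records]. The Salesforce output contains many irregularities of this kind: some records have detailed nested data and some do not.
How do I handle this?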
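This is a great question, and we definitely should have a cleaner way of handling this. enter_object() has proven problematic in these types of cases, where you lose records because of irregular JSON practices.
I filed an issue to track improvement: https://github.com/colearendt/tidyjson/issues/121
In the meantime, the way I usually do this is to split the records up based on the characteristic that distinguishes them. In this case, you can use gather_object() on the parent object to get the same effect as enter_object(), and then use filter / bind_rows to treat the rows differently.
Ideally bind_rows() would work a bit better in a pipeline here; that is an improvement to dplyr I would love to see (there is an issue open for it). I would love to know whether this solves your problem! (Also, keep spread_all() in mind to simplify specifying some of the columns, at the cost of some "guessing" on the part of the package.)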
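As a rough illustration of the spread_all() remark, here is a minimal sketch on a small made-up document (the orders string and its values are assumptions, not the Salesforce payload from the question). With recursive = FALSE, spread_all() picks up only the top-level scalar values and does not descend into the nested Serial_Numbers__r object, so the order-level columns do not have to be listed one by one.

```r
library(dplyr)
library(tidyjson)

# Hypothetical, trimmed-down stand-in for the Salesforce response
orders <- '[
  {"Id": "a1", "Name": "OrderLine-1",
   "Serial_Numbers__r": {"records": {}}},
  {"Id": "a2", "Name": "OrderLine-2",
   "Serial_Numbers__r": {"records": [{"Id": "s1", "Name": "SN1"}]}}
]'

flat <- orders %>%
  as.tbl_json() %>%
  gather_array("order_row") %>%
  # spread every top-level scalar key into its own column,
  # without spelling out each jstring() by hand
  spread_all(recursive = FALSE)

# inspect which columns were created
names(flat)
```

The trade-off is the one mentioned above: column names and types are inferred from the data, so an explicit spread_values() call is still the safer choice when the schema must be stable.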
json <- '{
"totalSize": [
355710
],
"done": [
false
],
"nextRecordsUrl": [
"/services/data/v45.0/query/01gc000001L8zdkAAB-749"
],
"records": [
{
"attributes": {
"type": "Order_Line__c",
"url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9lUAE"
},
"Id": "a0T1N000009aZ9lUAE",
"Name": "OrderLine-1099369",
"SO_Number_Formula__c": "548402-2.3",
"Ship_From_Inventory__c": "XXX",
"RMA_Number__c": "548402",
"Part_Number__c": "01t1N00000JNeAQQA1",
"Marketing_Part__c": "XXXXXXXXXXX",
"Family__c": "XXXXXXXX",
"Serial_Numbers__r": {
"records": {}
}
},
{
"attributes": {
"type": "Order_Line__c",
"url": "/services/data/v45.0/sobjects/Order_Line__c/a0T1N000009aZ9mUAE"
},
"Id": "a0T1N000009aZ9mUAE",
"Name": "OrderLine-1099370",
"SO_Number_Formula__c": "962816-1.1",
"Ship_From_Inventory__c": "XXX",
"RMA_Number__c": "962816",
"Part_Number__c": "01t1N00000JNc3qQAD",
"Marketing_Part__c": "XXXXXXXXXX",
"Family__c": "XXXXXXX",
"RMA_Received_Date__c": "2019-02-18",
"Serial_Numbers__r": {
"totalSize": 1,
"done": true,
"records": [
{
"attributes": {
"type": "Serial_Number__c",
"url": "/services/data/v45.0/sobjects/Serial_Number__c/a0X1N00000NoyAjUAJ"
},
"Id": "a0X1N00000NoyAjUAJ",
"Name": "SN217426",
"Legacy_Line_Id__c": "962816SN217426",
"Customer_Name__c": "XXXXXX",
"Original_Shipment_Date__c": "2018-06-26",
"Disposition__c": "Pending",
"Status__c": "FailureVerification"
}
]
}
}
]
}
'
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(tidyjson)
#>
#> Attaching package: 'tidyjson'
#> The following object is masked from 'package:dplyr':
#>
#> bind_rows
#> The following object is masked from 'package:stats':
#>
#> filter
prep_data <- json %>%
  as.tbl_json %>%
  enter_object("records") %>%
  gather_array() %>%
  spread_values(
    Id = jstring("Id"),
    Name = jstring("Name"),
    SO_Number_Formula = jstring("SO_Number_Formula__c"),
    Ship_From_Inventory = jstring("Ship_From_Inventory__c"),
    RMA_Number = jstring("RMA_Number__c"),
    Part_Number = jstring("Part_Number__c"),
    Marketing_Part = jstring("Marketing_Part__c"),
    Family = jstring("Family__c")) %>%
  enter_object("Serial_Numbers__r")

# show that types are different
prep_data %>%
  gather_object("key") %>%
  json_types() %>%
  select(key, type) %>%
  filter(key == "records")
#> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
#> `attr(., "JSON")` key type
#> <chr> <chr> <fct>
#> 1 "{}" records object
#> 2 "[{\"attributes\":..." records array
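# handle: split the rows by type, process each part, then recombine
taller <- prep_data %>%
  gather_object("key") %>%
  json_types("type") %>%
  filter(key == "records")

final <- tidyjson::bind_rows(
  # rows where "records" is an empty object: nothing to unnest
  taller %>% filter(type == "object"),
  # rows where "records" is an array: one row per element
  taller %>% filter(type == "array") %>%
    gather_array("record_row") %>%
    spread_values(
      RecordId = jstring("Id")
    )
)

final %>% select(key, type, record_row, RecordId)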
#> # A tbl_json: 2 x 4 tibble with a "JSON" attribute
#> `attr(., "JSON")` key type record_row RecordId
#> <chr> <chr> <fct> <int> <chr>
#> 1 "{}" records object NA <NA>
#> 2 "{\"attributes\":{..." records array 1 a0X1N00000NoyAjUAJ
Created on 2020-03-15 by the reprex package (v0.3.0)
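If the rows with an empty records value are not needed downstream, a shorter variant of the same idea is to keep only the rows whose records value is an array before calling gather_array(). The following is a sketch under that assumption, again on a made-up two-record document; orders, OrderId, and SerialId are illustrative names rather than anything from the original data.

```r
library(dplyr)
library(tidyjson)

# Hypothetical, trimmed-down stand-in for the Salesforce response
orders <- '[
  {"Id": "a1", "Serial_Numbers__r": {"records": {}}},
  {"Id": "a2", "Serial_Numbers__r": {"records": [{"Id": "s1"}]}}
]'

serials <- orders %>%
  as.tbl_json() %>%
  gather_array("order_row") %>%
  spread_values(OrderId = jstring("Id")) %>%
  enter_object("Serial_Numbers__r") %>%
  enter_object("records") %>%
  # record the JSON type of each "records" value ...
  json_types("type") %>%
  # ... and keep only real arrays, so gather_array() never sees {}
  filter(type == "array") %>%
  gather_array("serial_row") %>%
  spread_values(SerialId = jstring("Id"))

serials %>% select(OrderId, serial_row, SerialId)
```

Orders without serial numbers disappear from serials with this approach, so join the result back to the order-level table if those orders must be kept.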