How to overwrite/update a collection in Azure Cosmos DB from Databricks/PySpark

Question

I wrote the following PySpark code in a Databricks notebook. It successfully saves the results of a Spark SQL query to Azure Cosmos DB with this line:

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()

The full code is as follows:

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID1
 ,Sales.InvoiceNumber AS myinvoicenr1
FROM Sales
limit 4""")


## my personal cosmos DB
writeConfig3 = {
    "Endpoint": "https://<cosmosdb-account>.documents.azure.com:443/",
    "Masterkey": "<key>==",
    "Database": "mydatabase",
    "Collection": "mycontainer",
    "Upsert": "true"
}

df = test.coalesce(1)

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("overwrite").options(**writeConfig3).save()

With the code above I have successfully written to my Cosmos DB database (mydatabase) and collection (mycontainer).

I then tried to overwrite the container by changing the Spark SQL query as follows (only renaming pattersonID1 to pattersonID2 and myinvoicenr1 to myinvoicenr2):

test = spark.sql("""SELECT
  Sales.CustomerID AS pattersonID2
 ,Sales.InvoiceNumber AS myinvoicenr2
FROM Sales
limit 4""")

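Instead of overwriting or updating the collection with the results of the new query, Cosmos DB appends the new documents to the container, and the documents from the original query are still kept in the collection.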

Is there a way to completely overwrite or update the Cosmos DB collection?

Answer 1

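Your issue is that documents have a unique id (something you never specified, so it is auto-generated for you as a GUID). When you wrote your new documents, you only renamed one of the non-id, non-unique properties, pattersonID1, to pattersonID2, so the write simply creates new documents, as expected. There is no way for Cosmos DB to know that a new document is related to an original one, because it is a completely new document with its own set of properties.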
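The behaviour described above can be illustrated with a toy in-memory model. This is only a sketch of upsert-by-id semantics, not the Cosmos DB API: the dict stands in for a container, the `upsert` helper and the sample values are hypothetical, and partition keys are ignored for simplicity.

```python
import uuid


def upsert(container, doc):
    """Insert doc, or replace the stored doc that has the same id."""
    doc = dict(doc)
    # Assumption: mimic the auto-generated GUID used when no id is supplied.
    doc.setdefault("id", str(uuid.uuid4()))
    # An upsert replaces the whole stored document, it does not merge fields.
    container[doc["id"]] = doc
    return doc["id"]


# No id supplied: every write gets a fresh GUID, so the second write is appended.
container = {}
upsert(container, {"pattersonID1": 101, "myinvoicenr1": "A-1"})
upsert(container, {"pattersonID2": 101, "myinvoicenr2": "A-1"})
appended_count = len(container)

# Stable id derived from the row: the second write replaces the first.
stable = {}
upsert(stable, {"id": "101-A-1", "pattersonID1": 101, "myinvoicenr1": "A-1"})
upsert(stable, {"id": "101-A-1", "pattersonID2": 101, "myinvoicenr2": "A-1"})
replaced_count = len(stable)
```

Under this model, the practical consequence for the question is that an upsert can only replace a document when both writes carry the same id. One possible approach, assuming the source rows have a stable key, would be to add a string `id` column to the DataFrame before writing it.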

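You can update existing documents by querying (or reading) them, modifying them, and then replacing them. Alternatively, you can query for the old documents and delete them (either one by one, or transactionally as a batch of deletes within a partition, via a stored procedure). Lastly, you can delete and re-create the container, which removes all documents currently stored in it.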
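A minimal sketch of these three options using the `azure-cosmos` Python SDK (version 4 style calls) might look like the following. The helper names are hypothetical, and the code assumes the partition key is a top-level field whose name you pass in as `partition_key_field`; the question does not say how the container is partitioned. The helpers take container and database client objects as arguments, so nothing here connects to an account by itself.

```python
QUERY_ALL = "SELECT * FROM c"


def rename_property(container, old_name, new_name):
    """Query documents, rename one property, and replace each document."""
    updated = 0
    docs = list(container.query_items(
        query=QUERY_ALL, enable_cross_partition_query=True))
    for doc in docs:
        if old_name in doc:
            doc[new_name] = doc.pop(old_name)
            # The id is unchanged, so this replaces the stored document.
            container.replace_item(item=doc["id"], body=doc)
            updated += 1
    return updated


def delete_all_items(container, partition_key_field):
    """Delete every document in the container, one by one."""
    deleted = 0
    docs = list(container.query_items(
        query=QUERY_ALL, enable_cross_partition_query=True))
    for doc in docs:
        # Assumption: the partition key value is stored in a top-level field.
        container.delete_item(doc["id"], partition_key=doc[partition_key_field])
        deleted += 1
    return deleted


def recreate_container(database, container_id, partition_key_path):
    """Drop the container and create an empty one with the same name."""
    from azure.cosmos import PartitionKey  # requires the azure-cosmos package

    database.delete_container(container_id)
    return database.create_container(
        id=container_id,
        partition_key=PartitionKey(path=partition_key_path),
    )
```

Deleting one by one costs request units per document, so for a large container re-creating it is likely cheaper. Note that re-creating also discards container settings such as indexing policy unless you pass them again.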

Answer 2

Instead of the Spark to Cosmos DB Connector, you can use the Azure Cosmos DB SQL API SDK for Python to manage databases and the JSON documents they contain. With this SDK you can:

Create Cosmos DB databases and modify their settings

Create and modify containers to store collections of JSON documents

Create, read, update, and delete the items (JSON documents) in your containers

Query the documents in your database using SQL-like syntax.

Azure Cosmos DB SQL API client library for Python
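As a rough sketch of how the SDK could be used for the case in the question, the helpers below connect to a container and upsert rows with a deterministic id. The function names and the `id_fields` parameter are hypothetical, and the id scheme (joining selected field values with "-") is only an assumption for illustration. Each row is also assumed to already contain the container's partition key field.

```python
def get_container(endpoint, key, database_name, container_name):
    """Return a container client for an existing database and container."""
    from azure.cosmos import CosmosClient  # requires the azure-cosmos package

    client = CosmosClient(endpoint, credential=key)
    database = client.get_database_client(database_name)
    return database.get_container_client(container_name)


def upsert_rows(container, rows, id_fields):
    """Upsert each row as a document whose id is built from id_fields."""
    count = 0
    for row in rows:
        doc = dict(row)
        # Assumption: these fields together identify a row; Cosmos DB ids are strings.
        doc["id"] = "-".join(str(doc[field]) for field in id_fields)
        container.upsert_item(doc)
        count += 1
    return count
```

For a small result such as the four rows in the question, the rows could be collected on the driver and passed in, for example `upsert_rows(container, [r.asDict() for r in df.collect()], ["pattersonID2", "myinvoicenr2"])`. Re-running with renamed properties then replaces the earlier documents, provided the id values come out the same.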

pyspark

pyspark-sql

azure-cosmosdb

azure-databricks