MongoDB delete data using regex

I was able to strip the unwanted text with pandas using the following:

import re

repl = {r'<[^>]+>': '', 
        r'\r\n': ' ',
        r'Share to facebook|Share to twitter|Share to linkedin|Share on Facebook|Share on Twitter|Share on Messenger|Share on Whatsapp': ''}

articles['content'] = articles['content'].replace(repl, regex=True)

How can I do the same on the actual database in Atlas?

My data structure is:

_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:

MongoDB doesn't have any built-in operator to perform a regex replace (as of now).

You can loop through the documents with a regex find in the programming language of your choice and do the replacement there.

from pymongo import MongoClient
import re


m_client = MongoClient("<MONGODB-URI-STRING>")
db = m_client["<DB-NAME>"]
collection = db["<COLLECTION-NAME>"]

replace_dictionary = {
    r'<[^>]+>': '',
    r'\r\n': ' ',
    r'Share to facebook|Share to twitter|Share to linkedin|Share on Facebook|Share on Twitter|Share on Messenger|Share on Whatsapp': ''
}

count = 0

for it in collection.find({
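    # Merge all regex finds into a single `$or` list.
    # Note: this filter is case-insensitive while `re.sub` below is case-sensitive,
    # so some matched documents may be written back unchanged.
    "$or": [{"content": re.compile(x, re.IGNORECASE)} for x in replace_dictionary.keys()]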
}, {
    # Project only the field to be replaced for faster execution of script
    "content": 1
}):
  #  Iterate over regex and replacements and apply the same using `re.sub` 
  for k, v in replace_dictionary.items():
    it["content"] = re.sub(
        pattern=k,
        repl=v,
        string=it["content"],
    )

  # Update the regex replaced string
  collection.update_one({
    "_id": it["_id"]
  }, {
    "$set": {
        "content": it['content']
    }
  })

  # Count to keep track of completion
  count += 1
  print("\r", count, end='')

print("DONE!!!")
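
The replacement step in the loop above can also be pulled out into a small pure function, which makes it easy to try on a sample string before touching the database. This is only a sketch: `clean_content`, `REPLACEMENTS` and `COMPILED` are hypothetical names that are not part of the script above, and the sample string is made up for illustration.

```python
import re

# Same patterns as in the script above. Order matters, because each
# substitution runs on the output of the previous one.
REPLACEMENTS = {
    r'<[^>]+>': '',
    r'\r\n': ' ',
    r'Share to facebook|Share to twitter|Share to linkedin|Share on Facebook|Share on Twitter|Share on Messenger|Share on Whatsapp': '',
}

# Compile each pattern once instead of on every document.
COMPILED = [(re.compile(pattern), repl) for pattern, repl in REPLACEMENTS.items()]


def clean_content(text):
    """Hypothetical helper: apply every regex replacement in turn (case-sensitive)."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text


# Made-up sample, only to show the behaviour.
sample = "<p>Hello</p>\r\nShare to facebook world"
print(repr(clean_content(sample)))
```

Inside the loop, the result of `clean_content(it["content"])` could be compared with the original value so that `update_one` is only called when the text actually changed. That would skip writes for documents that the case-insensitive filter matched but the case-sensitive substitution left untouched.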