Data Loss Prevention finds superfluous entities when masking email

I'm calling the DLP API to mask person names and email addresses in text, using the following request:

Request

{
  "item": {
    "value": "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"
  },
  "deidentifyConfig": {
    "infoTypeTransformations": {
      "transformations": [
        {
          "infoTypes": [ { "name": "EMAIL_ADDRESS" } ],
          "primitiveTransformation": {
            "characterMaskConfig": {
              "maskingCharacter": "#",
              "reverseOrder": false,
              "charactersToIgnore": [
                {
                  "charactersToSkip": ".@"
                }
              ]
            }
          }
        },
        {
          "infoTypes": [ { "name": "PERSON_NAME" } ],
          "primitiveTransformation": {
            "replaceConfig": {
              "newValue": {
                "stringValue": "(person)"
              }
            }
          }
        }
      ]
    }
  },
  "inspectConfig": {
    "infoTypes": [ { "name": "EMAIL_ADDRESS" }, { "name": "PERSON_NAME" } ]
  }
}

API call

curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://dlp.googleapis.com/v2/projects/$PROJECT_ID/content:deidentify \
  -d @gcp-dlp/input/text-request.json

Response

{
  "item": {
    "value": "(person)\nPharmacist\n(person)#######.#####@#######.###(person)"
  },
  "overview": {
    "transformedBytes": "50",
    "transformationSummaries": [
      {
        "infoType": {
          "name": "EMAIL_ADDRESS"
        },
        "transformation": {
          "characterMaskConfig": {
            "maskingCharacter": "#",
            "charactersToIgnore": [
              {
                "charactersToSkip": ".@"
              }
            ]
          }
        },
        "results": [
          {
            "count": "1",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      },
      {
        "infoType": {
          "name": "PERSON_NAME"
        },
        "transformation": {
          "replaceConfig": {
            "newValue": {
              "stringValue": "(person)"
            }
          }
        },
        "results": [
          {
            "count": "3",
            "code": "SUCCESS"
          }
        ],
        "transformedBytes": "25"
      }
    ]
  }
}

Request (text only)

Eleanor Rigby
Pharmacist
eleanor.rigby@example.com

Response (text only)

(person)
Pharmacist
(person)#######.#####@#######.###(person)

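The input text contains a person name and an email address. Both are detected and masked as expected. However, extra (person) tags are added before and after the masked email address.

This is a very simple example, but I observe this behavior in every document I process this way.

Why is the person entity detected multiple times?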

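This issue was reported on the Google Public Issue Tracker. Requests of this kind are not indexed, but the tracker is a good way to report issues or request new features. Please follow that case for updates.

Google suggested a workaround: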

This is a case where we have some undefined behavior when findings overlap. The person comes from the user's configuration to replace people name with person.

They can omit the overlaps.
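One way to confirm that overlapping findings are the cause is to send the same text to content:inspect and compare the byte ranges of the returned findings. The following is only a minimal sketch: it assumes the findings are available as parsed JSON dictionaries with infoType.name and location.byteRange fields, the helper name overlapping_findings is hypothetical, and the sample findings are made up for illustration rather than taken from a real response.

```python
from itertools import combinations


def overlapping_findings(findings):
    """Return pairs of findings with different infoTypes whose byte ranges overlap."""

    def span(finding):
        byte_range = finding["location"]["byteRange"]
        # int64 fields arrive as JSON strings; a start of 0 may be omitted.
        return int(byte_range.get("start", 0)), int(byte_range["end"])

    pairs = []
    for a, b in combinations(findings, 2):
        a_start, a_end = span(a)
        b_start, b_end = span(b)
        different_types = a["infoType"]["name"] != b["infoType"]["name"]
        # Half-open ranges [start, end) overlap when each starts before the other ends.
        if different_types and a_start < b_end and b_start < a_end:
            pairs.append((a, b))
    return pairs


# Hypothetical findings for "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com".
sample_findings = [
    {"infoType": {"name": "PERSON_NAME"},
     "location": {"byteRange": {"end": "13"}}},
    {"infoType": {"name": "PERSON_NAME"},
     "location": {"byteRange": {"start": "25", "end": "38"}}},
    {"infoType": {"name": "EMAIL_ADDRESS"},
     "location": {"byteRange": {"start": "25", "end": "50"}}},
]

for a, b in overlapping_findings(sample_findings):
    print(a["infoType"]["name"], "overlaps", b["infoType"]["name"])
```

If a PERSON_NAME finding shows up inside the span of an EMAIL_ADDRESS finding, both transformations compete for the same bytes, which is the overlap situation Google describes.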

For more information, see the documentation page Modifying infoType detectors to refine scan results, section Omit matches on PERSON_NAME detector if also matched by EMAIL_ADDRESS detector:

The following JSON snippet and code in several languages illustrate how to indicate to Cloud DLP using an InspectConfig that it should only return one match in the case that matches for the PERSON_NAME detector overlap with matches for the EMAIL_ADDRESS detector. Doing this is to avoid the situation where an email address such as "james@example.com" matches on both the PERSON_NAME and EMAIL_ADDRESS detectors.

...
    "inspectConfig":{
      "ruleSet":[
        {
          "infoTypes":[
            {
              "name":"PERSON_NAME"
            }
          ],
          "rules":[
            {
              "exclusionRule":{
                "excludeInfoTypes":{
                  "infoTypes":[
                    {
                      "name":"EMAIL_ADDRESS"
                    }
                  ]
                },
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH"
              }
            }
          ]
        }
      ]
    } 
...
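Applied to the request from the question, the snippet above goes into the existing inspectConfig next to the infoTypes list. The sketch below shows one way to merge it in Python before saving the body to the file passed to curl. The function name with_email_exclusion is hypothetical, and the result has not been verified against the live API, so treat it as an illustration of the structure only.

```python
import copy
import json


def with_email_exclusion(request):
    """Return a copy of a content:deidentify body that drops PERSON_NAME
    findings overlapping an EMAIL_ADDRESS finding."""
    merged = copy.deepcopy(request)
    rule_set = {
        "infoTypes": [{"name": "PERSON_NAME"}],
        "rules": [{
            "exclusionRule": {
                "excludeInfoTypes": {"infoTypes": [{"name": "EMAIL_ADDRESS"}]},
                "matchingType": "MATCHING_TYPE_PARTIAL_MATCH",
            }
        }],
    }
    merged.setdefault("inspectConfig", {}).setdefault("ruleSet", []).append(rule_set)
    return merged


# The request body from the question.
original_request = {
    "item": {"value": "Eleanor Rigby\nPharmacist\neleanor.rigby@example.com"},
    "deidentifyConfig": {"infoTypeTransformations": {"transformations": [
        {
            "infoTypes": [{"name": "EMAIL_ADDRESS"}],
            "primitiveTransformation": {"characterMaskConfig": {
                "maskingCharacter": "#",
                "reverseOrder": False,
                "charactersToIgnore": [{"charactersToSkip": ".@"}],
            }},
        },
        {
            "infoTypes": [{"name": "PERSON_NAME"}],
            "primitiveTransformation": {"replaceConfig": {
                "newValue": {"stringValue": "(person)"},
            }},
        },
    ]}},
    "inspectConfig": {
        "infoTypes": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}],
    },
}

fixed_request = with_email_exclusion(original_request)
print(json.dumps(fixed_request, indent=2))
```

The deidentifyConfig stays unchanged. Only the inspection step is told to discard the PERSON_NAME matches that fall inside an email address, so the replace transformation should no longer have overlapping findings to act on.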