Different results from DLP API depending on if input is all in one string or sent in as collection of substrings

Question

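I'm seeing a behavior in the Google DLP library that confuses me, and I'm hoping for some clarification. I'm using the Java wrapper library, google-cloud-dlp version 0.34.0-beta. Given the input: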

Collection<String> input = Lists.newArrayList("Jenny Tutone  2665 Agua Vista Dr Los Gatos CA 95030 (408) 867-5309 or 408.867.5309x100");

I see the output:

███  █ ████ or █

If I pass in the same string as a collection of substrings:

Collection<String> input = Lists.newArrayList("Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos", "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");

I see very different results:

███, 2665 █, █ Gatos, █ 95030, █, or, █

I used every InfoType I could find, 67 in total. Am I doing something wrong here? This is the main part of the code that calls the Google DLP library:

private Collection<String> redactContent(Collection<String> input,
                                String replacement,
                                Likelihood minLikelihood,
                                List<InfoType> infoTypes) {
    // Replace select info types with chosen replacement string
    final Collection<RedactContentRequest.ReplaceConfig> replaceConfigs = infoTypes.stream()
            .map(it -> RedactContentRequest.ReplaceConfig.newBuilder().setInfoType(it).setReplaceWith(replacement).build())
            .collect(Collectors.toCollection(LinkedList::new));

    final InspectConfig inspectConfig =
            InspectConfig.newBuilder()
                    .addAllInfoTypes(infoTypes)
                    .setMinLikelihood(minLikelihood)
                    .build();

    long itemCount = 0;

    try (DlpServiceClient dlpClient = DlpServiceClient.create(settings)) {
        // Google's DLP library is limited to 100 items per request, so the requests need to be chunked if the
        // number of input items is greater.

        Stream.Builder<Stream<ContentItem>> streamBuilder = Stream.builder();

        for (long processed = 0; processed < input.size(); processed += maxItemsPerRequest) {
            Collection<ContentItem> items =
                    input.stream()
                            .skip(processed)
                            .limit(maxItemsPerRequest)
                            .filter(item -> item != null && !item.isEmpty())
                            .map(item ->
                                    ContentItem.newBuilder()
                                            .setType(MediaType.PLAIN_TEXT_UTF_8.toString())
                                            .setData(ByteString.copyFromUtf8(item))
                                            .build()
                            )
                            .collect(Collectors.toCollection(LinkedList::new));
            RedactContentRequest request = RedactContentRequest.newBuilder()
                    .setInspectConfig(inspectConfig)
                    .addAllItems(Collections.unmodifiableCollection(items))
                    .addAllReplaceConfigs(replaceConfigs)
                    .build();

            RedactContentResponse contentResponse = dlpClient.redactContent(request);
            itemCount += contentResponse.getItemsCount();
            streamBuilder.add(contentResponse.getItemsList().stream());
        }

        return streamBuilder.build()
                        .flatMap(stream -> stream.map(item -> item.getData().toStringUtf8()))
                        .collect(Collectors.toCollection(LinkedList::new));
    }
}

Answer 1

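Context can affect findings. For addresses in particular, some parts of the address can influence how other parts are classified. For example, "Mountain View CA 94043" may match as LOCATION, but "94043" on its own may not. When running this analysis, we don't cross cell boundaries when determining context, so in your second ArrayList example each string is looked at individually (in its own context).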

Note: I am the PM for the DLP API.
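
Since each item is inspected in its own context, one way to keep related fragments in a shared context is to merge them into a single string before building the request. The following is only a minimal sketch of that idea: `ContextJoin` and `joinForInspection` are hypothetical names, not part of google-cloud-dlp, and the single-space separator is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the DLP client library.
class ContextJoin {

    // Merges fragments into one string, skipping null and empty entries,
    // so the result can be sent as a single item that shares one context.
    static String joinForInspection(List<String> fragments, String separator) {
        return fragments.stream()
                .filter(f -> f != null && !f.isEmpty())
                .collect(Collectors.joining(separator));
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "Jenny Tutone", "2665 Agua Vista Dr", "Los Gatos",
                "CA 95030", "(408) 867-5309", "or", "408.867.5309x100");

        // Assumption: a single space is an acceptable separator for this data.
        String single = joinForInspection(fragments, " ");
        System.out.println(single);
    }
}
```

The joined string could then be passed to a method like the question's `redactContent` as a one-element collection. The trade-off is that the redacted output comes back as one string, so the original fragment boundaries are not preserved.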

google-cloud-platform

google-cloud-dlp