How to ignore specified words in text?

Question

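I am using the Java version of the translation API in App Engine. Is there a way to ignore specific words during translation, for example: "Translate IGNORED_TEXT this"? For some languages IGNORED_TEXT is not in a valid format, and there is no guarantee that the translation API will leave it unchanged.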

Answer 1

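After many attempts I ended up with a kind of retrier that substitutes special characters for the text I want ignored. In my case these are format-string parameters (%d, %s, etc.). Maybe this will help someone: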

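import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed: StringUtils comes from Apache Commons Lang 3 (it provides countMatches).
import org.apache.commons.lang3.StringUtils;

public class Parser {

// Candidate stand-ins for %s and %d. A later entry is tried when the
// translation API altered the earlier one.
public static final String[] MAGIC_PARAMETER_STRING = {"975313579", "*****", "˨", "இ", "⏲"};
public static final String[] MAGIC_PARAMETER_NUMBER = {"975323579", "*******", "Ω", "˧", "\u23FA"};
// Matches printf-style format specifiers; backslashes are doubled for the Java string literal.
private static final String formatSpecifier
        = "%(\\d+\\$)?([-#+ 0,(\\<]*)?(\\d+)?(\\.\\d+)?([tT])?([a-zA-Z%])";
private static final Pattern formatToken = Pattern.compile(formatSpecifier);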
private final int maxStringParameterCount = Parser.MAGIC_PARAMETER_STRING.length;
private final int maxNumberParameterCount = Parser.MAGIC_PARAMETER_NUMBER.length;
private int stringPos = 0;
private int numberPos = 0;

private String convertToken(ConvertedString result, String index, String flags, String width, String precision, String temporal, String conversion, String numberReplacement, String stringReplacement) {
    if (conversion.equals("s")) {
        result.stringArgCount++;
        return stringReplacement;
    } else if (conversion.equals("d")) {
        result.numberArgCount++;
        return numberReplacement;
    }
    throw new IllegalArgumentException("%" + index + flags + width + precision + temporal + conversion);
}

private String getReplacementNumber(boolean bumpUp) throws RetryExceededException {
    if (bumpUp) {
        ++numberPos;
    }
    if (numberPos >= maxNumberParameterCount) {
        throw new RetryExceededException();
    }
    return MAGIC_PARAMETER_NUMBER[numberPos];
}

private String getReplacementString(boolean bumpUp) throws RetryExceededException {
    if (bumpUp) {
        ++stringPos;
    }
    if (stringPos >= maxStringParameterCount) {
        throw new RetryExceededException();
    }
    return MAGIC_PARAMETER_STRING[stringPos];
}

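// Restores %s and %d in translated text, using the currently selected stand-ins,
// and records how many of each were found so the caller can detect losses.
public ConvertedString revert(String text) throws RetryExceededException {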
    ConvertedString convertedString = new ConvertedString();
    String replacementString = getReplacementString(false);
    String replacementNumber = getReplacementNumber(false);
    convertedString.stringArgCount = StringUtils.countMatches(text, replacementString);
    convertedString.numberArgCount = StringUtils.countMatches(text, replacementNumber);
    String result = text.replace(replacementString, "%s");
    result = result.replace(replacementNumber, "%d");
    convertedString.result = result;
    return convertedString;
}

public ConvertedString convert(final String format) {
    return convert(format, MAGIC_PARAMETER_NUMBER[0], MAGIC_PARAMETER_STRING[0]);
}

public ConvertedString convert(final String format, String numberReplacement, String stringReplacement) {
    ConvertedString result = new ConvertedString();
    final StringBuilder regex = new StringBuilder();
    final Matcher matcher = formatToken.matcher(format);
    int lastIndex = 0;
    while (matcher.find()) {
        regex.append(format.substring(lastIndex, matcher.start()));
        regex.append(convertToken(result, matcher.group(1), matcher.group(2), matcher.group(3),
                matcher.group(4), matcher.group(5), matcher.group(6), numberReplacement, stringReplacement));
        lastIndex = matcher.end();
    }
    regex.append(format.substring(lastIndex, format.length()));
    result.result = regex.toString();
    return result;
}

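// Converts the original text again, optionally advancing to the next stand-in
// for strings and/or numbers after a failed attempt.
public ConvertedString retryConvert(String originalText, boolean bumpUpString, boolean bumpUpNumber) throws RetryExceededException {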
    String replacementNumber = getReplacementNumber(bumpUpNumber);
    String replacementString = getReplacementString(bumpUpString);
    return convert(originalText, replacementNumber, replacementString);
}

public static class ConvertedString {
    public int stringArgCount;
    public int numberArgCount;
    public String result;

}

public static class RetryExceededException extends Exception {

}
}
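
For readers who only want the retry idea, here is a minimal standalone sketch. It is a simplification, not the Parser above: it handles only %s, and the names PlaceholderRetry, translateWithRetry and the translator argument are made up for illustration. The translator stands in for whatever call you make to the translation API.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class PlaceholderRetry {
    // A subset of the stand-ins from MAGIC_PARAMETER_STRING, tried in order.
    // "\u23F2" is the timer-clock character from that array.
    static final List<String> TOKENS = List.of("975313579", "*****", "\u23F2");

    // Counts non-overlapping occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        for (int i = haystack.indexOf(needle); i >= 0;
                i = haystack.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    // Replaces %s with a stand-in, translates, and checks that every stand-in
    // came back intact. If not, the next stand-in is tried.
    static String translateWithRetry(String text, UnaryOperator<String> translator) {
        int expected = count(text, "%s");
        for (String token : TOKENS) {
            if (text.contains(token)) {
                continue; // the stand-in already occurs in the text, so it would be ambiguous
            }
            String translated = translator.apply(text.replace("%s", token));
            if (count(translated, token) == expected) {
                return translated.replace(token, "%s");
            }
        }
        throw new IllegalStateException("No stand-in survived translation");
    }
}
```

The check only compares counts, so it will not notice a stand-in that was moved to an odd position in the sentence. That still has to be reviewed by hand.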

Answer 2

Solution #1

Replace IGNORED_TEXT with <span class="notranslate">IGNORED_TEXT</span>.

Edit: this will not work in the GUI at https://translate.google.com/, but it will work using the API at https://translation.googleapis.com/language/translate/v2.
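
A minimal sketch of the wrapping step, assuming the ignored text is a literal string and the rest of the text is already safe to send as HTML. The class and method names (NoTranslate, protect, unwrap) are invented for this example.

```java
final class NoTranslate {
    private static final String OPEN = "<span class=\"notranslate\">";
    private static final String CLOSE = "</span>";

    // Wraps every literal occurrence of `ignored` in a notranslate span.
    static String protect(String text, String ignored) {
        return text.replace(ignored, OPEN + ignored + CLOSE);
    }

    // Removes the wrapper spans from the translated text and keeps their content.
    // Assumes the API returned the span markup unchanged.
    static String unwrap(String translated) {
        return translated.replaceAll("<span class=\"notranslate\">(.*?)</span>", "$1");
    }
}
```

Call protect before sending the text and unwrap on the response. If the text can contain characters such as & or <, they probably need HTML-escaping first, since the whole string is treated as HTML.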

Solution #2:

~~Replace IGNORED_TEXT with its md5, translate everything, then replace it back. (Works for %s, %1$s, abc)~~ - Edit: does not work for some languages, for example Serbian.

Answer 3

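Thanks @ViliusL, your answer led me to the solution to my problem. I had been struggling to exclude part of a text from translation. So far I have not found any hints elsewhere (Stack Overflow, Google), so I am leaving an answer in this thread.

In my case the problem was the wrong MIME type. If you use the Google Cloud Translation API (version 2 or 3, it does not matter which), you have to set the MIME type to "text/html" instead of "text/plain". With the text/plain MIME type, Google will ignore the HTML tags and class="notranslate". Example below: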

    TranslateTextResponse requestForTranslation() {
        try (TranslationServiceClient client = googleTranslationServiceProvider.getClient()) {
            return client.translateText(buildRequest());
        }
    }

    TranslateTextRequest buildRequest() {
        return TranslateTextRequest.newBuilder()
                .setParent("YOUR_PARENT")
                .setMimeType("text/html") // must be text/html, not text/plain
                .setSourceLanguageCode("DE")
                .setTargetLanguageCode("EN")
                .addContents("<span class=\"notranslate\">etwas auf deutsch</span>")
                .build();
    }

Reference: https://cloud.google.com/translate/docs/supported-formats
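
The example above uses the v3 client. For the v2 REST endpoint, the corresponding setting is, as far as I know, the format request parameter (html or text). Below is a rough sketch that only builds the form body for such a request and does not send it. V2RequestBody and build are made-up names, and the API key is a placeholder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class V2RequestBody {
    // Builds an application/x-www-form-urlencoded body for a POST to
    // https://translation.googleapis.com/language/translate/v2.
    // format=html asks the API to treat q as HTML, so notranslate spans are honored.
    static String build(String html, String source, String target, String apiKey) {
        return "q=" + enc(html)
                + "&source=" + enc(source)
                + "&target=" + enc(target)
                + "&format=html"
                + "&key=" + enc(apiKey);
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

For example, V2RequestBody.build("<span class=\"notranslate\">etwas</span> auf deutsch", "de", "en", "YOUR_API_KEY") gives a body you can post with any HTTP client.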

PS. I noticed that you can also use "<>" tags to exclude text from translation, like this:

 .addContents("<etwas auf deutsch>")

google-translate