How to ignore specified words in text?
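I'm using the Java version of the Translate API in App Engine.
Is there a way to ignore specific words during translation? For example:
"Translate IGNORED_TEXT this". For some languages IGNORED_TEXT comes back incorrectly formatted, and there is no guarantee that the Translate API will not change it.
After many attempts I ended up with a kind of retryer that substitutes special characters for the text I want ignored. In my case that text is string format parameters (%d, %s, etc.). Maybe this will help someone: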
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Assumption: StringUtils comes from Apache Commons Lang 3 (the original does not say which library).
    import org.apache.commons.lang3.StringUtils;

    public class Parser {
        // Candidate placeholders, tried in order until one survives translation unchanged.
        public static final String[] MAGIC_PARAMETER_STRING = {"975313579", "*****", "˨", "இ", "⏲"};
        public static final String[] MAGIC_PARAMETER_NUMBER = {"975323579", "*******", "Ω", "˧", "\u23FA"};

        // Format specifier syntax: %[index$][flags][width][.precision][t]conversion
        private static final String formatSpecifier
                = "%(\\d+\\$)?([-#+ 0,(\\<]*)?(\\d+)?(\\.\\d+)?([tT])?([a-zA-Z%])";
        private static final Pattern formatToken = Pattern.compile(formatSpecifier);

        private final int maxStringParameterCount = Parser.MAGIC_PARAMETER_STRING.length;
        private final int maxNumberParameterCount = Parser.MAGIC_PARAMETER_NUMBER.length;
        // Index of the placeholder currently in use for %s and %d.
        private int stringPos = 0;
        private int numberPos = 0;

        // Only %s and %d are supported; any other conversion (including %%) is rejected.
        private String convertToken(ConvertedString result, String index, String flags, String width, String precision, String temporal, String conversion, String numberReplacement, String stringReplacement) {
            if (conversion.equals("s")) {
                result.stringArgCount++;
                return stringReplacement;
            } else if (conversion.equals("d")) {
                result.numberArgCount++;
                return numberReplacement;
            }
            throw new IllegalArgumentException("%" + index + flags + width + precision + temporal + conversion);
        }

        // Returns the current number placeholder, optionally moving on to the next one first.
        private String getReplacementNumber(boolean bumpUp) throws RetryExceededException {
            if (bumpUp) {
                ++numberPos;
            }
            if (numberPos >= maxNumberParameterCount) {
                throw new RetryExceededException();
            }
            return MAGIC_PARAMETER_NUMBER[numberPos];
        }

        // Returns the current string placeholder, optionally moving on to the next one first.
        private String getReplacementString(boolean bumpUp) throws RetryExceededException {
            if (bumpUp) {
                ++stringPos;
            }
            if (stringPos >= maxStringParameterCount) {
                throw new RetryExceededException();
            }
            return MAGIC_PARAMETER_STRING[stringPos];
        }

        // Swaps the current placeholders back to %s / %d and records how many of each were found.
        public ConvertedString revert(String text) throws RetryExceededException {
            ConvertedString convertedString = new ConvertedString();
            String replacementString = getReplacementString(false);
            String replacementNumber = getReplacementNumber(false);
            convertedString.stringArgCount = StringUtils.countMatches(text, replacementString);
            convertedString.numberArgCount = StringUtils.countMatches(text, replacementNumber);
            String result = text.replace(replacementString, "%s");
            result = result.replace(replacementNumber, "%d");
            convertedString.result = result;
            return convertedString;
        }

        // Replaces format specifiers using the first placeholder of each kind.
        public ConvertedString convert(final String format) {
            return convert(format, MAGIC_PARAMETER_NUMBER[0], MAGIC_PARAMETER_STRING[0]);
        }

        // Replaces every %s / %d in the format string with the given placeholders.
        public ConvertedString convert(final String format, String numberReplacement, String stringReplacement) {
            ConvertedString result = new ConvertedString();
            final StringBuilder converted = new StringBuilder();
            final Matcher matcher = formatToken.matcher(format);
            int lastIndex = 0;
            while (matcher.find()) {
                // Copy the literal text before the specifier, then the placeholder itself.
                converted.append(format, lastIndex, matcher.start());
                converted.append(convertToken(result, matcher.group(1), matcher.group(2), matcher.group(3),
                        matcher.group(4), matcher.group(5), matcher.group(6), numberReplacement, stringReplacement));
                lastIndex = matcher.end();
            }
            converted.append(format, lastIndex, format.length());
            result.result = converted.toString();
            return result;
        }

        // Converts again after optionally advancing to the next placeholder of either kind.
        public ConvertedString retryConvert(String originalText, boolean bumpUpString, boolean bumpUpNumber) throws RetryExceededException {
            String replacementNumber = getReplacementNumber(bumpUpNumber);
            String replacementString = getReplacementString(bumpUpString);
            return convert(originalText, replacementNumber, replacementString);
        }

        public static class ConvertedString {
            public int stringArgCount;
            public int numberArgCount;
            public String result;
        }

        // Thrown when every candidate placeholder has been tried.
        public static class RetryExceededException extends Exception {
        }
    }
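The class above does not show the retry loop itself, so here is a minimal, self-contained sketch of how such a loop might look. It is a simplification rather than the author's code: it handles only %s, and the names PlaceholderRetry, SENTINELS and the translator parameter (a stand-in for the real API call) are hypothetical.

```java
import java.util.function.UnaryOperator;

/** Sketch: mask %s with a sentinel, translate, and retry with the next sentinel if it was altered. */
final class PlaceholderRetry {
    // Candidate sentinels, tried in order (borrowed from the MAGIC_PARAMETER_STRING idea).
    static final String[] SENTINELS = {"975313579", "*****", "\u23FA"};

    /** Counts non-overlapping occurrences of needle in haystack (needle must be non-empty). */
    static int count(String haystack, String needle) {
        int n = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + needle.length())) {
            n++;
        }
        return n;
    }

    /** translator is a hypothetical stand-in for the call to the translation service. */
    static String translateKeepingPlaceholders(String format, UnaryOperator<String> translator) {
        int expected = count(format, "%s");
        for (String sentinel : SENTINELS) {
            if (format.contains(sentinel)) {
                continue; // the sentinel already occurs in the text, so it would be ambiguous
            }
            String translated = translator.apply(format.replace("%s", sentinel));
            // Accept the result only if every sentinel survived translation unchanged.
            if (count(translated, sentinel) == expected) {
                return translated.replace(sentinel, "%s");
            }
        }
        throw new IllegalStateException("every sentinel was altered by the translator");
    }
}
```

Each failed attempt costs one extra translation request, so the order of the sentinels matters: put the one most likely to survive first.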
Solution #1:
Replace IGNORED_TEXT with <span class="notranslate">IGNORED_TEXT</span>.
Edit: this will not work using the GUI on https://translate.google.com/, but it will work using the API on https://translation.googleapis.com/language/translate/v2.
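A small helper can do the wrapping before the request and remove the spans afterwards. This is only a sketch: the class and method names are made up for illustration, and unprotect assumes the response returns the span markup exactly as it was sent.

```java
import java.util.regex.Pattern;

/** Sketch: wrap text in a notranslate span before translation and strip the span afterwards. */
final class NoTranslate {
    private static final String OPEN = "<span class=\"notranslate\">";
    private static final String CLOSE = "</span>";
    // Matches one wrapped fragment; group 1 is the protected text.
    private static final Pattern WRAPPED =
            Pattern.compile(Pattern.quote(OPEN) + "(.*?)" + Pattern.quote(CLOSE), Pattern.DOTALL);

    /** Wraps every literal occurrence of ignoredText in a notranslate span. */
    static String protect(String text, String ignoredText) {
        return text.replace(ignoredText, OPEN + ignoredText + CLOSE);
    }

    /** Removes the spans again, keeping their contents. */
    static String unprotect(String translated) {
        return WRAPPED.matcher(translated).replaceAll("$1");
    }
}
```

For example, NoTranslate.protect("Translate IGNORED_TEXT this", "IGNORED_TEXT") produces the string to send, and NoTranslate.unprotect is applied to the translated result. If the text can contain characters such as < or &, it probably needs HTML escaping as well, which this sketch does not do.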
Solution #2:
Replace IGNORED_TEXT with its md5, translate everything, then replace it back. (Works with %s, %1$s, abc.) Edit: this doesn't work for some languages, e.g. Serbian.
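The md5 round trip might be sketched as follows. The class Md5Mask and its methods are hypothetical names, and the sketch assumes the translator leaves the 32-character hex token untouched, which (as noted above) is not guaranteed for every language.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: replace ignored text with its MD5 hex digest, then swap it back after translation. */
final class Md5Mask {
    // token -> original text, remembered so the tokens can be restored later.
    private final Map<String, String> originals = new LinkedHashMap<>();

    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Replaces every literal occurrence of ignoredText with its MD5 token. */
    String mask(String text, String ignoredText) {
        String token = md5Hex(ignoredText);
        originals.put(token, ignoredText);
        return text.replace(ignoredText, token);
    }

    /** Restores all tokens recorded by earlier mask calls. */
    String unmask(String translated) {
        String result = translated;
        for (Map.Entry<String, String> e : originals.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Comparing the number of tokens before and after translation is a cheap way to detect the cases where the service did modify them.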
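Thanks @ViliusL, your answer led me to the solution to my problem. I had been struggling to exclude part of a text from translation. So far I haven't found any hints (Stack Overflow, Google), so I'm leaving an answer in this thread.
In my case the problem was the wrong MIME type. If you use the Google Cloud Translation API (version 2 or 3, it doesn't matter which), you must set the MIME type to "text/html" instead of "text/plain". With the text/plain MIME type, Google ignores the HTML tags and class="notranslate". Example below: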
    TranslateTextResponse requestForTranslation() {
        // The client is closed automatically once the request has completed.
        try (TranslationServiceClient client = googleTranslationServiceProvider.getClient()) {
            return client.translateText(buildRequest());
        }
    }

    TranslateTextRequest buildRequest() {
        return TranslateTextRequest.newBuilder()
                .setParent("YOUR_PARENT")
                .setMimeType("text/html") // must be text/html, not text/plain
                .setSourceLanguageCode("DE")
                .setTargetLanguageCode("EN")
                .addContents("<span class=\"notranslate\">etwas auf deutsch</span>")
                .build();
    }
Reference: https://cloud.google.com/translate/docs/supported-formats
PS. I noticed that you can also exclude text from translation by wrapping it in "<>", like this:
    .addContents("<etwas auf deutsch>")