将单词添加到语言工具建议列表
Add words to languagetool suggesting list
我在我的应用程序中使用 LanguageTool 进行一些拼写检查和拼写更正功能。
LanguageTool documentation 描述了如何从拼写检查中排除单词(通过调用您正在使用的拼写检查规则的 addIgnoreTokens(...)
方法)。
如何将一些词(例如来自特定词典的词)添加到拼写检查中?也就是说,LanguageTool 可以修复拼写错误的单词并从我的特定词典中推荐单词吗?
不幸的是,我认为 API 不支持这一点。如果没有 API,您可以将单词添加到 spelling.txt
以使它们被接受并用作建议。对于 API,您可能需要扩展 MorfologikSpellerRule 并更改 this place of the code。 (披露:我是 LanguageTool 的维护者)
我有类似的需求,就是将一些自定义词加载到字典中作为 "suggest words",而不仅仅是 "ignored words"。最后,我扩展了 MorfologikSpellerRule 来执行此操作:
- 创建 class MorfologikSpellerRuleEx 从 MorfologikSpellerRule 扩展,覆盖方法 "match()",并编写我自己的 "initSpeller()" 来创建拼写器。
- 然后对于语言工具,创建此自定义拼写规则以替换现有规则。
代码:
Language lang = new AmericanEnglish();
JLanguageTool langTool = new JLanguageTool(lang);
langTool.disableRule("MORFOLOGIK_RULE_EN_US");
try {
MorfologikSpellerRuleEx spellingRule = new MorfologikSpellerRuleEx(JLanguageTool.getMessageBundle(), lang);
spellingRule.setSpellingFilePath(spellingFilePath);
//spellingFilePath is the file has my own words + words from /hunspell/spelling_en-US.txt
langTool.addRule(spellingRule);
} catch (IOException e) {
e.printStackTrace();
}
我自定义的MorfologikSpellerRuleEx代码:
public class MorfologikSpellerRuleEx extends MorfologikSpellerRule {
private String spellingFilePath = null;
private boolean ignoreTaggedWords = false;
public MorfologikSpellerRuleEx(ResourceBundle messages, Language language) throws IOException {
super(messages, language);
}
@Override
public String getFileName() {
return "/en/hunspell/en_US.dict";
}
@Override
public String getId() {
return "MORFOLOGIK_SPELLING_RULE_EX";
}
@Override
public void setIgnoreTaggedWords() {
ignoreTaggedWords = true;
}
public String getSpellingFilePath() {
return spellingFilePath;
}
public void setSpellingFilePath(String spellingFilePath) {
this.spellingFilePath = spellingFilePath;
}
private void initSpellerEx(String binaryDict) throws IOException {
String plainTextDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getSpellingFileName())) {
plainTextDict = getSpellingFileName();
}
if (plainTextDict != null) {
BufferedReader br = null;
if (this.spellingFilePath != null) {
try {
br = new BufferedReader(new FileReader(this.spellingFilePath));
}
catch (Exception e) {
br = null;
}
}
if (br != null) {
speller1 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 3);
br.close();
}
else {
speller1 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 3);
}
setConvertsCase(speller1.convertsCase());
} else {
throw new RuntimeException("Could not find ignore spell file in path: " + getSpellingFileName());
}
}
private boolean canBeIgnored(AnalyzedTokenReadings[] tokens, int idx, AnalyzedTokenReadings token)
throws IOException {
return token.isSentenceStart() || token.isImmunized() || token.isIgnoredBySpeller() || isUrl(token.getToken())
|| isEMail(token.getToken()) || (ignoreTaggedWords && token.isTagged()) || ignoreToken(tokens, idx);
}
@Override
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
List<RuleMatch> ruleMatches = new ArrayList<>();
AnalyzedTokenReadings[] tokens = getSentenceWithImmunization(sentence).getTokensWithoutWhitespace();
// lazy init
if (speller1 == null) {
String binaryDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getFileName())) {
binaryDict = getFileName();
}
if (binaryDict != null) {
initSpellerEx(binaryDict); //here's the change
} else {
// should not happen, as we only configure this rule (or rather its subclasses)
// when we have the resources:
return toRuleMatchArray(ruleMatches);
}
}
int idx = -1;
for (AnalyzedTokenReadings token : tokens) {
idx++;
if (canBeIgnored(tokens, idx, token)) {
continue;
}
// if we use token.getToken() we'll get ignored characters inside and speller
// will choke
String word = token.getAnalyzedToken(0).getToken();
if (tokenizingPattern() == null) {
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
int index = 0;
Matcher m = tokenizingPattern().matcher(word);
while (m.find()) {
String match = word.subSequence(index, m.start()).toString();
ruleMatches.addAll(getRuleMatches(match, token.getStartPos() + index, sentence));
index = m.end();
}
if (index == 0) { // tokenizing char not found
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
ruleMatches.addAll(getRuleMatches(word.subSequence(index, word.length()).toString(),
token.getStartPos() + index, sentence));
}
}
}
return toRuleMatchArray(ruleMatches);
}
}
我在我的应用程序中使用 LanguageTool 进行一些拼写检查和拼写更正功能。
LanguageTool documentation 描述了如何从拼写检查中排除单词(通过调用您正在使用的拼写检查规则的 addIgnoreTokens(...)
方法)。
如何将一些词(例如来自特定词典的词)添加到拼写检查中?也就是说,LanguageTool 可以修复拼写错误的单词并从我的特定词典中推荐单词吗?
不幸的是,我认为 API 不支持这一点。如果没有 API,您可以将单词添加到 spelling.txt
以使它们被接受并用作建议。对于 API,您可能需要扩展 MorfologikSpellerRule 并更改 this place of the code。 (披露:我是 LanguageTool 的维护者)
我有类似的需求,就是将一些自定义词加载到字典中作为 "suggest words",而不仅仅是 "ignored words"。最后,我扩展了 MorfologikSpellerRule 来执行此操作:
- 创建 class MorfologikSpellerRuleEx 从 MorfologikSpellerRule 扩展,覆盖方法 "match()",并编写我自己的 "initSpeller()" 来创建拼写器。
- 然后对于语言工具,创建此自定义拼写规则以替换现有规则。
代码:
Language lang = new AmericanEnglish();
JLanguageTool langTool = new JLanguageTool(lang);
langTool.disableRule("MORFOLOGIK_RULE_EN_US");
try {
MorfologikSpellerRuleEx spellingRule = new MorfologikSpellerRuleEx(JLanguageTool.getMessageBundle(), lang);
spellingRule.setSpellingFilePath(spellingFilePath);
//spellingFilePath is the file has my own words + words from /hunspell/spelling_en-US.txt
langTool.addRule(spellingRule);
} catch (IOException e) {
e.printStackTrace();
}
我自定义的MorfologikSpellerRuleEx代码:
public class MorfologikSpellerRuleEx extends MorfologikSpellerRule {
private String spellingFilePath = null;
private boolean ignoreTaggedWords = false;
public MorfologikSpellerRuleEx(ResourceBundle messages, Language language) throws IOException {
super(messages, language);
}
@Override
public String getFileName() {
return "/en/hunspell/en_US.dict";
}
@Override
public String getId() {
return "MORFOLOGIK_SPELLING_RULE_EX";
}
@Override
public void setIgnoreTaggedWords() {
ignoreTaggedWords = true;
}
public String getSpellingFilePath() {
return spellingFilePath;
}
public void setSpellingFilePath(String spellingFilePath) {
this.spellingFilePath = spellingFilePath;
}
private void initSpellerEx(String binaryDict) throws IOException {
String plainTextDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getSpellingFileName())) {
plainTextDict = getSpellingFileName();
}
if (plainTextDict != null) {
BufferedReader br = null;
if (this.spellingFilePath != null) {
try {
br = new BufferedReader(new FileReader(this.spellingFilePath));
}
catch (Exception e) {
br = null;
}
}
if (br != null) {
speller1 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, br, plainTextDict, 3);
br.close();
}
else {
speller1 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 1);
speller2 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 2);
speller3 = new MorfologikMultiSpeller(binaryDict, plainTextDict, 3);
}
setConvertsCase(speller1.convertsCase());
} else {
throw new RuntimeException("Could not find ignore spell file in path: " + getSpellingFileName());
}
}
private boolean canBeIgnored(AnalyzedTokenReadings[] tokens, int idx, AnalyzedTokenReadings token)
throws IOException {
return token.isSentenceStart() || token.isImmunized() || token.isIgnoredBySpeller() || isUrl(token.getToken())
|| isEMail(token.getToken()) || (ignoreTaggedWords && token.isTagged()) || ignoreToken(tokens, idx);
}
@Override
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
List<RuleMatch> ruleMatches = new ArrayList<>();
AnalyzedTokenReadings[] tokens = getSentenceWithImmunization(sentence).getTokensWithoutWhitespace();
// lazy init
if (speller1 == null) {
String binaryDict = null;
if (JLanguageTool.getDataBroker().resourceExists(getFileName())) {
binaryDict = getFileName();
}
if (binaryDict != null) {
initSpellerEx(binaryDict); //here's the change
} else {
// should not happen, as we only configure this rule (or rather its subclasses)
// when we have the resources:
return toRuleMatchArray(ruleMatches);
}
}
int idx = -1;
for (AnalyzedTokenReadings token : tokens) {
idx++;
if (canBeIgnored(tokens, idx, token)) {
continue;
}
// if we use token.getToken() we'll get ignored characters inside and speller
// will choke
String word = token.getAnalyzedToken(0).getToken();
if (tokenizingPattern() == null) {
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
int index = 0;
Matcher m = tokenizingPattern().matcher(word);
while (m.find()) {
String match = word.subSequence(index, m.start()).toString();
ruleMatches.addAll(getRuleMatches(match, token.getStartPos() + index, sentence));
index = m.end();
}
if (index == 0) { // tokenizing char not found
ruleMatches.addAll(getRuleMatches(word, token.getStartPos(), sentence));
} else {
ruleMatches.addAll(getRuleMatches(word.subSequence(index, word.length()).toString(),
token.getStartPos() + index, sentence));
}
}
}
return toRuleMatchArray(ruleMatches);
}
}