How do I split words that are joined together in Java?

For example: HelloWorld

Expected: Hello World

I tried Solr's tokenizers but could not find one suited to this. What should I do?

If the tokenizer accepts a regular expression, you can use the following pattern as the split point:

(?<=[a-z])(?=[A-Z])

Sample Java code:

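import java.util.Arrays;

String input = "HelloWorld";
// Split at each zero-width position between a lowercase and an uppercase letter.
String[] words = input.split("(?<=[a-z])(?=[A-Z])");
System.out.println(Arrays.toString(words));  // [Hello, World]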
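For anyone who wants to run this directly, here is a minimal self-contained sketch that wraps the same pattern and also joins the pieces back into the expected "Hello World" form. The class name CamelCaseSplitter and the helper names splitWords and spaceOut are made up for illustration and do not come from any library.

```java
import java.util.Arrays;

public class CamelCaseSplitter {
    // Zero-width boundary: preceded by a lowercase letter, followed by an uppercase one.
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    // Hypothetical helper: returns the individual words of a joined string.
    static String[] splitWords(String input) {
        return input.split(BOUNDARY);
    }

    // Hypothetical helper: re-joins the words with single spaces.
    static String spaceOut(String input) {
        return String.join(" ", splitWords(input));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("HelloWorld"))); // [Hello, World]
        System.out.println(spaceOut("HelloWorld"));                     // Hello World
    }
}
```

Note that the pattern only splits where a lowercase letter is followed by an uppercase one, so an input such as "HTTPServer" comes back as a single word.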

You can use

String.split(condition);

Example:

String words = "HelloWorldHi";
String[] parts = words.split("regex"); // With a regex that matches the boundaries between words, this returns ["Hello", "World", "Hi"]

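Regex example (RegExr Example). Note that this pattern matches each word itself, so it is suited to Matcher.find() rather than to split(), which would discard whatever the pattern matches: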

[A-Z][a-z]{1,}

Breakdown:

[A-Z]: Matches any character in the set (A to Z).
[a-z]: Matches any character in the set (a to z).
{1,}: Matches the specified quantity of the previous token. {1,3} matches 1 to 3, {3} matches exactly 3, and {1,} matches 1 or more.
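Because [A-Z][a-z]{1,} matches the words rather than the gaps between them, one way to apply it is to collect each match with java.util.regex.Matcher. The following is a minimal sketch; the class name WordMatcher and the method findWords are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordMatcher {
    // One uppercase letter followed by one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[A-Z][a-z]{1,}");

    // Hypothetical helper: collects every match of WORD, in order.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("HelloWorldHi")); // [Hello, World, Hi]
    }
}
```

Passing the same pattern to String.split would instead remove the words and leave nothing useful, since split treats the matched text as the delimiter.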

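DictionaryCompoundWordTokenFilter is built for this in Solr. It is not a tokenizer; it runs as a filter after the tokenizer and splits known words out of longer strings into separate tokens. This is particularly useful in many languages other than English, but it is valuable here as well.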

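You give it a dictionary of valid words in the language of your choice (in your example, these would be hello and world), and the filter extracts them into separate tokens: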

Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>

In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
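To make the In/Out listing above easier to follow, here is a rough plain-Java sketch of the idea behind dictionary-based decompounding. It is only an illustration under simplifying assumptions, not Lucene's or Solr's actual implementation: it ignores token positions and the filter's size limits, and the class name DictionaryDecompounder and the method decompound are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DictionaryDecompounder {
    // Hypothetical helper: returns the original token followed by every
    // proper substring whose lowercase form is in the dictionary.
    static List<String> decompound(String token, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token is always kept
        for (int start = 0; start < token.length(); start++) {
            for (int end = start + 1; end <= token.length(); end++) {
                if (start == 0 && end == token.length()) {
                    continue; // skip the whole token, it was already added
                }
                String candidate = token.substring(start, end);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> german = new HashSet<>(
                Arrays.asList("dumm", "kopf", "donau", "dampf", "schiff"));
        System.out.println(decompound("Donaudampfschiff", german)); // [Donaudampfschiff, Donau, dampf, schiff]
        System.out.println(decompound("dummkopf", german));         // [dummkopf, dumm, kopf]
    }
}
```

With a dictionary containing hello and world, the same sketch turns "HelloWorld" into HelloWorld, Hello, World.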