How do I split words that are run together in Java?

Question

Example: HelloWorld

Expected: Hello World

I tried Solr's tokenizers but could not find one suited to this. What should I do?

Answer 1

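If the tokenizer accepts a regular expression, you can use the following pattern as the split point. It matches the zero-width position between a lowercase letter and the uppercase letter that follows it: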

(?<=[a-z])(?=[A-Z])

Example Java code:

String input = "HelloWorld";
String[] words = input.split("(?<=[a-z])(?=[A-Z])");
System.out.println(Arrays.toString(words));  // [Hello, World]
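
To get the single string "Hello World" the question asks for, the split result can be joined with spaces. The following is a minimal sketch that reuses the pattern above; the class and method names (CamelCaseSplitter, split, toSpaced) are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

final class CamelCaseSplitter {
    // Zero-width boundary: preceded by a lowercase letter, followed by an uppercase one.
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    // Splits the input at every lowercase-to-uppercase transition.
    static String[] split(String input) {
        return input.split(BOUNDARY);
    }

    // Joins the split parts with a single space.
    static String toSpaced(String input) {
        return String.join(" ", split(input));
    }

    public static void main(String[] args) {
        System.out.println(toSpaced("HelloWorld"));                 // Hello World
        System.out.println(Arrays.toString(split("parseXMLFile"))); // [parse, XMLFile]
    }
}
```

Because the boundary requires a lowercase letter on the left, runs of capitals such as "XML" are not separated from the word that follows them; inputs with acronyms or digits would need a broader pattern.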

Answer 2

You can use

String.split(regex);

Example:

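String words = "HelloWorldHi";
// split() treats its argument as the delimiter, so split at the zero-width position before each capital letter
String[] parts = words.split("(?=[A-Z])"); // ["Hello", "World", "Hi"] on Java 8 and later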

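Regex example (RegExr Example). Note that this pattern matches each word itself, so it suits Matcher.find() rather than split():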

[A-Z][a-z]{1,}

Breakdown:

[A-Z]: Matches any single character in the set (from A to Z).
[a-z]: Matches any single character in the set (from a to z).
{1,}: Matches the specified quantity of the previous token. {1,3} matches 1 to 3, {3} matches exactly 3, and {1,} matches 1 or more.
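
Since [A-Z][a-z]{1,} describes a word rather than a separator, one way to apply it is to collect every match with java.util.regex.Matcher. This is a minimal sketch; the class and method names (WordMatcher, words) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordMatcher {
    // One uppercase letter followed by one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[A-Z][a-z]{1,}");

    // Returns every substring of the input that matches WORD, in order.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("HelloWorldHi")); // [Hello, World, Hi]
    }
}
```

Anything that does not match the pattern is silently dropped: for "helloWorld" this returns only [World], because the leading "hello" does not start with a capital letter.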

Answer 3

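Solr's DictionaryCompoundWordTokenFilter is built for this. It is not a tokenizer; it is a filter applied after the tokenizer that splits known words out of a longer token and emits them as separate tokens. It is especially useful for many languages other than English, but it is valuable here as well.

You give it a dictionary of valid words in the language of your choice (in your case these would be hello and world), and the filter extracts them as separate tokens: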

Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
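
To illustrate the idea behind dictionary-based decomposition outside of Solr, here is a rough plain-Java sketch. It is not Solr's implementation: the class name, method name, and the minLen/maxLen parameters are assumptions made for this example, and it only returns the subwords rather than producing positioned tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class CompoundSplitterSketch {
    // Returns every substring of the token (within the length bounds)
    // whose lowercase form appears in the dictionary.
    static List<String> subwords(String token, Set<String> dictionary, int minLen, int maxLen) {
        List<String> found = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minLen; len <= maxLen && start + len <= token.length(); len++) {
                String candidate = token.substring(start, start + len);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> german = new HashSet<>(Arrays.asList("dumm", "kopf", "donau", "dampf", "schiff"));
        System.out.println(subwords("Donaudampfschiff", german, 2, 15)); // [Donau, dampf, schiff]

        Set<String> english = new HashSet<>(Arrays.asList("hello", "world"));
        System.out.println(subwords("HelloWorld", english, 2, 15));      // [Hello, World]
    }
}
```

Unlike the regex approaches, this does not depend on capitalization, so it also handles input such as "helloworld", at the cost of needing a dictionary.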

java

solr