How do I split words that are joined together in Java?
Example: HelloWorld
Expected: Hello World
I tried Solr's tokenizers, but I could not find one suited to this.
What should I do?
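If the tokenizer accepts a regular expression, you can split on the following pattern, which matches the zero-width position between a lowercase letter and an uppercase letter:
(?<=[a-z])(?=[A-Z])
Example Java code:
import java.util.Arrays;
String input = "HelloWorld";
String[] words = input.split("(?<=[a-z])(?=[A-Z])");
System.out.println(Arrays.toString(words)); // [Hello, World]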
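The snippet above can be wrapped into a small, self-contained class. This is only a sketch: the class name CamelCaseSplitter and the helper methods splitWords and toSpaced are made up for illustration. It also shows one limitation worth knowing: because the pattern only splits between a lowercase and an uppercase letter, a run of capitals such as an acronym stays attached to the word that follows it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class CamelCaseSplitter {
    // Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[a-z])(?=[A-Z])");

    // Splits the input at every lowercase-to-uppercase transition.
    static String[] splitWords(String input) {
        return BOUNDARY.split(input);
    }

    // Joins the split words with single spaces, e.g. "HelloWorld" -> "Hello World".
    static String toSpaced(String input) {
        return String.join(" ", splitWords(input));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("HelloWorld")));   // [Hello, World]
        System.out.println(toSpaced("HelloWorldHi"));                    // Hello World Hi
        // Acronyms are not separated from the following word:
        System.out.println(Arrays.toString(splitWords("parseXMLFile"))); // [parse, XMLFile]
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which is what String.split does for a pattern like this one.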
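You can use
String.split(regex);
Example:
String words = "HelloWorldHi";
// split() treats the regex as the delimiter, so it must match the boundary between words, not the words themselves.
String[] parts = words.split("(?<=[a-z])(?=[A-Z])"); // ["Hello", "World", "Hi"]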
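Regex example (RegExr Example). This pattern matches each capitalized word itself, so apply it with Matcher.find() rather than passing it to split(), which would discard the matched words: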
[A-Z][a-z]{1,}
Breakdown:
[A-Z]: Matches any character in the set (from A to Z).
[a-z]: Matches any character in the set (from a to z).
{1,}: Matches the specified quantity of the previous token. {1,3} matches 1 to 3, {3} matches exactly 3, and {1,} matches 1 or more.
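Since [A-Z][a-z]{1,} matches the words rather than the gaps between them, one way to use it is to collect every match with java.util.regex.Matcher. The following is a minimal sketch; the class name WordMatcher and the method findWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class WordMatcher {
    // One uppercase letter followed by one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[A-Z][a-z]{1,}");

    // Returns every substring of the input that matches WORD, in order.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("HelloWorldHi")); // [Hello, World, Hi]
        // Text that does not match the pattern is silently dropped:
        System.out.println(findWords("helloWorld"));   // [World]
    }
}
```

Unlike the boundary-based split, this approach discards anything that does not look like a capitalized word (a leading lowercase word, digits, single capital letters), so it fits best when the input is known to be strictly in the HelloWorld form.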
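DictionaryCompoundWordTokenFilter is built for this in Solr. It is not a tokenizer; it is a filter applied after the tokenizer that splits known words found inside a longer token into separate tokens. It is especially useful for languages other than English, but it is valuable here as well.
You give it a dictionary of valid words for your language (in your example, these would be hello and world), and the filter emits them as separate tokens: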
Assume that germanwords.txt contains at least the following words: dumm kopf donau dampf schiff
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
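To make the In/Out example above concrete, here is a plain-Java sketch of the general idea of dictionary-based decompounding: keep the original token, then emit every substring that appears in the dictionary. This is not Solr's or Lucene's implementation and does not use their API; the class name DictionarySplitSketch, the method decompose, and the two length limits are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified illustration only; not the Solr/Lucene filter.
class DictionarySplitSketch {
    // Hypothetical subword length limits chosen for this sketch.
    private static final int MIN_SUBWORD = 2;
    private static final int MAX_SUBWORD = 15;

    // Returns the original token followed by every proper substring found in the
    // dictionary (compared case-insensitively; dictionary entries are lowercase).
    static List<String> decompose(String token, Set<String> dictionary) {
        List<String> out = new ArrayList<>();
        out.add(token); // the original token is always kept
        for (int start = 0; start < token.length(); start++) {
            int maxEnd = Math.min(token.length(), start + MAX_SUBWORD);
            for (int end = start + MIN_SUBWORD; end <= maxEnd; end++) {
                if (end - start == token.length()) {
                    continue; // skip the whole token, it is already in the output
                }
                String candidate = token.substring(start, end);
                if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> german = new HashSet<>(Arrays.asList("dumm", "kopf", "donau", "dampf", "schiff"));
        System.out.println(decompose("Donaudampfschiff", german)); // [Donaudampfschiff, Donau, dampf, schiff]
        System.out.println(decompose("dummkopf", german));         // [dummkopf, dumm, kopf]

        Set<String> english = new HashSet<>(Arrays.asList("hello", "world"));
        System.out.println(decompose("HelloWorld", english));      // [HelloWorld, Hello, World]
    }
}
```

The dictionary approach does not depend on capitalization at all, which is why it also handles inputs like dummkopf where a case-based regex has nothing to split on.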