Snowball Stemming: defining Regions

I am trying to understand the Snowball stemming algorithm. The algorithm uses two regions, R1 and R2, which are defined as follows:

R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if there is no such non-vowel.

R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if there is no such non-vowel.

http://snowball.tartarus.org/texts/r1r2.html

The examples given are:

    b   e   a   u   t   i   f   u   l
                      |<------------->|    R1
                              |<----->|    R2

   b   e   a   u   t   y
                     |<->|    R1
                       ->|<-  R2

   a   n   i   m   a   d   v   e   r   s   i   o   n
        |<----------------------------------------->|    R1
                |<--------------------------------->|    R2

   s   p   r   i   n   k   l   e   d
                     |<------------->|    R1
                                   ->|<-  R2

    e   u   c   h   a   r   i   s   t
            |<--------------------->|    R1
                        |<--------->|    R2

My question is: why are "kled" in sprinkled and "harist" in eucharist defined as R1? I thought the correct results would be "inkled" and "arist".

You should read the definition again. It says:

R1 is the region after the first non-vowel following a vowel.

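Not: the first non-vowel followed by a vowel. The non-vowel has to come after a vowel, not before one.

In sprinkled, the first non-vowel following a vowel is n (it follows i), so the region after it is kled.

The same goes for eucharist: the first non-vowel following a vowel is c (it follows u), so the region after it is harist.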
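
If it helps, here is a minimal Python sketch of the quoted definition. It is only an illustration, not the Snowball implementation: the names `VOWELS`, `region_start` and `r1_r2` are made up for this answer, the vowel set `aeiouy` is an assumption for English, and any language-specific special cases are ignored.

```python
# Assumption: vowels for English; other languages use different sets.
VOWELS = set("aeiouy")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel.

    Scanning begins at `start`; returns len(word) (the null region
    at the end of the word) if there is no such non-vowel.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return (R1, R2) as substrings of `word`."""
    p1 = region_start(word)       # R1 starts here
    p2 = region_start(word, p1)   # same rule, applied inside R1
    return word[p1:], word[p2:]


for w in ["beautiful", "beauty", "animadversion", "sprinkled", "eucharist"]:
    print(w, r1_r2(w))
```

For sprinkled this returns `('kled', '')` and for eucharist `('harist', 'ist')`, which matches the diagrams in the question.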