Figuring out if an apostrophe is a quote or contraction

I'm looking for a way to scan a sentence and work out whether each apostrophe is a quotation mark or a contraction, so that I can remove the punctuation from the string and then normalize all the words.

My test sentence is: don't frazzel the horses. 'she said wow'.

In my attempt I split the sentence into word parts, tokenizing on words and non-words, like this:

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]

sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/i).reject! { |word| word.empty? }

This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

Next I want to iterate over the sentence looking for apostrophes ', and when one is found, check whether the next element is contained in the contractionEndings array. If it is, I want to join the prefix, the apostrophe ' and the suffix into a single index; otherwise I want to delete the apostrophe.

In this example, don, ' and t would be joined into a single index as don't, but ". '" and "'." would be removed.
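
Roughly, the joining step I have in mind looks something like this (just a sketch against the token array above, not something I've settled on):

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
tokens = ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

joined = []
i = 0
while i < tokens.length
  if tokens[i + 1] == "'" && contractionEndings.include?(tokens[i + 2])
    # prefix + apostrophe + suffix become a single index, e.g. "don" + "'" + "t"
    joined << tokens[i] + "'" + tokens[i + 2]
    i += 3
  else
    joined << tokens[i] # stray apostrophe tokens are dealt with in the next step
    i += 1
  end
end
joined #=> ["don't", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]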

Then I can run a regex to remove the remaining punctuation from the sentence, so I can pass it into my stemmer to normalize the input.

The final output I want is don't frazzel the horses she said wow, where all punctuation has been removed except the apostrophes in contractions.
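
For the last step I'm imagining something roughly like this (again just a sketch, continuing from the joined array above):

cleaned = joined.join.gsub(/[^\w\s']/, "")     # drop punctuation other than apostrophes
cleaned = cleaned.gsub(/'(?!\w)|(?<!\w)'/, "") # drop apostrophes not inside a word
cleaned.squeeze(" ").strip                     #=> "don't frazzel the horses she said wow"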

If anyone has suggestions for making this work, or a better idea of how to approach the problem, I'd like to hear it.

Overall, I want to remove all punctuation from the sentence except for contractions.

Thanks

How about this?

irb:0> s = "don't frazzel the horses. 'she said wow'."
irb:0> contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
irb:0> s.scan(/\w+(?:'(?:#{contractionEndings.join('|')}))?/)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]

The regex scans some "word" characters, then optionally (with ?) an apostrophe plus a contraction ending. You can interpolate Ruby expressions just as in a double-quoted string, so we take our contraction endings and join them with the regex alternation operator |. The last touch is to mark the group (the part in parentheses) as non-capturing with ?: so that scan doesn't return a bunch of nils, just the whole match on each iteration.

Or you may not need an explicit list of contraction endings at all with this approach. Thanks to Cary, I have also fixed some other problematic constructs.

irb:0> "don't -frazzel's the jack-o'-lantern's handle, ma'am- 'she said hey-ho'.".scan(/\w[-'\w]*\w(?:'\w+)?/)
=> ["don't", "frazzel's", "the", "jack-o'-lantern's", "handle", "ma'am", "she", "said", "hey-ho"]

As I mentioned in a comment, I think trying to list every possible contraction ending is futile. In fact, some contractions, such as "couldn’t’ve", contain more than one apostrophe.
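
For example, simply allowing any number of 'x groups after a word (a rough variant of the scan above, assuming straight apostrophes in the input) keeps such contractions intact without any list of endings:

irb:0> "He couldn't've been right, ma'am.".scan(/\w+(?:'\w+)*/)
=> ["He", "couldn't've", "been", "right", "ma'am"]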

Another option is to match the single quotes. My first thought was to remove the character "'" if it is at the beginning of a sentence or follows a space, or if it is followed by a space or is at the end of a sentence. Unfortunately, that approach is thwarted by possessives of words ending in "s": "Chris' cat has fleas". Worse still, how would we interpret "Where are 'Chris' cars'?" or "'Twas the 'night before Christmas'"?

Here is a way (admittedly questionable) of removing single quotes, provided there are no apostrophes at the beginning or end of words.

r = /
    (?<=\A|\s) # match the beginning of the string or a whitespace char in a
               # positive lookbehind
    \'         # match a single quote
    |          # or 
    \'         # match a single quote
    (?=\s|\z)  # match a whitespace char or the end of the string in a
               # positive lookahead
    /x         # free-spacing regex definition mode

"don't frazzel the horses. 'she said wow'".gsub(r,'')
  #=> "don't frazzel the horses. she said wow" 

I think the best solution would be for English to use different symbols for apostrophes and single quotes.

Usually the apostrophe is kept inside the contraction after tokenization.

Try an ordinary NLP tokenizer, e.g. nltk in Python:

>>> from nltk import word_tokenize
>>> word_tokenize("don't frazzel the horses")
['do', "n't", 'frazzel', 'the', 'horses']

For multiple sentences:

>>> from string import punctuation
>>> from nltk import sent_tokenize, word_tokenize
>>> text = "don't frazzel the horses. 'she said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word for word in word_tokenize(sents[0]) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses']
>>> [word for word in word_tokenize(sents[1]) if word not in punctuation]
["'she", 'said', 'wow']

To flatten the tokenized sentences into a single list of words:

>>> from itertools import chain
>>> sents
["don't frazzel the horses.", "'she said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['do', "n't", 'frazzel', 'the', 'horses', '.'], ["'she", 'said', 'wow', "'", '.']]
>>> list(chain(*[word_tokenize(sent) for sent in sents]))
['do', "n't", 'frazzel', 'the', 'horses', '.', "'she", 'said', 'wow', "'", '.']
>>> [word for word in list(chain(*[word_tokenize(sent) for sent in sents])) if word not in punctuation]
['do', "n't", 'frazzel', 'the', 'horses', "'she", 'said', 'wow']

Note that the single quote stays attached to 'she. Sadly, amid all of today's hype around sophisticated (deep) machine-learning methods, the simple task of tokenization still has its weak points =(

Even formally grammatical text gets it wrong:

>>> text = "Don't frazzel the horses. 'She said wow'."
>>> sents = sent_tokenize(text)
>>> sents
["Don't frazzel the horses.", "'She said wow'."]
>>> [word_tokenize(sent) for sent in sents]
[['Do', "n't", 'frazzel', 'the', 'horses', '.'], ["'She", 'said', 'wow', "'", '.']]

You can use the Pragmatic Tokenizer gem. It can detect English contractions.

s = "don't frazzel the horses. 'she said wow'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]

s = "'Twas the 'night before Christmas'."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["'twas", "the", "night", "before", "christmas"]

s = "He couldn’t’ve been right."
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s)
=> ["he", "couldn’t’ve", "been", "right"]