Extract meaningful words from spaceless texts
I haven't done much NLP, but I have a need for it. For example, given the string 'australiafreedomrally', I need to automatically extract the meaningful words, i.e. 'australia', 'freedom', and 'rally'.
Is there a Python package that can do this? Thanks!
Check out this thread, where among other things a package is mentioned that does this. Generally, an approach with a predefined list of common words can get you quite far. Your question also overlaps with the task of Optical Character Recognition (OCR) post-correction, for which you can find some pretrained models, although your problem's strong skew toward one specific issue (missing whitespace) may mean they don't perform too well.
If you want to really get into this topic, you could try training a new model on this task. I could imagine that the recently popular transformer models, which use subtoken-level embeddings for unknown words, could be trained to perform well here, since some of them aim in a similar direction to grammar correction and sentence boundary detection. There are also some older, rule-based papers that call this problem "word boundary detection" or, more specifically, "agglutination"; see e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351975/. In general, though, the number of off-the-shelf solutions you will find for this problem is very small.
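To illustrate the predefined-word-list approach, here is a minimal dynamic-programming sketch. The tiny `WORDS` set is a stand-in for a real dictionary (in practice you would load a large word list, or use a frequency-weighted package such as `wordninja`), so treat the vocabulary here as a hypothetical example:

```python
def segment(text, words):
    """Split a spaceless string into dictionary words via dynamic programming.

    best[i] holds a valid segmentation of text[:i], or None if none exists.
    """
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []  # empty prefix segments trivially
    for i in range(1, n + 1):
        for j in range(i):
            # extend a known segmentation of text[:j] with the word text[j:i]
            if best[j] is not None and text[j:i] in words:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]  # None if the text cannot be fully segmented

# Toy dictionary for demonstration only
WORDS = {"australia", "freedom", "rally", "free", "dom"}

print(segment("australiafreedomrally", WORDS))
# → ['australia', 'freedom', 'rally']
```

Note that with a realistic dictionary there are usually many valid segmentations (e.g. "free" + "dom" vs. "freedom"), which is why practical tools weight candidates by word frequency instead of returning the first match.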