Extract meaningful words from spaceless texts

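I haven't done much NLP, but I have a use case for it. For example, given the string 'australiafreedomrally', I need to automatically extract the meaningful words, i.e. 'australia', 'freedom' and 'rally'.

Is there a Python package that can do this? Thanks.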

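Check out this thread, where among other things a package is mentioned which does this. Generally, an approach with a predefined list of common words can get you far. Your question overlaps with the task of Optical Character Recognition (OCR) post-correction, and you can find some pretrained models for that, although they may not perform well here because your problem is skewed so strongly toward a single kind of error (missing whitespace characters).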
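To illustrate the word-list idea, here is a minimal sketch using dynamic programming. It is not the package from the thread; the tiny `VOCAB` set and the `segment` function are hypothetical names made up for this example, and a real word list would be far larger. It simply picks the split that uses the fewest vocabulary words.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary; replace with a real list of common words.
VOCAB = {"australia", "freedom", "rally", "free", "dom", "a", "us", "all"}


def segment(text: str, vocab: Set[str] = VOCAB) -> Optional[List[str]]:
    """Split text into vocabulary words, using as few words as possible.

    Returns None if the text cannot be covered by vocabulary words.
    """
    max_len = max(map(len, vocab))
    # best[i] holds the shortest known segmentation of text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocab:
                continue
            current = best[end]
            if current is None or len(prefix) + 1 < len(current):
                best[end] = prefix + [word]
    return best[-1]


print(segment("australiafreedomrally"))  # ['australia', 'freedom', 'rally']
```

Preferring fewer words is a crude heuristic that keeps 'freedom' from being split into 'free' and 'dom'. Scoring candidate splits by word frequency would probably be more robust on real data.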

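If you want to really get into the topic, you could try training a new model on this task. I could imagine that the recently popular transformer models, which use subtoken-level embeddings for unknown words, could be trained to perform quite well here, since some models go in a similar direction, such as grammar correction and sentence boundary correction. There are also some older, rule-based approach papers which call this problem "word boundary detection" or more specifically "agglutination"; check out e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351975/. In general, though, the number of off-the-shelf solutions you will find for this problem is very small.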

nlp

nltk

stanford-nlp

gensim

spacy