How do I check whether a sentence makes readable sense?

My goal is to be able to detect computer-generated spun content. Here are some examples of spun text:

"As a explicit art fashionable for an advertising organization, you will job to assist put up for auction customers' crop and/or armed forces to their aim marketplace by your original skill and technological ability."

"The actual apple iphone application shop is definitely an abundant cherish residence of useful apps."

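Basically, a computer has swapped words for various synonyms in an attempt to make the content unique and get past plagiarism detection. My goal is to build a system that can detect this gibberish text. What approaches could achieve this?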

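What you want to do is build an n-gram language model. An n-gram language model is a statistical representation of how often sequences of n consecutive words (word pairs, in the bigram case) occur in a language. Such models are used in machine translation, sentiment analysis, and classification tasks such as predicting whether a movie review is positive or negative. Your classification task is deciding whether each sentence is spun content.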
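To make the idea concrete, here is a minimal sketch of a bigram model that scores a sentence by its average log-probability. The `tokenize` helper, the `BigramModel` class, the add-one smoothing, and the three-sentence corpus are all assumptions made for illustration; a real model would be trained on a large corpus of ordinary English.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer with sentence boundary markers (illustrative only).
    return ["<s>"] + text.lower().split() + ["</s>"]


class BigramModel:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = tokenize(sentence)
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def avg_log_prob(self, sentence):
        # Average log P(word | previous word); higher means "more expected".
        tokens = tokenize(sentence)
        pairs = list(zip(tokens, tokens[1:]))
        total = 0.0
        for prev, word in pairs:
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / len(pairs)


# Hypothetical stand-in for a large corpus of normal English.
corpus = [
    "the app store has many useful apps",
    "the store sells useful apps",
    "you will help sell products to customers",
]
model = BigramModel(corpus)
natural = model.avg_log_prob("the store has useful apps")
odd = model.avg_log_prob("the store is definitely an abundant cherish residence")
print(natural, odd)
```

Sentences built from word sequences the model has rarely or never seen get a lower score, which is the signal a spun-content detector would look for.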

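A classification model such as Naive Bayes (implemented in NLTK) can help you with this problem. During training it estimates how likely each word or n-gram feature is under each class, which is in effect a simple language model per class, and then uses those estimates to make predictions. To train the model you need examples of spun content and a large amount of regular English text. The more of both, the better! Every document (you can treat each sentence as a document) should be labeled to indicate whether or not it is spun content.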
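As a rough sketch of that setup, the snippet below trains NLTK's `NaiveBayesClassifier` on word and bigram presence features. The `features` helper and the two "normal" sentences are made up for illustration, and the two spun sentences are the examples from the question; four training sentences are far too few for a usable model.

```python
import nltk


def features(sentence):
    # Presence features for single words and adjacent word pairs (hypothetical helper).
    tokens = sentence.lower().split()
    feats = {f"word({w})": True for w in tokens}
    feats.update({f"bigram({a} {b})": True for a, b in nltk.bigrams(tokens)})
    return feats


# Each sentence is treated as one labeled document.
labeled_sentences = [
    ("The iPhone app store is a rich source of useful apps.", "normal"),
    ("As a graphic designer you will help sell products to customers.", "normal"),
    ("The actual apple iphone application shop is definitely an abundant "
     "cherish residence of useful apps.", "spun"),
    ("As a explicit art fashionable for an advertising organization, you will "
     "job to assist put up for auction customers' crop and/or armed forces to "
     "their aim marketplace by your original skill and technological ability.",
     "spun"),
]

train_set = [(features(text), label) for text, label in labeled_sentences]
classifier = nltk.NaiveBayesClassifier.train(train_set)

prediction = classifier.classify(features("the actual apple iphone application shop"))
print(prediction)
```

With real data you would hold out part of the labeled sentences and measure accuracy on them, for example with `nltk.classify.accuracy(classifier, test_set)`.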

Here is a list of English corpora you can use for the non-spun text.

More sophisticated models may work better, and you can easily compare them side by side. I like to use scikit-learn for this kind of thing.
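One possible way to run such a side-by-side comparison in scikit-learn is sketched below: the same unigram-plus-bigram features feed two different classifiers, and each is scored with cross-validation. The `compare_models` helper, the choice of candidate models, and the toy sentences are assumptions for illustration, not a recommendation of specific models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_models(sentences, labels, cv=2):
    # Hypothetical helper: mean cross-validated accuracy per candidate model.
    candidates = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, clf in candidates.items():
        # Unigram and bigram counts as features, refit inside every fold.
        pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
        scores[name] = cross_val_score(pipeline, sentences, labels, cv=cv).mean()
    return scores


# Toy data only; real training needs many labeled sentences per class.
sentences = [
    "The iPhone app store is a rich source of useful apps.",
    "As a graphic designer you will help sell products to customers.",
    "The actual apple iphone application shop is definitely an abundant cherish residence of useful apps.",
    "You will job to assist put up for auction customers' crop and/or armed forces to their aim marketplace.",
]
labels = ["normal", "normal", "spun", "spun"]

scores = compare_models(sentences, labels)
print(scores)
```

Because every candidate sits behind the same pipeline interface, adding another model to the comparison is a one-line change to `candidates`.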

readability

machine-learning

nltk