Cut a long string into paragraphs containing full sentences
My task is to translate very long texts (more than 50k characters) with online translation APIs (Google, Yandex, etc.). They all have limits on request length, so I want to cut my text into a list of strings, each shorter than that limit, while keeping sentences intact.
For example, here is the text I want to process, with a limit of 300 characters:
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.
I should get this output:
['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.',
'These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java.',
'Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.)',
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages.',
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
What is the most Pythonic way to do this? Is there a regex that could achieve it?
A regular expression is not the right tool for parsing sentences out of a paragraph. You should look at nltk:
import nltk
# this line only needs to be run once per environment:
nltk.download('punkt')
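# note: on newer NLTK releases (3.9+) the sentence tokenizer loads the
# 'punkt_tab' resource instead, so nltk.download('punkt_tab') may be needed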
text = """The Stanford NLP Group makes some of our Natural Language Processing software available to everyone! We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government. This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+. (Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code. You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages."""
sents = nltk.sent_tokenize(text)
sents
# outputs:
['The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!',
'We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.',
'This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis.',
'All our supported software distributions are written in Java.',
'Current versions of our software from October 2014 forward require Java 8+.',
'(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+.',
'The Stanford Parser was first written in Java 1.1.)',
'Distribution packages include components for command-line invocation, jar files, a Java API, and source code.',
'You can also find us on GitHub and Maven.',
'A number of helpful people have extended our work, with bindings or translations for other languages.',
'As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.']
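Note that sent_tokenize keeps the run-on 'needs.These' as one sentence, because the source text is missing a space after the period. If that matters, a heuristic pre-processing step can restore the space before tokenizing. A minimal sketch, assuming glued sentences always look like lowercase-period-uppercase (it can misfire on unusual abbreviations, but the lookbehind leaves tokens like '.NET' alone):
import re

# insert a space after a period squeezed between a lowercase and an
# uppercase letter, e.g. 'needs.These' -> 'needs. These'
text = re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)
sents = nltk.sent_tokenize(text)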
One way to aggregate the sentences by cumulative length is a generator function:
Here, the function g yields a joined string whenever appending the next sentence would push the chunk past 300 characters, or when the end of the iterable is reached. The running total counts the spaces that ' '.join inserts, and the function assumes that no single sentence exceeds the 300-character limit.
def g(sents, limit=300):
    idx = 0          # index of the first sentence in the current chunk
    text_length = 0  # running length of the chunk, joining spaces included
    for i, s in enumerate(sents):
        # ' '.join puts one space before every sentence after the first
        added = len(s) + (1 if text_length else 0)
        if text_length + added > limit:
            yield ' '.join(sents[idx:i])
            text_length = len(s)
            idx = i
        else:
            text_length += added
    yield ' '.join(sents[idx:])
The sentence aggregator can be called like this:
for s in g(sents):
    print(s)
outputs:
The Stanford NLP Group makes some of our Natural Language Processing software available to everyone!
We provide statistical NLP, deep learning NLP, and rule-based NLP tools for major computational linguistics problems, which can be incorporated into applications with human language technology needs.These packages are widely used in industry, academia, and government.
This code is actively being developed, and we try to answer questions and fix bugs on a best-effort basis. All our supported software distributions are written in Java. Current versions of our software from October 2014 forward require Java 8+.
(Versions from March 2013 to September 2014 required Java 1.6+; versions from 2005 to Feb 2013 required Java 1.5+. The Stanford Parser was first written in Java 1.1.) Distribution packages include components for command-line invocation, jar files, a Java API, and source code.
You can also find us on GitHub and Maven. A number of helpful people have extended our work, with bindings or translations for other languages. As a result, much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, F#, and other .NET and JVM languages.
Checking the length of each text segment shows that all of them are under 300 characters:
[len(s) for s in g(sents)]
# outputs:
[100, 268, 244, 276, 289]
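The generator above would emit a sentence unchanged if that sentence alone exceeded the limit. If that can happen with your data, one option is to hard-wrap over-long sentences at whitespace first, using the standard textwrap module. A minimal sketch (split_long is a hypothetical helper introduced here, not part of nltk):
import textwrap

def split_long(sents, limit=300):
    # pass normal sentences through unchanged; hard-wrap anything over
    # the limit at whitespace so every piece fits within `limit` characters
    for s in sents:
        if len(s) <= limit:
            yield s
        else:
            yield from textwrap.wrap(s, width=limit)

chunks = list(g(split_long(sents)))  # the list form the question asks for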