How can I group sentences into chunks of up to 500 characters without breaking a sentence, using PHP?

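I've been scratching my head over this but can't come up with a solution.

Suppose you have a text of 5000 characters. I want to split it into chunks of fewer than 500 characters each, without breaking any sentence. For example: if a paragraph is 550 characters long and its last sentence starts at character 450 and ends at character 550, I want that chunk to stop at character 450 (so that no sentence is broken).

Any idea how to achieve this?

My goal is to save each chunk into an array so that I can process the chunks separately.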

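I was thinking of using preg_split, summing the lengths of the pieces, and dropping the last piece once the total exceeds 500 characters. But I'm finding it hard to split the sentences apart without errors.

Any idea which preg_split pattern I should use to make sure every sentence is split off?

I tried this tool but couldn't get the right output: https://www.phpliveregex.com/#tab-preg-split

Thanks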

I think you need this:

$string = "Hello world php is fun";
$array = explode(" ", $string); // split the string on spaces
print_r($array);

The output is:

Array ( [0] => Hello [1] => world [2] => php [3] => is [4] => fun )

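First of all: thanks for the good question!

This solution is not very robust and will need adjusting later, but it shows you one possible way to achieve it.

Split your text into individual sentences and store each sentence as an element of an array. That way you can check the length of each sentence while iterating over the array. As long as the current sentence plus the sentences collected so far stay within the maximum block length, append the sentence to a temporary variable. As soon as the length of the temporary variable plus the length of the current sentence exceeds the maximum block length, the temporary variable is stored as a block in a new array and a new block is started with the current sentence.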

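<?php
$txt = "111. 222 222. 333 333 333. 444 444 444 444. 555 555 555 555 555. 333 333 333. 222 222. 111.";

$length = 30;                // maximum block length
$arr = explode(". ", $txt);  // split into sentences; the ". " separator is removed
$b = [];                     // finished blocks
$tmp = '';                   // block currently being built

foreach ($arr as $k => $s) {
    if (strlen($s) + strlen($tmp) <= $length) {
        // The sentence still fits: append it and restore the separator.
        $tmp = $tmp . $s . '. ';
    } else {
        // The sentence does not fit: store the current block (unless it is
        // empty, which happens when the very first sentence is too long)
        // and start a new block with this sentence.
        if ($tmp !== '') {
            $b[] = $tmp;
        }
        $tmp = $s . '. ';
    }

    if ((count($arr) - 1) === $k) {
        // Last sentence: explode() leaves its own "." in place,
        // so strip the extra ". " appended above before storing the final block.
        $b[] = substr($tmp, 0, -2);
    }
}

print_r($arr);
print_r($b);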

Output:

// Sentence Array
Array
(
    [0] => 111
    [1] => 222 222
    [2] => 333 333 333
    [3] => 444 444 444 444
    [4] => 555 555 555 555 555
    [5] => 333 333 333
    [6] => 222 222
    [7] => 111.
)

// Your new Block Array
Array
(
    [0] => 111. 222 222. 333 333 333. 
    [1] => 444 444 444 444.
    [2] => 555 555 555 555 555.
    [3] => 333 333 333. 222 222. 111.
)
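
The packing step described above can also be sketched as a small function that counts the separator when checking the limit and does not leave trailing spaces on the blocks. This is only a sketch under some assumptions: `packSentences` is a hypothetical helper name, the input sentences already carry their closing punctuation, lengths are measured in bytes with `strlen`, and a single sentence longer than the limit is kept whole in its own block.

```php
<?php
// Hypothetical helper (not part of the answer above): greedily packs
// sentences into blocks of at most $max bytes without splitting a sentence.
function packSentences(array $sentences, int $max): array
{
    $chunks = [];
    $current = '';

    foreach ($sentences as $sentence) {
        $sentence = trim($sentence);
        if ($sentence === '') {
            continue; // ignore empty fragments
        }

        // The block as it would look with this sentence appended,
        // including the single space used as a separator.
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        if (strlen($candidate) <= $max) {
            $current = $candidate;
            continue;
        }

        // The sentence does not fit: close the current block (if any)
        // and start a new one with this sentence.
        if ($current !== '') {
            $chunks[] = $current;
        }
        $current = $sentence; // may exceed $max if the sentence alone is too long
    }

    if ($current !== '') {
        $chunks[] = $current; // flush the last, partially filled block
    }

    return $chunks;
}

$sentences = ['111.', '222 222.', '333 333 333.', '444 444 444 444.',
              '555 555 555 555 555.', '333 333 333.', '222 222.', '111.'];

$chunks = packSentences($sentences, 30);
print_r($chunks);
```

For the original question you would pass 500 as the limit. If the text contains multibyte characters, `strlen` counts bytes rather than characters, so the length check may need adapting.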

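It seems easier to split by sentence first; then you should be able to loop over the sentences and concatenate them until you go past the limit.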

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Basically here, I try to find everything before a `.`

$cleaned = array_filter(array_map('trim', $splited));

var_dump($cleaned);

This gives me:

array(22) {
  [1]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [3]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [5]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [7]=>
  string(42) "Elementum facilisis leo vel fringilla est."
  [9]=>
  string(27) "Sem et tortor consequat id."
  [11]=>
  string(44) "Eleifend donec pretium vulputate sapien nec."
  [13]=>
  string(43) "Elit pellentesque habitant morbi tristique."
  [15]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [17]=>
  string(40) "Quis commodo odio aenean sed adipiscing."
  [19]=>
  string(53) "Id volutpat lacus laoreet non curabitur gravida arcu."
  [21]=>
  string(40) "Sit amet massa vitae tortor condimentum."
  [23]=>
  string(49) "Morbi blandit cursus risus at ultrices mi tempus."
  [25]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [27]=>
  string(38) "Urna et pharetra pharetra massa massa."
  [29]=>
  string(32) "Ut consequat semper viverra nam."
  [31]=>
  string(47) "Hac habitasse platea dictumst quisque sagittis."
  [33]=>
  string(46) "Commodo odio aenean sed adipiscing diam donec."
  [35]=>
  string(45) "Imperdiet proin fermentum leo vel orci porta."
  [37]=>
  string(40) "Quisque non tellus orci ac auctor augue."
  [39]=>
  string(37) "In cursus turpis massa tincidunt dui."
  [41]=>
  string(38) "Purus faucibus ornare suspendisse sed."
  [43]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}
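
As a possible alternative to capturing each sentence as a delimiter, here is a minimal sketch that splits on the whitespace following sentence-ending punctuation. It assumes a sentence ends with ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g. " would be split too. The sample text is made up for illustration.

```php
<?php
// Assumption: a sentence ends with ".", "!" or "?" followed by whitespace.
$text = "First sentence. Second one! A third?\n\nNew paragraph here.";

// Split on the whitespace that follows sentence-ending punctuation.
// The lookbehind keeps the punctuation attached to its sentence, and
// PREG_SPLIT_NO_EMPTY drops empty pieces, so the keys are sequential.
$sentences = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

Because nothing is captured as a delimiter, no `array_filter` pass is needed and the resulting array is indexed 0, 1, 2, and so on.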

A quick update for Maik ;)

$data = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Id cursus metus aliquam eleifend mi in nulla posuere. Hac habitasse platea dictumst vestibulum rhoncus. Elementum facilisis leo vel fringilla est. Sem et tortor consequat id. Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique. Dictumst vestibulum rhoncus est pellentesque elit. Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu. Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus.

Tortor consequat id porta nibh venenatis cras sed. Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam. Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec. Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue. In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed. Tristique senectus et netus et malesuada fames ac turpis.';

$splited = preg_split('/([^.]+\.)/mU', $data, -1, PREG_SPLIT_DELIM_CAPTURE);
// Basically here, I try to find everything before a `.`

$cleaned = array_filter(array_map('trim', $splited));

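$lines = [];
$current = '';
$min = 50; // minimum chunk length: a chunk is closed once it reaches this size

foreach ($cleaned as $sentence) {
  $current .= $sentence . ' '; // The trailing space is needed so another sentence can be appended
  $len_current = strlen($current);

  if ($len_current >= $min) {
    array_push($lines, trim($current)); // Remove the extra space before adding to the lines

    $current = '';
  }
}

// Keep any leftover sentences that never reached the minimum length
if ($current !== '') {
  array_push($lines, trim($current));
}

var_dump($lines);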

Which looks like this:

array(14) {
  [0]=>
  string(123) "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
  [1]=>
  string(53) "Id cursus metus aliquam eleifend mi in nulla posuere."
  [2]=>
  string(49) "Hac habitasse platea dictumst vestibulum rhoncus."
  [3]=>
  string(70) "Elementum facilisis leo vel fringilla est. Sem et tortor consequat id."
  [4]=>
  string(88) "Eleifend donec pretium vulputate sapien nec. Elit pellentesque habitant morbi tristique."
  [5]=>
  string(50) "Dictumst vestibulum rhoncus est pellentesque elit."
  [6]=>
  string(94) "Quis commodo odio aenean sed adipiscing. Id volutpat lacus laoreet non curabitur gravida arcu."
  [7]=>
  string(90) "Sit amet massa vitae tortor condimentum. Morbi blandit cursus risus at ultrices mi tempus."
  [8]=>
  string(50) "Tortor consequat id porta nibh venenatis cras sed."
  [9]=>
  string(71) "Urna et pharetra pharetra massa massa. Ut consequat semper viverra nam."
  [10]=>
  string(94) "Hac habitasse platea dictumst quisque sagittis. Commodo odio aenean sed adipiscing diam donec."
  [11]=>
  string(86) "Imperdiet proin fermentum leo vel orci porta. Quisque non tellus orci ac auctor augue."
  [12]=>
  string(76) "In cursus turpis massa tincidunt dui. Purus faucibus ornare suspendisse sed."
  [13]=>
  string(57) "Tristique senectus et netus et malesuada fames ac turpis."
}