What is the best (cheapest) way to CamelCase complex input strings?

Question

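I have a large number of phrases arriving in real time that need to be transformed to alpha-only CamelCase, with a new word starting at every word boundary and at every split point (any non-alphabetic character).

This is what I have come up with so far, but is there a cheaper and faster way to perform this task?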

function FoxJourneyLikeACamelsHump(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$is = FoxJourneyLikeACamelsHump($string);

Results:

Sentences: 100000000
Total time: 40.844197034836 seconds
Average: 0.000000408 seconds per sentence

Answer 1

You can try this regex:

(?:\b|\d+)([a-z])|[\d+ +!.@]

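UPDATE (Run it here)

Well, the idea above was to show you how it should work in regex.

Below is a PHP implementation of the regex above. You can compare it with yours, since it does the whole job in a single replace operation: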

<?php

$re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
$str = 'Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ';
// Uppercase the captured first letter of each word; delete every other match.
$result = preg_replace_callback($re, function ($matches) {
    return isset($matches[1]) ? strtoupper($matches[1]) : '';
}, $str);

echo $result;

?>

Regex Demo
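
The pattern above lists the characters to delete explicitly and only capitalizes letters that are already lowercase, so it is tied to this particular sample. As a rough sketch of the same single-replace idea without a hard-coded character list, something like the following could be used. The function name camelCaseSinglePass and the pattern are my own assumptions, not part of the answer, and [:alpha:] without the u modifier only covers ASCII letters.

```php
<?php

// Hypothetical single-pass variant: one preg_replace_callback call, no character list.
function camelCaseSinglePass(string $string): string {
    // Match a run of non-letters (or the start of the string), optionally
    // followed by one letter, which is captured in group 1.
    $re = '/(?:[^[:alpha:]]+|^)([[:alpha:]]?)/';
    return preg_replace_callback($re, function ($matches) {
        // The separators are dropped; only the uppercased letter is kept.
        return strtoupper($matches[1]);
    }, $string);
}

echo camelCaseSinglePass(" Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. "), "\n";
// prints ThQuCkBrWnFXJumpsVRThLZyDG
```

It still pays for one callback invocation per word, so there is no reason to assume it is faster than the original; it would have to be measured.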

Answer 2

After benchmarking it against 3 alternatives, I believe your method is the fastest. Here are the results, in seconds, for 100,000 iterations:

array(4) {
  ["Test1"]=>
  float(0.23144102096558)
  ["Test2"]=>
  float(0.41140103340149)
  ["Test3"]=>
  float(0.31215810775757)
  ["Test4"]=>
  float(0.98423790931702)
}

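Here Test1 is yours, Test2 and Test3 are mine, and Test4 is from @RizwanMTuman's answer (with a fix).

I thought using preg_split might give you an opportunity to optimize. This function uses only 1 regex, which returns an array containing only the alpha items, and you then apply ucfirst to each of them: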

function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

This can be optimized further by using foreach instead of array_map (see here):

function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

This leads me to speculate that 2 regexes plus 1 ucwords call are faster than 1 regex plus multiple ucfirst calls.

Full test script:

<?php

// yours
function FoxJourneyLikeACamelsHump_1(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// mine v1
function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

// mine v2
function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

// Rizwan with a fix
function FoxJourneyLikeACamelsHump_4(string $string): string {
    $re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
    $result = preg_replace_callback($re,function ($matches) {
        return (isset($matches[1]) ? strtoupper($matches[1]) : '');
    },$string);
    return $result;
}


// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$test1 = 0;
$test2 = 0;
$test3 = 0;
$test4 = 0;

$loops = 100000;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_1($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test1 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_2($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test2 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_3($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test3 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_4($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test4 = $time_end - $time_start;

var_dump(array('Test1'=>$test1, 'Test2'=>$test2, 'Test3'=>$test3, 'Test4'=>$test4));
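
The four timing loops in the script are identical apart from the function they call, so they could be folded into one helper. The following is only a sketch of that idea: the benchmark function and the $candidates array are names I made up, and two shortened stand-ins replace the four functions so that the snippet runs on its own.

```php
<?php

// Hypothetical helper: seconds taken by $loops calls of $fn on $input.
function benchmark(callable $fn, string $input, int $loops): float {
    $start = microtime(true);
    for ($i = 0; $i < $loops; $i++) {
        $fn($input);
    }
    return microtime(true) - $start;
}

// Stand-in candidates; in the full script these would be
// 'FoxJourneyLikeACamelsHump_1' through 'FoxJourneyLikeACamelsHump_4'.
$candidates = [
    'Test1' => function (string $s): string {
        $s = ucwords(preg_replace('/[^[:alpha:][:space:]]/u', ' ', $s));
        return preg_replace('/\s+/', '', $s);
    },
    'Test3' => function (string $s): string {
        $result = '';
        foreach (preg_split('/[^[:alpha:]]/u', $s, -1, PREG_SPLIT_NO_EMPTY) as $item) {
            $result .= ucfirst($item);
        }
        return $result;
    },
];

$input = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$results = [];
foreach ($candidates as $name => $fn) {
    $results[$name] = benchmark($fn, $input, 100000);
}
var_dump($results);
```

Timings from a loop like this vary between runs and machines, so only the relative order of the candidates is meaningful.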

Answer 3

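Your code is quite efficient. You can still improve it with a couple of tweaks:

Pass the delimiter to ucwords so that it does not have to look for \t, \n, and so on. This is only safe if the input contains no whitespace other than plain spaces, because the first step keeps every [:space:] character and tabs or newlines would survive it. On average this gives a 1% improvement;
You can perform the last step with a non-regex replacement on the space character. This gives up to a 20% improvement.

Code: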

function FoxJourneyLikeACamelsHump(string $string): string {
    $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
    $string = ucwords($string, ' ');
    $camelCase = str_replace(' ', '', $string);
    return $camelCase;
}

See the timings of the original and the improved version on rextester.com.
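
One caveat with the code above: the first step keeps all [:space:] characters, so a tab or newline in the input is neither treated as a word boundary by ucwords($string, ' ') nor removed by str_replace. For example, "foo\tbar" comes back as "Foo\tbar". A possible way around this, sketched here under the assumption that every non-letter should act as a separator, is to turn each run of non-letters into a single space in the first step. The name camelCaseSpacesOnly is made up for the example, and I have not measured how this affects the timings quoted above.

```php
<?php

// Hypothetical variant of the improved function.
function camelCaseSpacesOnly(string $string): string {
    // Collapse every run of non-letters (tabs and newlines included) into one
    // space, so a plain space is the only separator left for the next steps.
    $string = preg_replace('/[^[:alpha:]]+/u', ' ', $string);
    $string = ucwords($string, ' ');
    return str_replace(' ', '', $string);
}

echo camelCaseSpacesOnly("foo\tbar\nbaz"), "\n"; // prints FooBarBaz
```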

Note: because it relies on ucwords, your code will generally not work reliably on Unicode strings. For that you would need a function like mb_convert_case:

$string = mb_convert_case($string,  MB_CASE_TITLE);

...but this will have an impact on performance.
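
To show where that call would fit, here is a minimal sketch of a Unicode-aware version. It assumes UTF-8 input and the mbstring extension; the function name camelCaseUnicode and the \PL pattern (any character that is not a letter) are my own choices. Note that MB_CASE_TITLE also lowercases the rest of each word, which ucwords does not do.

```php
<?php

// Hypothetical Unicode-aware variant; requires the mbstring extension.
function camelCaseUnicode(string $string): string {
    // Replace every run of non-letters with a single space.
    $string = preg_replace('/\PL+/u', ' ', $string);
    // Title-case each word: first letter upper, remaining letters lower.
    $string = mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
    return str_replace(' ', '', $string);
}

echo camelCaseUnicode('liberté égalité fraternité'), "\n";
// prints LibertéÉgalitéFraternité
```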

Answer 4

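Before thinking about improving the performance of your code, you first need to build code that works. You are in fact trying to build code that handles UTF-8 encoded strings (since you added the u modifier to your pattern), but with the string liberté égalité fraternité your code capitalizes it as Liberté égalité Fraternité instead of Liberté Égalité Fraternité (spaces kept here for readability), because ucwords (or ucfirst) cannot deal with multibyte characters.

After trying different approaches (with preg_split and preg_replace_callback), it seems that this preg_match_all version is the fastest: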

function FoxJourneyLikeACamelsHumpUPMA(string $string): string {
    preg_match_all('~\pL+~u', $string, $m);
    foreach ($m[0] as &$v) {
        $v = mb_strtoupper(mb_substr($v, 0, 1)) . mb_strtolower(mb_substr($v, 1));
    }
    return implode('', $m[0]);
}

Obviously, it is slower than your initial code, but we cannot really compare these pieces of code, since yours does not work.
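
The difference can be checked with a small side-by-side script. This is only a sketch: camelCaseAscii is a copy of the function from the question under a name I picked, camelCaseLetters is the preg_match_all version with the encoding passed explicitly, and the mbstring extension is assumed.

```php
<?php

// The approach from the question (byte-based ucwords).
function camelCaseAscii(string $string): string {
    $string = preg_replace('/[^[:alpha:][:space:]]/u', ' ', $string);
    return preg_replace('/\s+/', '', ucwords($string));
}

// The preg_match_all approach, with the encoding spelled out.
function camelCaseLetters(string $string): string {
    preg_match_all('~\pL+~u', $string, $m);
    foreach ($m[0] as &$v) {
        $v = mb_strtoupper(mb_substr($v, 0, 1, 'UTF-8'), 'UTF-8')
           . mb_strtolower(mb_substr($v, 1, null, 'UTF-8'), 'UTF-8');
    }
    unset($v); // drop the reference left over from the loop
    return implode('', $m[0]);
}

$inputs = [
    " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ", // ASCII letters only
    'liberté égalité fraternité',                      // accented initial letter
];
foreach ($inputs as $input) {
    printf("%s | %s\n", camelCaseAscii($input), camelCaseLetters($input));
}
```

On the ASCII sample both functions give ThQuCkBrWnFXJumpsVRThLZyDG; on the second input camelCaseLetters gives LibertéÉgalitéFraternité, while the byte-based version leaves the é of égalité in lowercase.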

Tags: php, regex, camelcasing, php-7, php-7.1