查找人口最多的年份(最有效的解决方案)

Find the year with the highest population (most efficient solution)

给定两个数组; $births 包含表明某人出生时间的出生年份列表,$deaths 包含表明某人死亡时间的死亡年份列表,我们如何找到人口最多的年份?

例如给定以下数组:

$births = [1984, 1981, 1984, 1991, 1996];
$deaths = [1991, 1984];

人口最多的那一年应该是1996,因为那一年有3人在世,是所有那些年中人口最多的一年。

这是 运行 的数学公式:

| Birth | Death | Population |
|-------|-------|------------|
| 1981  |       | 1          |
| 1984  |       | 2          |
| 1984  | 1984  | 2          |
| 1991  | 1991  | 2          |
| 1996  |       | 3          |

假设

我们可以安全地假设某人出生的那一年人口会增加一,而某人去世的那一年人口会减少一。所以在这个例子中,1984年出生了2个人,1984年死了1个人,也就是说那一年人口增加了1。

我们还可以安全地假设死亡人数永远不会超过出生人数,并且当人口为 0 时不会发生死亡。

我们还可以安全地假设 $deaths$births 中的年份永远不会是负值或浮点值(它们总是大于 0).

We cannot assume that the arrays will be sorted or that there won't be duplicate values, however.

要求

我们必须编写一个函数来 return 人口最多的那一年,将这两个数组作为输入。该函数可能 return 0false""NULL任何虚假值都是可接受的)如果输入数组为空,或者总体始终为 0。如果最高人口出现在多个年份,函数可能 return 达到最高人口的第一年或任何后续年份。

例如:

$births = [1997, 1997, 1997, 1998, 1999];
$deaths = [1998, 1999];

/* The highest population was 3 on 1997, 1998 and 1999, either answer is correct */

此外,包含解决方案的 Big O 会有所帮助。


我对此的最佳尝试如下:

function highestPopulationYear(Array $births, Array $deaths): Int {

    sort($births);
    sort($deaths);

    $nextBirthYear = reset($births);
    $nextDeathYear = reset($deaths);

    $years = [];
    if ($nextBirthYear) {
        $years[] = $nextBirthYear;
    }
    if ($nextDeathYear) {
        $years[] = $nextDeathYear;
    }

    if ($years) {
        $currentYear = max(0, ...$years);
    } else {
        $currentYear = 0;
    }

    $maxYear = $maxPopulation = $currentPopulation = 0;

    while(current($births) !== false || current($deaths) !== false || $years) {

        while($currentYear === $nextBirthYear) {
            $currentPopulation++;
            $nextBirthYear = next($births);
        }

        while($currentYear === $nextDeathYear) {
            $currentPopulation--;
            $nextDeathYear = next($deaths);
        }

        if ($currentPopulation >= $maxPopulation) {
            $maxPopulation = $currentPopulation;
            $maxYear = $currentYear;
        }

        $years = [];

        if ($nextBirthYear) {
            $years[] = $nextBirthYear;
        }
        if ($nextDeathYear) {
            $years[] = $nextDeathYear;
        }
        if ($years) {
            $currentYear = min($years);
        } else {
            $currentYear = 0;
        }
    }

    return $maxYear;
}

上面的算法应该在多项式时间内工作,因为它是最坏的 O(((n log n) * 2) + k) 其中 n 是每个数组中要排序的元素数,k 是出生年份数(因为我们知道 k 总是 k >= y)其中 y 是死亡年数。但是,我不确定是否有更有效的解决方案。

My interests are purely in an improved Big O of computational complexity upon the existing algorithm. Memory complexity is of no concern. Nor is the runtime optimization. At least it's not a primary concern. Any minor/major runtime optimizations are welcome, but not the key factor here.

我认为我们可以有 O(n log n) 时间和 O(1) 额外的 space 通过首先排序,然后在我们迭代时保持当前人口和全局最大值。我尝试使用当年作为参考点,但逻辑似乎仍然有点棘手,所以我不确定它是否已完全解决。希望它可以提供有关该方法的想法。

JavaScript代码(counterexamples/bugs欢迎)

function f(births, deaths){
  births.sort((a, b) => a - b);
  deaths.sort((a, b) => a - b);

  console.log(JSON.stringify(births));
  console.log(JSON.stringify(deaths));
  
  let i = 0;
  let j = 0;
  let year = births[i];
  let curr = 0;
  let max = curr;

  while (deaths[j] < births[0])
    j++;

  while (i < births.length || j < deaths.length){
    while (year == births[i]){
      curr = curr + 1;
      i = i + 1;
    }
    
    if (j == deaths.length || year < deaths[j]){
      max = Math.max(max, curr);
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    
    } else if (j < deaths.length && deaths[j] == year){
      while (deaths[j] == year){
        curr = curr - 1;
        j = j + 1;
      }
      max = Math.max(max, curr);
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    }

    if (j < deaths.length && deaths[j] > year && (i == births.length || deaths[j] < births[i])){
      year = deaths[j];
      while (deaths[j] == year){
        curr = curr - 1;
        j = j + 1;
      }
      console.log(`year: ${ year }, max: ${ max }, curr: ${ curr }`);
    }

    year = births[i];
  }
  
  return max;
}

var input = [
  [[1997, 1997, 1997, 1998, 1999],
  [1998, 1999]],
  [[1, 2, 2, 3, 4],
  [1, 2, 2, 5]],
  [[1984, 1981, 1984, 1991, 1996],
  [1991, 1984, 1997]],
  [[1984, 1981, 1984, 1991, 1996],
  [1991, 1982, 1984, 1997]]
]

for (let [births, deaths] of input)
  console.log(f(births, deaths));

如果年份范围 m 的顺序为 n,我们可以将每年的计数存储在该范围内,时间复杂度为 O(n)。如果我们想要花哨的,我们也可以有 O(n * log log m) 时间复杂度,通过使用 Y-fast trie 允许在 O(log log m) 时间内进行后继查找。

我们可以用桶排序在线性时间内解决这个问题。假设输入的大小是n,年的范围是m。

O(n): Find the min and max year across births and deaths.
O(m): Create an array of size max_yr - min_yr + 1, ints initialized to zero. 
      Treat the first cell of the array as min_yr, the next as min_yr+1, etc...
O(n): Parse the births array, incrementing the appropriate index of the array. 
      arr[birth_yr - min_yr] += 1
O(n): Ditto for deaths, decrementing the appropriate index of the array.
      arr[death_yr - min_yr] -= 1
O(m): Parse your array, keeping track of the cumulative sum and its max value.

最大的累积最大值就是你的答案。

运行的时间是O(n+m),额外需要的space是O(m)。

如果 m 是 O(n),则这是 n 的线性解;即,如果年份范围的增长速度不超过出生和死亡人数的增长速度。对于现实世界的数据,这几乎肯定是正确的。

$births = [1984, 1981, 1984, 1991, 1996];
$deaths = [1991, 1984];
$years = array_unique(array_merge($births, $deaths));
sort($years);

$increaseByYear = array_count_values($births);
$decreaseByYear = array_count_values($deaths);
$populationByYear = array();

foreach ($years as $year) {
    $increase = $increaseByYear[$year] ?? 0;
    $decrease = $decreaseByYear[$year] ?? 0;
    $previousPopulationTally = end($populationByYear);
    $populationByYear[$year] = $previousPopulationTally + $increase - $decrease;
}

$maxPopulation = max($populationByYear);
$maxPopulationYears = array_keys($populationByYear, $maxPopulation);

$maxPopulationByYear = array_fill_keys($maxPopulationYears, $maxPopulation);
print_r($maxPopulationByYear);

这将解释平局年份的可能性,以及某人死亡的年份与某人的出生年份不对应的情况。

首先将出生和死亡汇总到地图 (year => population change) 中,按键排序,然后计算 运行 人口。

这应该大约是 O(2n + n log n),其中 n 是出生人数。

$births = [1984, 1981, 1984, 1991, 1996];
$deaths = [1991, 1984];

function highestPopulationYear(array $births, array $deaths): ?int
{
    $indexed = [];

    foreach ($births as $birth) {
        $indexed[$birth] = ($indexed[$birth] ?? 0) + 1;
    }

    foreach ($deaths as $death) {
        $indexed[$death] = ($indexed[$death] ?? 0) - 1;
    }

    ksort($indexed);

    $maxYear = null;
    $max = $current = 0;

    foreach ($indexed as $year => $change) {
        $current += $change;
        if ($current >= $max) {
            $max = $current;
            $maxYear = $year;
        }
    }

    return $maxYear;
}

var_dump(highestPopulationYear($births, $deaths));

我用 O(n+m) 的内存需求解决了这个问题 [在最坏的情况下,最好的情况下 O(n)]

并且,时间复杂度为 O(n logn)

这里,n & mbirthsdeaths数组的长度。

我不知道 PHP 或 java 脚本。我用 Java 实现了它,逻辑非常简单。但我相信我的想法也可以用这些语言实现。

技术细节:

我使用javaTreeMap结构来存储出生和死亡记录。

TreeMap 插入数据 sorted (key based) 作为 (key, value) 对,这里 key 是年份,value 是出生和死亡的累计总和(死亡为负数)。

我们不需要插入最高出生年份之后发生的死亡值。

一旦 TreeMap 填充了出生和死亡记录,所有累积总和都会更新,并随着时间的推移存储最大人口和年份。

示例输入和输出:1

Births: [1909, 1919, 1904, 1911, 1908, 1908, 1903, 1901, 1914, 1911, 1900, 1919, 1900, 1908, 1906]

Deaths: [1910, 1911, 1912, 1911, 1914, 1914, 1913, 1915, 1914, 1915]

Year counts Births: {1900=2, 1901=1, 1903=1, 1904=1, 1906=1, 1908=3, 1909=1, 1911=2, 1914=1, 1919=2}

Year counts Birth-Deaths combined: {1900=2, 1901=1, 1903=1, 1904=1, 1906=1, 1908=3, 1909=1, 1910=-1, 1911=0, 1912=-1, 1913=-1, 1914=-2, 1915=-2, 1919=2}

Yearwise population: {1900=2, 1901=3, 1903=4, 1904=5, 1906=6, 1908=9, 1909=10, 1910=9, 1911=9, 1912=8, 1913=7, 1914=5, 1915=3, 1919=5}

maxPopulation: 10
yearOfMaxPopulation: 1909

示例输入和输出:2

Births: [1906, 1901, 1911, 1902, 1905, 1911, 1902, 1905, 1910, 1912, 1900, 1900, 1904, 1913, 1904]

Deaths: [1917, 1908, 1918, 1915, 1907, 1907, 1917, 1917, 1912, 1913, 1905, 1914]

Year counts Births: {1900=2, 1901=1, 1902=2, 1904=2, 1905=2, 1906=1, 1910=1, 1911=2, 1912=1, 1913=1}

Year counts Birth-Deaths combined: {1900=2, 1901=1, 1902=2, 1904=2, 1905=1, 1906=1, 1907=-2, 1908=-1, 1910=1, 1911=2, 1912=0, 1913=0}

Yearwise population: {1900=2, 1901=3, 1902=5, 1904=7, 1905=8, 1906=9, 1907=7, 1908=6, 1910=7, 1911=9, 1912=9, 1913=9}

maxPopulation: 9
yearOfMaxPopulation: 1906

此处,在最后一个出生年份 1913 之后发生的死亡 (1914 & later) 根本不计算在内,这样可以避免不必要的计算。

对于总共 10 million 数据(出生和死亡合计)和超过 1000 years range,程序大约需要 3 sec. 完成。

如果与 100 years range 相同大小的数据,则需要 1.3 sec

所有输入都是随机取的。

内存明智的做法是保持 currentPopulationcurrentYear 计算。从对 $births$deaths 数组进行排序开始是一个很好的点,因为冒泡排序不是那么繁重的任务,但允许偷工减料:

<?php

$births = [1997, 1999, 2000];
$deaths = [2000, 2001, 2001];

function highestPopulationYear(array $births, array $deaths): Int {

    // sort takes time, but is neccesary for futher optimizations
    sort($births);
    sort($deaths);

    // first death year is a first year where population might decrase 
    // sorfar max population
    $currentYearComputing = $deaths[0];

    // year before first death has potential of having the biggest population
    $maxY = $currentYearComputing-1;

    // calculating population at the begining of the year of first death, start maxPopulation
    $population = $maxPop = count(array_splice($births, 0, array_search($deaths[0], $births)));

    // instead of every time empty checks: `while(!empty($deaths) || !empty($births))`
    // we can control a target time. It reserves a memory, but this slot is decreased
    // every iteration.
    $iterations = count($deaths) + count($births);

    while($iterations > 0) {
        while(current($births) === $currentYearComputing) {
            $population++;
            $iterations--;
            array_shift($births); // decreasing memory usage
        }

        while(current($deaths) === $currentYearComputing) {
            $population--;
            $iterations--;
            array_shift($deaths); // decreasing memory usage
        }

        if ($population > $maxPop) {
            $maxPop = $population;
            $maxY = $currentYearComputing;
        }

        // In $iterations we have a sum of birth/death events left. Assuming all 
        // are births, if this number added to currentPopulation will never exceed
        // current maxPoint, we can break the loop and save some time at cost of
        // some memory.
        if ($maxPop >= ($population+$iterations)) {
            break;
        }

        $currentYearComputing++;
    }

    return $maxY;
}

echo highestPopulationYear($births, $deaths);

不太热衷于深入 Big O 事情,留给你了。

此外,如果您重新发现 currentYearComputing 每个循环,您可以将循环更改为 if 语句并只留下一个循环。

    while($iterations > 0) {

        $changed = false;

        if(current($births) === $currentYearComputing) {
            // ...
            $changed = array_shift($births); // decreasing memory usage
        }

        if(current($deaths) === $currentYearComputing) {
            // ...
            $changed = array_shift($deaths); // decreasing memory usage
        }

        if ($changed === false) {
            $currentYearComputing++;
            continue;
        }

我对这个解决方案很满意,大O的复杂度是n + m

<?php
function getHighestPopulation($births, $deaths){
    $max = [];
    $currentMax = 0;
    $tmpArray = [];

    foreach($deaths as $key => $death){
        if(!isset($tmpArray[$death])){
            $tmpArray[$death] = 0;    
        }
        $tmpArray[$death]--;
    }
    foreach($births as $k => $birth){
        if(!isset($tmpArray[$birth])){
            $tmpArray[$birth] = 0;
        }
        $tmpArray[$birth]++;
        if($tmpArray[$birth] > $currentMax){
            $max = [$birth];
            $currentMax = $tmpArray[$birth];
        } else if ($tmpArray[$birth] == $currentMax) {
            $max[] = $birth;
        }
    }

    return [$currentMax, $max];
}

$births = [1997, 1997, 1997, 1998, 1999];
$deaths = [1998, 1999];

print_r (getHighestPopulation($births, $deaths));
?>

解决您的问题的最简单明了的方法之一。

$births = [1909, 1919, 1904, 1911, 1908, 1908, 1903, 1901, 1914, 1911, 1900, 1919, 1900, 1908, 1906];
$deaths = [1910, 1911, 1912, 1911, 1914, 1914, 1913, 1915, 1914, 1915];

/* for generating 1 million records

for($i=1;$i<=1000000;$i++) {
    $births[] = rand(1900, 2020);
    $deaths[] = rand(1900, 2020);
}
*/

function highestPopulationYear(Array $births, Array $deaths): Int {
    $start_time = microtime(true); 
    $population = array_count_values($births);
    $deaths = array_count_values($deaths);

    foreach ($deaths as $year => $death) {
        $population[$year] = ($population[$year] ?? 0) - $death;
    }
    ksort($population, SORT_NUMERIC);
    $cumulativeSum = $maxPopulation = $maxYear = 0;
    foreach ($population as $year => &$number) {
        $cumulativeSum += $number;
        if($maxPopulation < $cumulativeSum) {
            $maxPopulation = $cumulativeSum;
            $maxYear = $year;
        }
    }
    print " Execution time of function = ".((microtime(true) - $start_time)*1000)." milliseconds"; 
    return $maxYear;
}

print highestPopulationYear($births, $deaths);

输出

1909

复杂性:

O(m + log(n))