Normalize State Data

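I have a legacy form that collects state data from users. It collects the data in a free-text field, so users can enter pretty much anything they want. Some example inputs: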

'plattsburgh, new york'
'California'
'Central Valley,Ca/ Ptld, Oregon'
'Bay area,CA'
'new port richey florida'
'HAMPTON ROADS AREA'
'DC Metro area'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,'

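What I want to do is normalize this data programmatically with PHP, extracting only the states and dropping everything else. Given the examples above, I'd like to convert them like this: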

'plattsburgh, new york' => 'NY'
'California' => 'CA'
'Central Valley,Ca/ Ptld, Oregon' => 'OR'
'Bay area,CA' => 'CA'
'new port richey florida' => 'FL'
'HAMPTON ROADS AREA' => ''
'DC Metro area' => 'DC'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,' => 'PA,CO,NY,MD'

Is there a clean, concise way to accomplish this?

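What you're doing is taking user-entered data and trying to format it reliably. As my example will show, this is usually a bad idea, because user-entered data can be full of inconsistencies. I'd suggest improving your data collection instead: have users fill in a well-laid-out form with a separate field for each part of the address, or have them pick a state from a select list. For now, though, let's look at how to format the data you already have as accurately as possible.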
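As a rough illustration of the select-list suggestion, here is a minimal sketch that renders a dropdown from a name => a2 array. The function name renderStateSelect and the field name "state" are made up for this example, and the two-entry array stands in for a full list of states.

```php
<?php
// Hypothetical helper (not from the original answer): builds a <select>
// element from an array of UPPER-CASE state names => a2 codes.
function renderStateSelect(array $nameToA2Map, $fieldName = 'state')
{
    $html = '<select name="' . htmlspecialchars($fieldName) . '">' . "\n";
    $html .= '    <option value="">Select a state</option>' . "\n";
    foreach ($nameToA2Map as $name => $a2) {
        // Turn "NEW YORK" into "New York" for display.
        $label = ucwords(strtolower($name));
        $html .= '    <option value="' . htmlspecialchars($a2) . '">'
               . htmlspecialchars($label) . "</option>\n";
    }
    return $html . "</select>\n";
}

// Stand-in data; a real form would pass the full list of states.
echo renderStateSelect(array('CALIFORNIA' => 'CA', 'NEW YORK' => 'NY'));
```

With a list like this the form can only submit one of the known a2 codes, although the server should still check the submitted value against the same array.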

The simplest way to do what you want is to generate "map" data and use the maps to format your data.

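For example, I've created two maps, shown below. The first maps "name" to "a2": the array uses the state names (in upper case) as keys and the two-letter a2 codes as values.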

$nameToA2Map = array (
    'ALABAMA' => 'AL',
    'ALASKA' => 'AK',
    'AMERICAN SAMOA' => 'AS',
    'ARIZONA' => 'AZ',
    'ARKANSAS' => 'AR',
    'CALIFORNIA' => 'CA',
    'COLORADO' => 'CO',
    'CONNECTICUT' => 'CT',
    'DELAWARE' => 'DE',
    'DISTRICT OF COLUMBIA' => 'DC',
    'FEDERATED STATES OF MICRONESIA' => 'FM',
    'FLORIDA' => 'FL',
    'GEORGIA' => 'GA',
    'GUAM' => 'GU',
    'HAWAII' => 'HI',
    'IDAHO' => 'ID',
    'ILLINOIS' => 'IL',
    'INDIANA' => 'IN',
    'IOWA' => 'IA',
    'KANSAS' => 'KS',
    'KENTUCKY' => 'KY',
    'LOUISIANA' => 'LA',
    'MAINE' => 'ME',
    'MARSHALL ISLANDS' => 'MH',
    'MARYLAND' => 'MD',
    'MASSACHUSETTS' => 'MA',
    'MICHIGAN' => 'MI',
    'MINNESOTA' => 'MN',
    'MISSISSIPPI' => 'MS',
    'MISSOURI' => 'MO',
    'MONTANA' => 'MT',
    'NEBRASKA' => 'NE',
    'NEVADA' => 'NV',
    'NEW HAMPSHIRE' => 'NH',
    'NEW JERSEY' => 'NJ',
    'NEW MEXICO' => 'NM',
    'NEW YORK' => 'NY',
    'NORTH CAROLINA' => 'NC',
    'NORTH DAKOTA' => 'ND',
    'NORTHERN MARIANA ISLANDS' => 'MP',
    'OHIO' => 'OH',
    'OKLAHOMA' => 'OK',
    'OREGON' => 'OR',
    'PALAU' => 'PW',
    'PENNSYLVANIA' => 'PA',
    'PUERTO RICO' => 'PR',
    'RHODE ISLAND' => 'RI',
    'SOUTH CAROLINA' => 'SC',
    'SOUTH DAKOTA' => 'SD',
    'TENNESSEE' => 'TN',
    'TEXAS' => 'TX',
    'UTAH' => 'UT',
    'VERMONT' => 'VT',
    'VIRGIN ISLANDS' => 'VI',
    'VIRGINIA' => 'VA',
    'WASHINGTON' => 'WA',
    'WEST VIRGINIA' => 'WV',
    'WISCONSIN' => 'WI',
    'WYOMING' => 'WY',
);

As a fail-safe, we'll also build an array of the state a2 values, because some fields in your data don't contain a state name:

$a2Map = array (
    0 => 'AL',
    1 => 'AK',
    2 => 'AS',
    3 => 'AZ',
    4 => 'AR',
    5 => 'CA',
    6 => 'CO',
    7 => 'CT',
    8 => 'DE',
    9 => 'DC',
    10 => 'FM',
    11 => 'FL',
    12 => 'GA',
    13 => 'GU',
    14 => 'HI',
    15 => 'ID',
    16 => 'IL',
    17 => 'IN',
    18 => 'IA',
    19 => 'KS',
    20 => 'KY',
    21 => 'LA',
    22 => 'ME',
    23 => 'MH',
    24 => 'MD',
    25 => 'MA',
    26 => 'MI',
    27 => 'MN',
    28 => 'MS',
    29 => 'MO',
    30 => 'MT',
    31 => 'NE',
    32 => 'NV',
    33 => 'NH',
    34 => 'NJ',
    35 => 'NM',
    36 => 'NY',
    37 => 'NC',
    38 => 'ND',
    39 => 'MP',
    40 => 'OH',
    41 => 'OK',
    42 => 'OR',
    43 => 'PW',
    44 => 'PA',
    45 => 'PR',
    46 => 'RI',
    47 => 'SC',
    48 => 'SD',
    49 => 'TN',
    50 => 'TX',
    51 => 'UT',
    52 => 'VT',
    53 => 'VI',
    54 => 'VA',
    55 => 'WA',
    56 => 'WV',
    57 => 'WI',
    58 => 'WY',
);
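Since the second array holds exactly the values of the first, it does not have to be typed out by hand. A minimal sketch, using a shortened copy of $nameToA2Map so that it runs on its own:

```php
<?php
// Shortened stand-in for the full $nameToA2Map shown above.
$nameToA2Map = array(
    'ALABAMA' => 'AL',
    'ALASKA'  => 'AK',
    'ARIZONA' => 'AZ',
);

// array_values() drops the string keys and re-indexes from 0,
// which gives the same shape as the hand-written $a2Map.
$a2Map = array_values($nameToA2Map);

print_r($a2Map);
```

Deriving the list this way keeps the two arrays in sync if an entry is ever added or removed.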

We can then use these arrays to format your data:

//we'll format the states into this array
$statesFormatted = array();

//if your states data is a string like displayed in your post:
$states = "
'plattsburgh, new york'
'California'
'Central Valley,Ca/ Ptld, Oregon'
'Bay area,CA'
'new port richey florida'
'HAMPTON ROADS AREA'
'DC Metro area'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,'
";
//read the string into an array, by splitting up the new lines:
$stateFields = explode("\n",$states);

//if your states data is already an array, use this instead of the string version above:
$stateFields = array(
    'plattsburgh, new york',
    'California',
    'Central Valley,Ca/ Ptld, Oregon',
    'Bay area,CA',
    'new port richey florida',
    'HAMPTON ROADS AREA',
    'DC Metro area',
    'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,',
);


//go through each state field
foreach($stateFields as $field) {
    //remove the single quotes?
    //$field = str_replace("'","",$field);

    //first, we look to try and match a name to a2
    foreach($nameToA2Map as $name => $a2) {
        //if the name can be found in the field
        if(strpos(strtoupper($field),$name) !== false) {
            //match!!
            $statesFormatted[$field] = $a2;
            //we break here, as we don't need to search anymore
            break;
        }
    }

    //let's check whether we found a match
    if(!isset($statesFormatted[$field])) {
        //we didn't find a match, so let's try to find an a2 value in the field
        foreach($a2Map as $a2) {
            if(preg_match("/[\W]".$a2."[\W]/", $field) >= 1) {
                //match!!
                $statesFormatted[$field] = $a2;
                //we break here, as we don't need to search anymore
                break;
            }
        }
    }

    //if we still can't find a match, then we do some sort of fail-safe here...
    if(!isset($statesFormatted[$field])) {
        $statesFormatted[$field] = "COULD NOT MATCH!";
    }
}

echo "<pre>";
print_r($statesFormatted);
echo "</pre>";

The above will output:

Array
(
    [plattsburgh, new york] => NY
    [California] => CA
    [Central Valley,Ca/ Ptld, Oregon] => OR
    [Bay area,CA] => COULD NOT MATCH!
    [new port richey florida] => FL
    [HAMPTON ROADS AREA] => COULD NOT MATCH!
    [DC Metro area] => COULD NOT MATCH!
    [Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,] => CO
)

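As you may have noticed, in the last few lines of the loop I do a final check for fields that could not be matched and give them the value "COULD NOT MATCH!". In those fields the user-entered data is too inconsistent to match easily. You can either treat those fields as inconsistent data or add further entries to the $nameToA2Map array. I wouldn't recommend the latter, though.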
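Two details of the loop explain part of that output. It stops at the first name found in map order, so the multi-state field collapses to CO. The a2 pattern also requires a non-word character on both sides of the code, so a code at the very start or end of a field ('DC Metro area', 'Bay area,CA') is missed. The sketch below is one possible variation, not the method above: it uses \b word boundaries and collects every match in order of appearance. The function name extractStates is hypothetical, and the map is a shortened stand-in for the full $nameToA2Map. Matching a2 codes only in upper case is a heuristic to avoid treating words like "in" or "or" as states, and it can still misfire on all-caps input.

```php
<?php
// Shortened stand-in for the full name => a2 map, so the sketch runs on its own.
$nameToA2Map = array(
    'ARKANSAS' => 'AR',
    'CALIFORNIA' => 'CA',
    'COLORADO' => 'CO',
    'DISTRICT OF COLUMBIA' => 'DC',
    'FLORIDA' => 'FL',
    'KANSAS' => 'KS',
    'MARYLAND' => 'MD',
    'NEW YORK' => 'NY',
    'OREGON' => 'OR',
    'PENNSYLVANIA' => 'PA',
    'VIRGINIA' => 'VA',
    'WEST VIRGINIA' => 'WV',
);

// Hypothetical helper (not from the original answer): returns every state
// found in $field as a comma-separated list of a2 codes, in order of appearance.
function extractStates($field, array $nameToA2Map)
{
    $quote = function ($token) {
        return preg_quote($token, '/');
    };
    $names = implode('|', array_map($quote, array_keys($nameToA2Map)));
    $codes = implode('|', array_map($quote, array_values($nameToA2Map)));

    // Group 1: full names, case-insensitive. Group 2: a2 codes, upper case only.
    // \b keeps KANSAS from matching inside ARKANSAS and lets a code match
    // at the very start or end of the field.
    $pattern = '/\b(?:(?i:(' . $names . '))|(' . $codes . '))\b/';

    $found = array();
    preg_match_all($pattern, $field, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (isset($m[2]) && $m[2] !== '') {
            // An a2 code was matched directly.
            $found[] = $m[2];
        } else {
            // A full name was matched; look up its code.
            $found[] = $nameToA2Map[strtoupper($m[1])];
        }
    }

    // Drop repeats while keeping first-seen order.
    return implode(',', array_unique($found));
}

echo extractStates('Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,', $nameToA2Map), "\n";
echo extractStates('Bay area,CA', $nameToA2Map), "\n";
echo extractStates('DC Metro area', $nameToA2Map), "\n";
```

On these three inputs the sketch prints PA,CO,NY,MD, then CA, then DC. A field with no recognizable state, such as 'HAMPTON ROADS AREA', yields an empty string.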

Hope that helps.