Normalize State Data
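I have a legacy form that collects state data from users. It collects the data in a text field, so users can enter essentially anything they like. Some sample inputs: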
'plattsburgh, new york'
'California'
'Central Valley,Ca/ Ptld, Oregon'
'Bay area,CA'
'new port richey florida'
'HAMPTON ROADS AREA'
'DC Metro area'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,'
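What I want to do is normalize this data programmatically with PHP, extracting only the state and dropping everything else. Given the examples above, I'd like to convert them like this: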
'plattsburgh, new york' => 'NY'
'California' => 'CA'
'Central Valley,Ca/ Ptld, Oregon' => 'OR'
'Bay area,CA' => 'CA'
'new port richey florida' => 'FL'
'HAMPTON ROADS AREA' => ''
'DC Metro area' => 'DC'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,' => 'PA,CO,NY,MD'
Is there a clean, straightforward way to accomplish this?
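What you are doing is taking user-entered data and trying to format it reliably. As my example will show, this is usually a bad idea, because user-entered data can be full of inconsistencies. I would recommend improving your data collection instead (have users fill in a well-laid-out form with a separate field for each part of the address, or have them select a state from a list of states). For now, though, let's look at how to format your existing data as accurately as possible.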
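To illustrate the "select from a list" suggestion, here is a minimal sketch. The helper names `renderStateSelect` and `isValidStateCode` and the field name `state` are made up for this example, and the map is shortened; it assumes a name-to-code array shaped like the one used in the rest of this answer.

```php
<?php
// Hypothetical helper (not part of the original form): builds a <select>
// from a name => two-letter-code map.
function renderStateSelect(array $nameToA2Map, string $fieldName = 'state'): string
{
    $html = '<select name="' . htmlspecialchars($fieldName, ENT_QUOTES) . '">' . "\n";
    $html .= '  <option value="">Choose a state</option>' . "\n";
    foreach ($nameToA2Map as $name => $a2) {
        // 'NEW YORK' -> 'New York' (simple title-casing, good enough for a label)
        $label = ucwords(strtolower($name));
        $html .= '  <option value="' . htmlspecialchars($a2, ENT_QUOTES) . '">'
               . htmlspecialchars($label, ENT_QUOTES) . "</option>\n";
    }
    return $html . "</select>\n";
}

// Hypothetical server-side check: accept only codes that exist in the map,
// since a submitted value can still be tampered with.
function isValidStateCode(string $code, array $nameToA2Map): bool
{
    return in_array($code, $nameToA2Map, true);
}

// Shortened map for illustration only.
$sampleMap = array('CALIFORNIA' => 'CA', 'NEW YORK' => 'NY');
echo renderStateSelect($sampleMap);
```

With a field like this, the stored value is already a two-letter code and no normalization is needed afterwards.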
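The simplest way to achieve what you want is to build some "map" data and use those maps to format your data.
For example, I have created two maps, shown below. The first is a "name" to "a2" map. The array uses the state names (in upper case) as keys and the two-letter state codes (a2) as values.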
$nameToA2Map = array (
'ALABAMA' => 'AL',
'ALASKA' => 'AK',
'AMERICAN SAMOA' => 'AS',
'ARIZONA' => 'AZ',
'ARKANSAS' => 'AR',
'CALIFORNIA' => 'CA',
'COLORADO' => 'CO',
'CONNECTICUT' => 'CT',
'DELAWARE' => 'DE',
'DISTRICT OF COLUMBIA' => 'DC',
'FEDERATED STATES OF MICRONESIA' => 'FM',
'FLORIDA' => 'FL',
'GEORGIA' => 'GA',
'GUAM' => 'GU',
'HAWAII' => 'HI',
'IDAHO' => 'ID',
'ILLINOIS' => 'IL',
'INDIANA' => 'IN',
'IOWA' => 'IA',
'KANSAS' => 'KS',
'KENTUCKY' => 'KY',
'LOUISIANA' => 'LA',
'MAINE' => 'ME',
'MARSHALL ISLANDS' => 'MH',
'MARYLAND' => 'MD',
'MASSACHUSETTS' => 'MA',
'MICHIGAN' => 'MI',
'MINNESOTA' => 'MN',
'MISSISSIPPI' => 'MS',
'MISSOURI' => 'MO',
'MONTANA' => 'MT',
'NEBRASKA' => 'NE',
'NEVADA' => 'NV',
'NEW HAMPSHIRE' => 'NH',
'NEW JERSEY' => 'NJ',
'NEW MEXICO' => 'NM',
'NEW YORK' => 'NY',
'NORTH CAROLINA' => 'NC',
'NORTH DAKOTA' => 'ND',
'NORTHERN MARIANA ISLANDS' => 'MP',
'OHIO' => 'OH',
'OKLAHOMA' => 'OK',
'OREGON' => 'OR',
'PALAU' => 'PW',
'PENNSYLVANIA' => 'PA',
'PUERTO RICO' => 'PR',
'RHODE ISLAND' => 'RI',
'SOUTH CAROLINA' => 'SC',
'SOUTH DAKOTA' => 'SD',
'TENNESSEE' => 'TN',
'TEXAS' => 'TX',
'UTAH' => 'UT',
'VERMONT' => 'VT',
'VIRGIN ISLANDS' => 'VI',
'VIRGINIA' => 'VA',
'WASHINGTON' => 'WA',
'WEST VIRGINIA' => 'WV',
'WISCONSIN' => 'WI',
'WYOMING' => 'WY',
);
As a fail-safe, we'll also build an array of just the state a2 values, because some fields in your data do not contain a state name:
$a2Map = array (
0 => 'AL',
1 => 'AK',
2 => 'AS',
3 => 'AZ',
4 => 'AR',
5 => 'CA',
6 => 'CO',
7 => 'CT',
8 => 'DE',
9 => 'DC',
10 => 'FM',
11 => 'FL',
12 => 'GA',
13 => 'GU',
14 => 'HI',
15 => 'ID',
16 => 'IL',
17 => 'IN',
18 => 'IA',
19 => 'KS',
20 => 'KY',
21 => 'LA',
22 => 'ME',
23 => 'MH',
24 => 'MD',
25 => 'MA',
26 => 'MI',
27 => 'MN',
28 => 'MS',
29 => 'MO',
30 => 'MT',
31 => 'NE',
32 => 'NV',
33 => 'NH',
34 => 'NJ',
35 => 'NM',
36 => 'NY',
37 => 'NC',
38 => 'ND',
39 => 'MP',
40 => 'OH',
41 => 'OK',
42 => 'OR',
43 => 'PW',
44 => 'PA',
45 => 'PR',
46 => 'RI',
47 => 'SC',
48 => 'SD',
49 => 'TN',
50 => 'TX',
51 => 'UT',
52 => 'VT',
53 => 'VI',
54 => 'VA',
55 => 'WA',
56 => 'WV',
57 => 'WI',
58 => 'WY',
);
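As a side note, the second array does not have to be maintained by hand: it holds the same codes, in the same order, as the values of the first map. A small sketch, using a shortened map purely for illustration:

```php
<?php
// Shortened map for illustration; the full map is assumed to be the one above.
$nameToA2Map = array(
    'CALIFORNIA' => 'CA',
    'NEW YORK'   => 'NY',
    'OREGON'     => 'OR',
);

// The list of codes is simply the values of the name map, re-indexed from 0.
$a2Map = array_values($nameToA2Map);

print_r($a2Map);
```

Deriving one array from the other means a newly added state only has to be entered once.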
We can then use these arrays to format your data:
//we'll format the states into this array
$statesFormatted = array();

//if your states data is a string like displayed in your post:
$states = "
'plattsburgh, new york'
'California'
'Central Valley,Ca/ Ptld, Oregon'
'Bay area,CA'
'new port richey florida'
'HAMPTON ROADS AREA'
'DC Metro area'
'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,'
";
//read the string into an array, by splitting up the new lines:
$stateFields = explode("\n",$states);

//if your states data is an array:
$stateFields = array(
    'plattsburgh, new york',
    'California',
    'Central Valley,Ca/ Ptld, Oregon',
    'Bay area,CA',
    'new port richey florida',
    'HAMPTON ROADS AREA',
    'DC Metro area',
    'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,',
);

//go through each state field
foreach($stateFields as $field) {
    //remove the single quotes?
    //$field = str_replace("'","",$field);

    //first, we try to match a state name and map it to its a2 code
    foreach($nameToA2Map as $name => $a2) {
        //if the name can be found in the field
        if(strpos(strtoupper($field),$name) !== false) {
            //match!!
            $statesFormatted[$field] = $a2;
            //we break here, as we don't need to search anymore
            break;
        }
    }

    //let's check that we found a match
    if(!isset($statesFormatted[$field])) {
        //we didn't find a match, so let's try to find an a2 value in the field
        foreach($a2Map as $a2) {
            if(preg_match("/[\W]".$a2."[\W]/", $field) >= 1) {
                //match!!
                $statesFormatted[$field] = $a2;
                //we break here, as we don't need to search anymore
                break;
            }
        }
    }

    //if we still can't find a match, then we do some sort of fail-safe here...
    if(!isset($statesFormatted[$field])) {
        $statesFormatted[$field] = "COULD NOT MATCH!";
    }
}

echo "<pre>";
print_r($statesFormatted);
echo "</pre>";
The above will output:
Array
(
[plattsburgh, new york] => NY
[California] => CA
[Central Valley,Ca/ Ptld, Oregon] => OR
[Bay area,CA] => COULD NOT MATCH!
[new port richey florida] => FL
[HAMPTON ROADS AREA] => COULD NOT MATCH!
[DC Metro area] => COULD NOT MATCH!
[Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,] => CO
)
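One reason 'Bay area,CA' and 'DC Metro area' fail is the fallback pattern itself: `[\W]` requires an actual non-word character on both sides of the code, so a code at the very start or end of the field can never match. A possible variation, offered as a sketch rather than a tested fix, is to use the word-boundary assertion `\b`. The helper name `containsCode` is made up for this example.

```php
<?php
// Hypothetical helper: true if $a2 appears in $field as a standalone token.
function containsCode(string $field, string $a2): bool
{
    // \b also matches at the start and end of the string, unlike [\W],
    // which needs a real character on each side of the code.
    return preg_match('/\b' . preg_quote($a2, '/') . '\b/', $field) === 1;
}

var_dump(containsCode('Bay area,CA', 'CA'));        // bool(true)
var_dump(containsCode('DC Metro area', 'DC'));      // bool(true)
var_dump(containsCode('HAMPTON ROADS AREA', 'AR')); // bool(false)
```

The match is deliberately case-sensitive. A case-insensitive version would also pick up ordinary words such as "in", "or", "me" and "hi", which are all valid state codes.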
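If you look at the last few lines of the loop, you'll see I do a final check for fields that could not be matched and give them the value "COULD NOT MATCH!". In those fields the user-entered data is too inconsistent to match easily. You can either treat these fields as inconsistent data, or add further entries to the $nameToA2Map array. I wouldn't recommend that, though.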
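The loop above stops at the first name found in map order, which is why the last field comes out as CO rather than the 'PA,CO,NY,MD' asked for in the question. If every state in a field is wanted, in the order it appears, something along these lines might work. This is a sketch under assumptions: `extractStates` is a made-up name, the map is shortened for illustration, and two-letter codes are only accepted in upper case.

```php
<?php
// Hypothetical helper: returns every state code found in $field,
// in order of appearance, without duplicates.
function extractStates(string $field, array $nameToA2Map): array
{
    $found = array(); // byte offset in $field => state code

    // Full names, case-insensitive. \b keeps 'KANSAS' from matching inside
    // 'ARKANSAS'; because the scan runs left to right, 'WEST VIRGINIA' is
    // matched at its 'W' before 'VIRGINIA' alone could match.
    $quoted = array_map(function ($name) {
        return preg_quote($name, '/');
    }, array_keys($nameToA2Map));
    $namePattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';
    if (preg_match_all($namePattern, $field, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $match) { // each $match is [text, offset]
            $found[$match[1]] = $nameToA2Map[strtoupper($match[0])];
        }
    }

    // Upper-case two-letter codes that stand alone as tokens.
    $codePattern = '/\b(?:' . implode('|', array_values($nameToA2Map)) . ')\b/';
    if (preg_match_all($codePattern, $field, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m[0] as $match) {
            $found[$match[1]] = $match[0];
        }
    }

    ksort($found); // order by position in the field
    return array_values(array_unique($found));
}

// Shortened map for illustration; the full map is assumed to be the one above.
$sampleMap = array(
    'CALIFORNIA' => 'CA', 'COLORADO' => 'CO', 'DISTRICT OF COLUMBIA' => 'DC',
    'FLORIDA' => 'FL', 'MARYLAND' => 'MD', 'NEW YORK' => 'NY',
    'OREGON' => 'OR', 'PENNSYLVANIA' => 'PA',
);

$field = 'Pennsylvania, Colorado, New York, Maryland Federal Facilities, Military Facilities,';
echo implode(',', extractStates($field, $sampleMap)), "\n"; // PA,CO,NY,MD
echo implode(',', extractStates('Bay area,CA', $sampleMap)), "\n"; // CA
```

Fields with no recognizable state, such as 'HAMPTON ROADS AREA', come back as an empty array. Inputs like "Washington, DC" would still yield both WA and DC with the full map, so some manual review of the results is probably unavoidable.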
Hope that helps.