Standardize numeric formats from different regions

Question

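I am doing a large-scale cleanup of a legacy system. I have a SQL Server database table with a TEXT column that stores numeric (including monetary) data alongside text data, often in localized formats or with typos. I need to standardize the data to the US number format.

Some sample data: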

$1,000 - Good!
$1.000 - Bad, should have been $1,000
$1,000.000 - Bad, should have been $1,000,000
$1,000.000.00 - Bad, should have been $1,000,000.00
$1.000.000,00 - Bad, should have been $1,000,000.00
$1,.000 - Bad, should have been $1,000
500.000 - Bad, should have been 500,000
1.325% - Good!

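I have given several examples because I want to illustrate some of the difficulty in finding and correcting the problems. I am assuming that a period followed by 3 digits should be a comma (unless it might be a precise % value rather than a $ amount), but that a period followed by 2 digits is correct. Does anyone have suggestions for cleaning this up in SQL, or perhaps a better out-of-the-box solution?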

Answer 1

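I agree with Sean and Shnugo. This will get you started, but it also highlights why this approach fails so easily: any kind of problem other than the ones in your sample data could break it.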

declare @table table (ID int identity (1,1), c1 varchar(64))
insert into @table
values
('$1,000'), -- Good!
('$1.000'), -- Bad, should have been $1,000
('$1,000.000'), -- Bad, should have been $1,000,000
('$1,000.000.00'), -- Bad, should have been $1,000,000.00
('$1,.000'), -- Bad, should have been $1,000
('500.000'), -- Bad, should have been 500,000
('1.325%'), -- Good!
('1,325%') -- Bad!

select
    *,
    case
        when c1 like '%\%%' escape '\' then replace(c1,',','.') --simply replaces commas with periods for % signed values
        else 
            case                                                --simply replaces periods with commas for non % signed values, and takes into account that ,00 at the end should be .00
                                                                --also handles double commas, once
                when left(right(replace(replace(c1,'.',','),',,',','),3),1) = ','
                then stuff(replace(replace(c1,'.',','),',,',','),len(replace(replace(c1,'.',','),',,',',')) - 2,1,'.')
                else replace(replace(c1,'.',','),',,',',')      
            end 
    end
from @table

Answer 2

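Expanding on my earlier comment:

For a one-time conversion of existing data to a new format, you can tackle the problem in steps:

Add a nullable column with an appropriate numeric data type to the table, e.g. Decimal(16,4), initialized to NULL.

Depending on the semantics of the existing data, additional columns may be useful. It may make sense to capture the unit, e.g. USD or percent, and the scale, i.e. the number of digits to the right of the decimal point.

Start converting the data one step at a time. Short, simple patterns can be knocked out first, e.g.: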
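As a rough sketch of the column setup described above: the table and column names (MyTable, DecimalValue, Unit, Scale) are borrowed from the update statements in this answer, while the types chosen for Unit and Scale are only assumptions and should be adjusted to your data.

```sql
-- Sketch only: Unit and Scale types are assumptions, not part of the original answer.
alter table MyTable add
  DecimalValue Decimal(16,4) NULL,  -- converted numeric value, NULL until a pattern matches
  Unit varchar(10) NULL,            -- e.g. 'USD' or 'percent'
  Scale tinyint NULL;               -- number of digits to the right of the decimal point
```

Leaving all three columns nullable means every row starts out as "not yet converted", which is what the later statements test with DecimalValue is NULL.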

-- Process "$n.nn".
-- The "$" is stripped first because Cast to Decimal rejects currency symbols.
update MyTable
  set DecimalValue = Cast( Replace( TextValue, '$', '' ) as Decimal(16,4) ),
    Unit = 'USD', Scale = 2 -- If desired.
  where DecimalValue is NULL and TextValue like '$[0-9].[0-9][0-9]';

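Tip: Save the statements in a stored procedure or text file so that you can flush the converted data and start over as you accumulate wisdom.

More complex data will require additional conversion logic, e.g.: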
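Flushing the converted data before a fresh run could look like the following minimal sketch, assuming the optional Unit and Scale columns were added alongside DecimalValue:

```sql
-- Reset all converted values so the saved conversion script can be rerun from scratch.
-- Assumes the helper columns DecimalValue, Unit and Scale exist on MyTable.
update MyTable
  set DecimalValue = NULL, Unit = NULL, Scale = NULL;
```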

-- Process "$n.nnn,nn".
-- Strip "$" and the "." thousands separator, then turn the decimal "," into ".".
update MyTable
  set DecimalValue = Cast( Replace( Replace( Replace( TextValue, '$', '' ), '.', '' ), ',', '.' ) as Decimal(16,4) )
  where DecimalValue is NULL and TextValue like '$[0-9].[0-9][0-9][0-9],[0-9][0-9]';

Patterns can be combined in a single statement where appropriate:

-- Process ".nn%", "n.nn%" and "nn.nn%".
update MyTable
  set DecimalValue = Cast( Replace( TextValue, '%', '' ) as Decimal(16,4) ),
    Unit = 'percent', Scale = 2 -- If desired.
  where DecimalValue is NULL and (
    TextValue like '.[0-9][0-9]*%' escape '*' or
    TextValue like '[0-9].[0-9][0-9]*%' escape '*' or
    TextValue like '[0-9][0-9].[0-9][0-9]*%' escape '*' );

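As the conversion progresses, you can review the remaining text values, where DecimalValue is NULL, to see which patterns make sense to convert, which need manual handling, and which data simply cannot be salvaged.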
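One possible way to review what is left is to collapse every digit to a single placeholder and count the resulting "shapes". This is only a sketch: it assumes the MyTable, TextValue and DecimalValue names used above, casts TextValue to varchar(max) because Replace does not accept a TEXT column directly, and uses nested Replace calls so that it should also run on SQL Server 2008 R2.

```sql
-- Group unconverted values by shape: every digit 0-8 becomes '9',
-- so '$1.234,56' and '$7.000,10' both show up as '$9.999,99'.
select Shape, count(*) as Occurrences
from (
  select
    replace(replace(replace(replace(replace(
    replace(replace(replace(replace(
      cast( TextValue as varchar(max) ),
      '0','9'),'1','9'),'2','9'),'3','9'),'4','9'),
      '5','9'),'6','9'),'7','9'),'8','9') as Shape
  from MyTable
  where DecimalValue is NULL
) as Remaining
group by Shape
order by Occurrences desc;
```

The most frequent shapes are natural candidates for the next pattern-specific update, while rare ones may be quicker to fix by hand.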

Tags: tsql, sql-server, validation, sql-server-2008-r2