MySQL Ignoring Outliers
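I have to provide some data to a colleague, but I am having trouble analysing it in MySQL.
I have one table called 'payments'. Each payment has the following columns:
- Client (our client, e.g. a bank)
- Amount_gbp (the GBP equivalent of the transaction amount)
- Currency
- Origin_country
- Client_type (individual or company)
I have written very simple queries such as: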
SELECT
AVG(amount_GBP),
COUNT(client) AS '#Of Results'
FROM payments
WHERE client_type = 'individual'
AND amount_gbp IS NOT NULL
AND currency = 'TRY'
AND country_origin = 'GB'
AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
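But what I really need to do is exclude outliers from the average, and/or include only results within a certain number of standard deviations of the mean.
For example, ignore the top/bottom 10 results, or the top/bottom 2% of results, etc.
And/or ignore any results that fall more than 2 standard deviations from the mean.
Can anyone help?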
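--- EDITED ANSWER -- TRY IT AND LET ME KNOW ---
Your best bet is to create a temporary table holding the avg and std_dev values and compare against them. Let me know if this is not feasible: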
CREATE TEMPORARY TABLE payment_stats AS
SELECT
    AVG(amount_gbp) AS avg_gbp,
    STDDEV(amount_gbp) AS std_gbp,
    -- max_g: smallest of the top N amounts; rows at or above it are ignored later
    (SELECT MIN(srt.amount_gbp)
     FROM (SELECT amount_gbp
           FROM payments
           <... repeat where no p. ...>
           ORDER BY amount_gbp DESC
           LIMIT <top_numbers to ignore>
          ) srt
    ) AS max_g,
    -- min_g: largest of the bottom N amounts; rows at or below it are ignored later
    (SELECT MAX(srt.amount_gbp)
     FROM (SELECT amount_gbp
           FROM payments
           <... repeat where no p. ...>
           ORDER BY amount_gbp ASC
           LIMIT <bottom_numbers to ignore>
          ) srt
    ) AS min_g
FROM payments
WHERE client_type = 'individual'
  AND amount_gbp IS NOT NULL
  AND currency = 'TRY'
  AND country_origin = 'GB'
  AND date_time BETWEEN '2017/1/1' AND '2017/9/1';
Then you can compare against the temp table:
SELECT
    AVG(p.amount_gbp) AS avg_gbp,
    COUNT(p.client) AS '#Of Results'
FROM payments p
-- payment_stats holds a single row; it is joined once because MySQL can reject
-- a TEMPORARY table that is referenced several times in one query
-- ("Can't reopen table")
CROSS JOIN payment_stats s
WHERE p.amount_gbp >= s.avg_gbp - s.std_gbp * 2
  AND p.amount_gbp <= s.avg_gbp + s.std_gbp * 2
  AND p.amount_gbp > s.min_g
  AND p.amount_gbp < s.max_g
  AND p.client_type = 'individual'
  AND p.amount_gbp IS NOT NULL
  AND p.currency = 'TRY'
  AND p.country_origin = 'GB'
  AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';
-- later
DROP TEMPORARY TABLE payment_stats;
Note that I had to repeat the WHERE conditions. Also change the *2 to whatever <factor> you need!
Still, phew!
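If the server is MySQL 8.0 or later (an assumption; the question does not state a version), the repeated WHERE conditions can be avoided with common table expressions. The following is only a sketch of the two-standard-deviation filter: the CTE names filtered and stats, and the output aliases trimmed_avg_gbp and n_results, are made up for illustration.

```sql
-- Sketch only; WITH (common table expressions) needs MySQL 8.0+.
WITH filtered AS (
    -- The WHERE conditions are written once, here.
    SELECT client, amount_gbp
    FROM payments
    WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
),
stats AS (
    -- One row: mean and standard deviation of the filtered rows.
    -- STDDEV() is the population standard deviation in MySQL;
    -- STDDEV_SAMP() gives the sample version instead.
    SELECT AVG(amount_gbp)    AS avg_gbp,
           STDDEV(amount_gbp) AS std_gbp
    FROM filtered
)
SELECT AVG(f.amount_gbp) AS trimmed_avg_gbp,
       COUNT(f.client)   AS n_results
FROM filtered f
CROSS JOIN stats s
-- Keep only rows within 2 standard deviations of the mean; change 2 as needed.
WHERE f.amount_gbp BETWEEN s.avg_gbp - 2 * s.std_gbp
                       AND s.avg_gbp + 2 * s.std_gbp;
```

Nothing has to be dropped afterwards, since no temporary table is created.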
In the SELECT against payment_stats, each comparison checks a different statistic.
Let me know if this works better.
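The question also asks about ignoring a percentage of rows (for example the top and bottom 2%) rather than a fixed count. One possible approach, again assuming MySQL 8.0 or later for window functions, is to rank each row with PERCENT_RANK() and filter on that rank. This is a sketch, not a tested answer; the names ranked, pct, trimmed_avg_gbp and n_results are illustrative.

```sql
-- Sketch only; window functions need MySQL 8.0+.
WITH ranked AS (
    SELECT client,
           amount_gbp,
           -- 0 for the smallest amount, 1 for the largest; ties share a value
           PERCENT_RANK() OVER (ORDER BY amount_gbp) AS pct
    FROM payments
    WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
)
SELECT AVG(amount_gbp) AS trimmed_avg_gbp,
       COUNT(client)   AS n_results
FROM ranked
-- Drop roughly the lowest 2% and the highest 2% of rows.
WHERE pct >= 0.02
  AND pct <= 0.98;
```

Because tied amounts share one rank, the share of rows removed is only approximately 2% at each end, and with very few matching rows the filter can remove everything.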