Find and replace URLs in postfix files - Linux/Ubuntu

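I want to monitor a specific folder. Every new file in this folder should be scanned for URLs. If a URL's domain is not on a defined whitelist, that URL should be edited.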

Example:

blabla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html

Whitelist:

http://www.white.com

Result:

blabla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html

What I have tried so far is iwatch with this XML config:

<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
  <guard email="root@localhost" name="IWatch"/>
  <watchlist>
    <title>URL_Filter</title>
    <contactpoint email="admin@test.com" name="Administrator"/>
    <path type="single" syslog="on" alert="off" events="create" exec="sed -i 's/http/httx/g' %f">/var/test</path>
  </watchlist>
</config>

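So with iwatch I can watch for new files in the folder "/var/test", and with the sed command I can replace every "http" with "httx". But I don't know how to bring in a whitelist so that certain URLs are not replaced...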

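--- EDIT --- Additional information: I want to edit all incoming Postfix mails so that they contain no clickable links, except for links to a few whitelisted domains. The reason is to prevent phishing mails. A sample mail: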

Return-Path: <example@gmail.com>
X-Original-To: example@test.de
Delivered-To: example@test.de
Received: from mail-lf0-x236.google.com (mail-lf0-x236.google.com [IPv6:2a00:1450:4010:c07::236])
        by xxxxxxx.hosteurope.de (Postfix) with ESMTPS id D255223CB59
        for <example@test.de>; Mon, 11 Apr 2016 14:44:10 +0200 (CEST)
Received: by mail-lf0-x236.google.com with SMTP id c126so154788483lfb.2
        for <example@test.de>; Mon, 11 Apr 2016 05:39:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20120113;
        h=mime-version:date:message-id:subject:from:to;
        bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
        b=ZS3Uo/cpVGNw3k38Js2+/DxVda0y2136oy4D4hsR0G25x2UjhyVU/yUcPl6qEdxt8i
         CQXZHQbaf8pzCdDaSq4VL9RC/sIgZy3PQzj6Cyrp3WTi6SMmQ65NwNBWLVGnpPcuzNW1
         IGC5N3rjj96ndYUAxia/tTcBX7ajS3Tw9Mc8yIaO13hSXMUCrTDIFZNzHR1ib7tLDpmX
         6EVyFhquhIfJVOhcuPgWUUxHly/FmZ++ucoHR0Yozj+dc1GJ6/ZYzUAPdGICelDY7ieG
         nvA7KH6+v6/zoWlbfkO9BmGzAPs6M4LGHilOjpMf/09Z2oMiV/WRDxe0WrCebQptpm2c
         xHPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20130820;
        h=x-gm-message-state:mime-version:date:message-id:subject:from:to;
        bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
        b=hAOSzKjertcsQIT/PHoZKsiKxLba8gaKOCmyNg7nmiPJjCWqobNvM5nf3sZP1Xhysi
         gGdvk9mmMugII8dsjc7mRhDkbCT1QKVz/0UBQ+CaP6sK7kGdWfdarphGgzUGA6Il5JZi
         lP4DpEQHUpG1wJ1r+dN2f+UT8tyfIwapXwo3g7FnkPLxmCq9CeqJeRlagL6vAacon8z7
         CjdTHB7fzEtYToSp+cDi3+yK4zS9p4rwF4H4Ds3bJqwM/PrcFJW0YYncDHdra5TwYf6U
         K6VRX19iUhQT4kTVFCtoNW9SU8Ri+Rc5VfvVTKRh4KwZ2uW5x8y07ucB0vZcAQdEnms4
         AWnQ==
X-Gm-Message-State: AD7BkJJEDmk9P+Kzcn1MT4lQxpU1aYU6x8uABSpohCbT7EeOFAXjT1y6n3sFcRj7tcfWc6eBAOL6bJ78jvVOlQ==
MIME-Version: 1.0
X-Received: by 10.112.63.196 with SMTP id i4mr8426739lbs.93.1460378359811;
 Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Received: by 10.114.66.51 with HTTP; Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Date: Mon, 11 Apr 2016 14:39:19 +0200
Message-ID: <CADF5gVU+C4BZCSFSiWeiBipBnDu5jTU+FVmLJbSQSbtMM9JZcQ@mail.gmail.com>
Subject: test
From: Example <example@gmail.com>
To: example@test.de
Content-Type: multipart/alternative; boundary=001a1133d4405fd878053034d55a
X-Scanned-By: MIMEDefang 2.71 on 5.38.258.144

--001a1133d4405fd878053034d55a
Content-Type: text/plain; charset=UTF-8

http://www.example.com
http://www.white.com

--001a1133d4405fd878053034d55a
Content-Type: text/html; charset=UTF-8

<div dir="ltr"><div><a href="http://www.example.com">http://www.example.com</a><br></div><a href="http://www.white.com">http://www.white.com</a><br></div>

--001a1133d4405fd878053034d55a--

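You can do this with Perl. I suggest installing the Regexp::Common package from CPAN and using Regexp::Common::URI to find the URIs, then maintaining a whitelist of hostnames and checking each match against it. It's a bit long for a one-liner, though.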

use strict;
use warnings;
use Regexp::Common qw /URI/;

# URL prefixes that are allowed to stay clickable
my %whitelist = (
    'http://www.white.com' => 1,
    'http://www.example.org' => 1,
);

while (my $line = <>) {
    MATCH: foreach my $match ($line =~ /($RE{URI}{HTTP})/g ){
        # check the whitelist; \Q...\E makes each entry match literally
        next MATCH if grep { $match =~ /^\Q$_\E/i } keys %whitelist;

        # no whitelist entry, replace
        my $match_updated = $match;
        $match_updated =~ s/^http/httx/;
        # quote the URL so that . and ? in it are not treated as regex syntax
        $line =~ s/\Q$match\E/$match_updated/;
    }
    print $line;
}

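Save it under a meaningful name, maybe remove_phishing_links.pl, in a directory that iwatch can access. I'm going with ~, but I don't know whether that will work. Now you can call it from your iwatch file like this.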

<path 
  type="single" 
  syslog="on" 
  alert="off" 
  events="create" 
  exec="perl -i ~/remove_phishing_links.pl %f">/var/test</path>

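Like the sed command, it edits the file passed as %f in place. It reads the file line by line, finds HTTP URIs, checks whether each one starts with any whitelist entry, and if not, replaces http with httx.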

Note that this will not work for base64-encoded MIME emails, or if there is a line break inside a URI.
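One way to deal with the base64 limitation is to detect such mails first and leave them alone (or log them). The following is only a minimal sketch: the function name has_base64_part is made up for illustration, and it assumes the Content-Transfer-Encoding header sits on a single, unfolded line.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the mail file declares
# at least one base64-encoded MIME part, fails otherwise.
has_base64_part() {
    grep -qiE '^Content-Transfer-Encoding:[[:space:]]*base64' "$1"
}

# Possible use in a wrapper script before calling the Perl filter:
#   if has_base64_part "$file"; then echo "skipped: $file" >&2; fi
```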

If you don't want to install Regexp::Common, you can also borrow the regular expression for URIs from the URI module documentation on CPAN and change it to only look for https?.
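To see which URLs a pattern of that kind would pick up before rewriting anything, something like the following can help. This is a sketch only: the pattern is a deliberately simplified stand-in (not the expression from the URI documentation), and list_urls is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print every http/https URL found in the given
# files (or on stdin), one per line. A URL is assumed to end at
# whitespace, a double quote, or an angle bracket.
list_urls() {
    grep -ohE 'https?://[^[:space:]"<>]+' "$@"
}
```

For example, `list_urls /var/test/somefile` prints the candidates that the whitelist check would then be applied to.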

I just realized that a bash script is not necessary; the following one-liner does the job, although it is really cryptic to read.

Input data:

$ cat data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$ cat whitelist 
http://www.white.com
http://www.whitedomain.com
$

Final output:

$ sed -r '/'"$(sed -r 's#[][\\/.^$*+?(){}|]#\\&#g' whitelist | paste -s -d '|')"'/! s/http/httx/g' data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$

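Explanation:

The output of the inner subshell command is a regular expression; the outer sed substitution uses it to decide which lines to leave alone.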

$ sed -r 's#[][\\/.^$*+?(){}|]#\\&#g' whitelist | paste -s -d '|'
http:\/\/www\.white\.com|http:\/\/www\.whitedomain\.com

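Flow:

  1. The inner subshell command builds the regular expression dynamically: sed escapes all regex metacharacters in each whitelist entry, and the result is piped to paste, which joins the entries with | (alternation).
  2. The outer sed command uses that output as an address: lines that contain a whitelisted domain are skipped, and on all remaining lines http is replaced with httx.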
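The two steps can also be written out as a small script, which may be easier to read and maintain than the one-liner. This is a sketch under a few assumptions: the function names build_whitelist_regex and defang_lines are made up, blank whitelist lines are dropped, and an empty whitelist means no line is exempt.

```shell
#!/bin/sh
# Step 1: escape regex metacharacters in every whitelist entry and join
# the entries with '|' so that they form one alternation.
build_whitelist_regex() {
    sed -e '/^[[:space:]]*$/d' -e 's#[][\\/.^$*+?(){}|]#\\&#g' "$1" |
        paste -s -d '|' -
}

# Step 2: replace http with httx on every line that matches no entry.
# Arguments: whitelist file, data file.
defang_lines() {
    regex=$(build_whitelist_regex "$1")
    if [ -z "$regex" ]; then
        # Empty whitelist: nothing is exempt.
        sed 's/http/httx/g' "$2"
    else
        sed -E "/${regex}/"'! s/http/httx/g' "$2"
    fi
}
```

Calling `defang_lines whitelist data` should give the same result as the one-liner for the sample files above.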

Edit1: Since sed is line-oriented, you have to split the data into separate lines of text first, like this:

$ cat data1 
<div dir="ltr"><div><a href="http://www.white.com">http://www.white.com</a><br></div><a href="http://www.example.com">http://www.example.com</a><br></div>
$ cat whitelist 
http://www.white.com
http://www.whitedomain.com
$ sed 's/</\n</g' data1 | sed -r '/'"$(sed -r 's#[][\\/.^$*+?(){}|]#\\&#g' whitelist | paste -s -d '|')"'/! s/http/httx/g'

<div dir="ltr">
<div>
<a href="http://www.white.com">http://www.white.com
</a>
<br>
</div>
<a href="httx://www.example.com">httx://www.example.com
</a>
<br>
</div>
$
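
If changing the line structure of the mail is not acceptable, a different trick works per URL instead of per line: first rewrite every http to httx, then restore the URLs that start with a whitelist entry. This is a sketch rather than a tested production filter; the function name defang_urls is made up, every whitelist entry is assumed to start with http, and matching is by prefix only, as in the commands above.

```shell
#!/bin/sh
# Usage: defang_urls WHITELIST_FILE < input > output
defang_urls() {
    # Drop the leading "http" of each entry, escape regex metacharacters,
    # and join the entries with '|' into one alternation.
    alternation=$(sed -e '/^[[:space:]]*$/d' -e 's#^http##' \
        -e 's#[][\\/.^$*+?(){}|]#\\&#g' "$1" | paste -s -d '|' -)
    if [ -z "$alternation" ]; then
        # Empty whitelist: rewrite everything.
        sed 's/http/httx/g'
    else
        # Rewrite every http, then restore the whitelisted prefixes.
        sed -E -e 's/http/httx/g' -e "s/httx(${alternation})/http\\1/g"
    fi
}
```

With the data1 and whitelist files from above, `defang_urls whitelist < data1` keeps everything on one line and only the www.example.com links end up as httx.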