Can't establish connection over second NIC (two hops)

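We are having trouble configuring network routing on Ubuntu Xenial.

We have many servers, some running Debian 8.4 (Jessie) and some running Ubuntu 16.04.2 (Xenial), with exactly the same network setup (or at least as far as we can see).

They all have two NICs connected to two VLANs (say "A" and "B"), and both VLANs are reachable from other VLANs, for example from VLAN "C".

Both /etc/network/interfaces files are of the form: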

NOTE: I faked names and IPs for the sake of better readability.

# VLAN A
auto eth0
iface eth0 inet static
address 192.168.111.xxx
netmask 255.255.255.0
broadcast 192.168.111.255
network 192.168.111.0
gateway 192.168.111.254
dns-nameservers 192.168.111.25 192.168.111.26

# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.xxx
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
gateway 192.168.222.254 # <-- (Commented out in Ubuntu machine)
dns-nameservers 192.168.111.25 192.168.111.26

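...assuming xxx is 100 for the Debian machine and 200 for the Ubuntu machine, I am trying to ping from 192.168.1.10 in VLAN "C" to the following addresses:

VLAN "B" is mainly used for backups and other "background" traffic, to avoid saturation problems in VLAN "A".

I know that reaching the same machine over two network paths is not a common setup and, I must say, being able to connect through only one of them is not a big problem for now. But what strikes me is: why can I reach the Debian machine and not the Ubuntu one?

On the other hand, if it worked well on both platforms, we could consider closing some services (such as ssh and backend interfaces) on NIC "A" to improve security (our firewall only allows access to VLAN "B" from our IT staff VLAN).

Of course, as noted in the interfaces snippet above, the gateway line is commented out on the Ubuntu machine, but that is because network initialization fails on that machine otherwise. That is, in fact, what we are trying to solve.

But the routing tables of both machines are almost identical. The only difference I can see is the onlink flag on the Ubuntu machine: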

myUser@debianMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0  proto kernel  scope link  src 192.168.111.100
192.168.222.0/24 dev eth1  proto kernel  scope link  src 192.168.222.100


myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0 onlink
192.168.111.0/24 dev eth0  proto kernel  scope link  src 192.168.111.200
192.168.222.0/24 dev eth1  proto kernel  scope link  src 192.168.222.200

...but I was able to remove it with the following command:

myUser@ubuntuMachine:~$ sudo ip route replace default via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0  proto kernel  scope link  src 192.168.111.200
192.168.222.0/24 dev eth1  proto kernel  scope link  src 192.168.222.200

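It didn't solve the problem.

After that, I also tried uncommenting the gateway line of 'VLAN B' (which, as I said, was commented out in the /etc/network/interfaces file) and restarting networking, but this is what happened: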

myUser@ubuntuMachine:~$ sudo /etc/init.d/networking restart
[....] Restarting networking (via systemctl): networking.serviceJob for networking.service failed because the control process exited with error code. See "systemctl status networking.service" and "journalctl -xe" for details.
failed!

...and the onlink flag came back.

As a note, after commenting out that line again and issuing a new /etc/init.d/networking restart command, the output stays the same until the machine is rebooted (even though networking, apart from the VLAN B default gateway issue, continues working as usual).

Here is the output of the suggested commands:

myUser@ubuntuMachine:~$ sudo systemctl status networking.service
● networking.service - Raise network interfaces
   Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
  Drop-In: /run/systemd/generator/networking.service.d
           └─50-insserv.conf-$network.conf
   Active: failed (Result: exit-code) since jue 2017-12-21 14:55:29 CET; 42s ago
     Docs: man:interfaces(5)
  Process: 8552 ExecStop=/sbin/ifdown -a --read-environment --exclude=lo (code=exited, status=0/SUCCESS)
  Process: 8940 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
  Process: 8934 ExecStartPre=/bin/sh -c [ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-envi
 Main PID: 8940 (code=exited, status=1/FAILURE)

dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILUR
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.

...and the meaningful part of sudo journalctl -xe:

dic 21 14:55:29 ubuntuMachine sudo[8922]:   myUser : TTY=pts/0 ; PWD=/home/myUser ; USER=root ; COMMAND=/etc/init.d/networking restart
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session opened for user root by myUser(uid=0)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
-- Subject: Unit networking.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has finished shutting down.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
-- Subject: Unit networking.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has begun starting up.
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
-- Subject: Unit networking.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has failed.
--
-- The result is failed.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session closed for user root
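The "RTNETLINK answers: File exists" line is what ip reports when the route it is asked to add already exists. Here that would be consistent with ifup trying to add a second default route to the main table, although the log alone does not prove it. A minimal sketch for spotting that situation follows; count_main_gateways is a hypothetical helper name, it only reads the single file it is given (it does not follow source directives), and treating "table" on a gateway line as meaningful is an assumption taken from the usage shown later in this thread.

```shell
#!/bin/sh
# count_main_gateways [FILE]
# Count the active "gateway" lines in an interfaces(5)-style file that do
# not name a routing table, i.e. those that would each try to add a
# default route to the main table. More than one is a likely source of
# "RTNETLINK answers: File exists".
count_main_gateways() {
    awk '
        $1 ~ /^#/ { next }                 # skip commented-out lines
        $1 == "gateway" {
            has_table = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # ignore a trailing annotation
                if ($i == "table") has_table = 1
            }
            if (!has_table) n++
        }
        END { print n + 0 }
    ' "${1:-/etc/network/interfaces}"
}
```

Calling count_main_gateways with no argument inspects /etc/network/interfaces; a result greater than 1 suggests two default routes competing for the main table.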

I googled a lot and was able to find some related information, but none of it fully answers my question:

myUser@bothMachines:~$ sudo cat /etc/iproute2/rt_tables
#
# reserved values
#
255     local
254     main
253     default
0       unspec
#
# local
#
#1      inr.ruhep

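So my final hypothesis is that this may simply be an implementation difference between kernel versions and, since the Ubuntu one is the more recent, this may be the correct behaviour, so that on modern kernels I need to use two different routing tables (but I'm not sure, and I don't know why...).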

myUser@debianMachine:~$ sudo uname -a
Linux debianMachine 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux

myUser@ubuntuMachine:~$ sudo uname -a
Linux ubuntuMachine 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

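So, the question is:

Are we doing something wrong on the Ubuntu machine (or is there some bug in it)? Or, on the contrary, is this the correct behaviour and we are forced to set up a more complex routing scheme (per-VLAN routes or two routing tables) to get both default gateways working again?

EDIT:

Now I tried adding a static route to solve the problem: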

myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1

...but this froze my ssh connection (through NIC A), even though I could connect through NIC B (at 192.168.111.200).

Having both routes at the same time seems to be impossible:

myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.222.254 dev eth1
RTNETLINK answers: File exists

EDIT 2:

I finally found the Linux Advanced Routing & Traffic Control HOWTO, which seems to be more accurate than all the other documentation I found, and specifically in its Chapter 4. Rules - routing policy database I see the following text:

If you want to use this feature, make sure that your kernel is compiled with the "IP: advanced router" and "IP: policy routing" features

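...so everything points to my previous hypothesis about kernel implementation differences being correct, with the specific difference being whether these two features were compiled in.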
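That hypothesis can be checked directly instead of assumed. The sketch below looks for the two options in a kernel config file: CONFIG_IP_ADVANCED_ROUTER corresponds to "IP: advanced router" and CONFIG_IP_MULTIPLE_TABLES to "IP: policy routing". The function name show_policy_routing_options is hypothetical, and the default path assumes the Debian/Ubuntu convention of shipping the config as /boot/config-<release>.

```shell
#!/bin/sh
# show_policy_routing_options [FILE]
# Report whether the two kernel options the HOWTO mentions are built in.
show_policy_routing_options() {
    # Default to the config of the running kernel (Debian/Ubuntu layout).
    cfg="${1:-/boot/config-$(uname -r)}"
    for opt in CONFIG_IP_ADVANCED_ROUTER CONFIG_IP_MULTIPLE_TABLES; do
        if grep -q "^${opt}=y" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
        fi
    done
}
```

Running it on both the Debian and the Ubuntu machine and comparing the output would show whether the kernels really differ in this respect.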

This is not an authoritative answer, but my first working attempt (applying what I managed to understand):

sudo ip route add 192.168.1.0/24 via 192.168.222.254 from 192.168.222.200 dev eth1 table 253 
sudo ip rule add from 192.168.222.200 table 253

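Update: the from and dev arguments in the ip route command aren't required (it works perfectly well without them).

...after issuing the first command I couldn't connect, but after issuing the second one I could.

The logic behind it comes from this text I found in this document: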

Linux-2.x can pack routes into several routing tables identified by a number in the range from 1 to 255 or by name from the file /etc/iproute2/rt_tables. By default all normal routes are inserted into the main table (ID 254) and the kernel only uses this table when calculating routes.

Actually, one other table always exists, which is invisible but even more important. It is the local table (ID 255). This table consists of routes for local and broadcast addresses. The kernel maintains this table automatically and the administrator usually need not modify it or even look at it.

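In fact, I ended up using another routing table, identified by its id (253) rather than by what I now understand is just an alias (defined in the /etc/iproute2/rt_tables file).

...checking that file again, I now see that an alias ("default") is already defined for that routing table (next to "main" which, as the text I pasted above says, is indeed 254).

I still don't know the logic behind this naming (I mean "default" for routing table 253), or whether, for any reason, it would be better to use low-numbered routing tables (1, 2, 3...) as this solution (already mentioned in the question) does.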
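To see which number a table alias stands for without reading the file by eye, something like the following could be used. It is only a sketch: table_id is a hypothetical helper name, and it assumes the two-column "ID name" layout of the rt_tables file shown in the question.

```shell
#!/bin/sh
# table_id NAME [FILE]
# Print the numeric ID that FILE maps NAME to (prints nothing if the
# alias is not defined or is commented out).
table_id() {
    awk -v name="$1" '
        $1 ~ /^#/ { next }             # skip comments such as "#1 inr.ruhep"
        $2 == name { print $1; exit }
    ' "${2:-/etc/iproute2/rt_tables}"
}
```

With the file from the question, table_id default prints 253 and table_id main prints 254, so "table 253" and "table default" name the same table.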

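However, for the sake of simplicity, if we don't intend to build complex routing policies but just want to solve this connection problem, I guess a good option could be a solution like this (not yet tested):
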
gateway 192.168.222.254 table 253
post-up ip rule add from 192.168.222.200 table 253

I still need to test whether I need an additional via 192.168.222.254 in the gateway line, or whether it won't work at all and I need to add the route with another post-up command instead.

I will update this answer with the results.

EDIT 1: The same works for the default route:

sudo ip route add default from 192.168.222.200 via 192.168.222.254 table 253
sudo ip rule add from 192.168.222.200 table 253
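Since a wrong route can cut off the very ssh session being used (as happened with the static route in the question), it may help to preview the commands before running them. The sketch below wraps the two commands above in a dry-run switch and then lists the table and the rules so the result can be checked. The names run, add_source_route and DRY_RUN are hypothetical; the ip subcommands are the ones already used in this thread plus ip route show table and ip rule list.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# add_source_route SRC GATEWAY TABLE
# Add a default route to TABLE, add a rule sending traffic whose source
# address is SRC to that table, then show what ended up installed.
add_source_route() {
    run ip route add default via "$2" table "$3" &&
    run ip rule add from "$1" table "$3" &&
    run ip route show table "$3" &&
    run ip rule list
}
```

For example, DRY_RUN=1 followed by add_source_route 192.168.222.200 192.168.222.254 253 only prints the four commands; without DRY_RUN the function has to be run as root.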

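EDIT 2: First (now fully ¹) working approach

After playing for a while on a test machine, I think the best solution is to add the following lines to the second NIC's configuration in the /etc/network/interfaces file: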

gateway 192.168.222.254 table 1
post-up ip rule add from 192.168.222.200 table 1
pre-down ip rule del from 192.168.222.200 table 1
post-up ip route add 192.168.222.0/24 dev eth1 src 192.168.222.200 table 1

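Comments:

  • Appending table 1 to the gateway keyword works well, so no additional (less readable) post-up command is needed to add the default route.

    • ...in fact, using a specific table (other than main) for the first NIC, together with a rule similar to the one we use for the second NIC, would be a bad idea, because that rule would only apply when 192.168.111.200 is used as the source address, so there would be no "default default gateway". Keeping the first NIC's configuration in the main routing table makes all ("locally generated") outgoing connections to remote LANs go through our first default gateway.

  • The first post-up command adds a rule saying that packets with this NIC's source address should be routed using table 1 (otherwise our new default gateway would not be used).

  • The pre-down command removes that rule. It isn't mandatory, but without it every restart of the networking service would add a duplicate of the rule.

  • I also tried using dev eth1 instead of from 192.168.222.200 (to avoid repeating the network address), but it didn't work. I guess that which NIC the "response" packets will use is "not yet decided" at that point.

  • I used table 1 for eth1 (our second NIC), and I could use table 2 for an eventual third one, and so on. There is no need to specify any table/rule for the first NIC, since it uses the main table (not "default": see the note below).

  • Finally (¹), the second post-up command makes everything work well because (as I now realize) only one routing table (the first matching one) is used, so the default network route (automatically created when the interface comes up) does not apply, since it is created in table main.

    • I still don't know whether there is a way to force it to be created directly in table 1.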
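Because the same four lines would be needed again, with other values, for an eventual third NIC, they can be generated rather than retyped. This is only a sketch: second_nic_lines is a hypothetical helper, and the "gateway ... table N" form is taken from this answer as reported to work, not from the interfaces(5) documentation.

```shell
#!/bin/sh
# second_nic_lines ADDRESS GATEWAY NETWORK DEVICE TABLE
# Print the extra interfaces(5) lines for one additional NIC whose
# traffic should be routed through its own table.
second_nic_lines() {
    cat <<EOF
gateway $2 table $5
post-up ip rule add from $1 table $5
pre-down ip rule del from $1 table $5
post-up ip route add $3 dev $4 src $1 table $5
EOF
}
```

Calling second_nic_lines 192.168.222.200 192.168.222.254 192.168.222.0/24 eth1 1 reproduces the lines above; a third NIC would use its own address, gateway, network and device with table 2.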

NOTE: With the sudo ip rule list command we can see the current routing rules:

0:      from all lookup local 
32765:  from 192.168.222.200 lookup 1 
32766:  from all lookup main 
32767:  from all lookup default

As far as I understand, rules are added with decreasing priority numbers, from 32767 down towards 0, and are tried in increasing order until one matches. The last two rules and rule "0" were already defined by default. The last two are there because of the logic I quoted earlier from this document, but that document says that rules start from "1", so I guess "0" must also be some predefined "default starting point".
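The lookup order can be made explicit by filtering the rule list for one source address. The sketch below reads ip rule list output on standard input and prints, in priority order, the tables that would be consulted. tables_for_source is a hypothetical helper; it only understands "from all" and exact "from <address>" selectors, so rules with prefixes, fwmark or interface selectors are ignored. As far as I know, when the selected table holds no route for the destination the kernel simply moves on to the next rule, which is why main and default still appear after table 1.

```shell
#!/bin/sh
# tables_for_source SRC
# Read "ip rule list" output on stdin and print the tables looked up,
# in order, for packets whose source address is SRC.
tables_for_source() {
    awk -v src="$1" '
        $2 == "from" && $4 == "lookup" && ($3 == "all" || $3 == src) {
            out = out (out == "" ? "" : " ") $5
        }
        END { print out }
    '
}
```

Typical use would be ip rule list piped into tables_for_source 192.168.222.200. If the default position of a rule is not wanted, ip rule add also accepts an explicit priority argument.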

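EDIT 3:

As I said in Edit 2 (of the question), I found that this Linux Advanced Routing & Traffic Control HOWTO helped me a lot to clarify things.

Specifically, the Routing for multiple uplinks/providers chapter was very useful for understanding setups with "network loops" (even though, in our case, we don't act as a router to the Internet).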