docker-compose up airflow-init hangs: no network connection between containers

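I'm trying to set up an Airflow instance with docker-compose as described in the official docs, but I'm stuck at the airflow-init step. It looks like there is no connectivity between the containers, and I don't know how to fix it.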

The docker-compose.yaml I use is exactly the one described in the docs. It can be downloaded here: https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml

At the moment, this is what I see in the shell:

~/dwn $ docker-compose up airflow-init
51ad8448b197_dwn_redis_1 is up-to-date
70409dec742c_dwn_postgres_1 is up-to-date
Starting dwn_airflow-init_1 ... done
Attaching to dwn_airflow-init_1
airflow-init_1       | BACKEND=postgresql+psycopg2
airflow-init_1       | DB_HOST=postgres
airflow-init_1       | DB_PORT=5432

The docs say I should see something like this:

airflow-init_1       | Upgrades done
airflow-init_1       | Admin user airflow created
airflow-init_1       | 2.1.2
start_airflow-init_1 exited with code 0

But the command just hangs and never exits. htop shows me that netcat is running in this container and is trying to connect to postgres:

nc -zvvn 172.19.0.3 5432

curl shows a timeout:

~/dwn $ docker exec -it dwn_airflow-init_1 curl postgres:5432
curl: (7) Failed to connect to postgres port 5432: Connection timed out
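
A timeout is a different symptom from a refusal, and the difference can help narrow things down: a refused connection means the target host answered but nothing is listening on that port, while a timeout usually means the packets got no answer at all. As a minimal sketch, the Python below classifies a TCP connection attempt. The function name `probe` and its return labels are made up for illustration and are not part of Airflow or Docker.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns "open", "refused", "timeout", "unresolved" or "error".
    These labels are illustrative only.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # Name resolution failed (compare curl error 6).
        return "unresolved"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener.
        return "refused"
    except socket.timeout:
        # No answer within the timeout (compare curl error 7 above).
        return "timeout"
    except OSError:
        # Anything else, such as "network unreachable".
        return "error"
```

Run inside the airflow-init container, `probe("postgres", 5432)` would be expected to return "timeout" in the situation described here, which points at packets being lost on the way rather than at postgres itself.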

Why does it hang?


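I tried a few things to fix this:

  1. I tried setting the ports option of the postgres service to 5432:5432 - no effect

  2. I tried setting the links option - no effect

  3. Another question suggested that the system entropy was too low - no, there is enough entropy

  4. There is enough free RAM, CPU, and disk space

  5. I tried setting up the network as suggested elsewhere - even worse, container names no longer resolve:

    ~/dwn $ docker exec -it dwn_airflow-init_1 curl postgres:5432
    curl: (6) Could not resolve host: postgres
    
  6. I tried resetting iptables as suggested elsewhere - no effect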


Some system information:


Logs! (as requested by @larsks)

~/dwn $ docker-compose ps
       Name                     Command                  State                        Ports                  
-------------------------------------------------------------------------------------------------------------
dwn_airflow-init_1   /usr/bin/dumb-init -- /ent ...   Up             8080/tcp                                
dwn_postgres_1       docker-entrypoint.sh postgres    Up (healthy)   5432/tcp                                
dwn_redis_1          docker-entrypoint.sh redis ...   Up (healthy)   0.0.0.0:6379->6379/tcp,:::6379->6379/tcp
~/dwn $ docker-compose logs postgres
Attaching to dwn_postgres_1
postgres_1           | The files belonging to this database system will be owned by user "postgres".
postgres_1           | This user must also own the server process.
postgres_1           | 
postgres_1           | The database cluster will be initialized with locale "en_US.utf8".
postgres_1           | The default database encoding has accordingly been set to "UTF8".
postgres_1           | The default text search configuration will be set to "english".
postgres_1           | 
postgres_1           | Data page checksums are disabled.
postgres_1           | 
postgres_1           | fixing permissions on existing directory /var/lib/postgresql/data ... ok
postgres_1           | creating subdirectories ... ok
postgres_1           | selecting dynamic shared memory implementation ... posix
postgres_1           | selecting default max_connections ... 100
postgres_1           | selecting default shared_buffers ... 128MB
postgres_1           | selecting default time zone ... Etc/UTC
postgres_1           | creating configuration files ... ok
postgres_1           | running bootstrap script ... ok
postgres_1           | performing post-bootstrap initialization ... ok
postgres_1           | initdb: warning: enabling "trust" authentication for local connections
postgres_1           | You can change this by editing pg_hba.conf or using the option -A, or
postgres_1           | --auth-local and --auth-host, the next time you run initdb.
postgres_1           | syncing data to disk ... ok
postgres_1           | 
postgres_1           | 
postgres_1           | Success. You can now start the database server using:
postgres_1           | 
postgres_1           |     pg_ctl -D /var/lib/postgresql/data -l logfile start
postgres_1           | 
postgres_1           | waiting for server to start....2021-07-17 07:31:38.491 UTC [47] LOG:  starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
postgres_1           | 2021-07-17 07:31:38.493 UTC [47] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres_1           | 2021-07-17 07:31:38.499 UTC [48] LOG:  database system was shut down at 2021-07-17 07:31:35 UTC
postgres_1           | 2021-07-17 07:31:38.521 UTC [47] LOG:  database system is ready to accept connections
postgres_1           |  done
postgres_1           | server started
postgres_1           | CREATE DATABASE
postgres_1           | 
postgres_1           | 
postgres_1           | /usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
postgres_1           | 
postgres_1           | 2021-07-17 07:31:39.613 UTC [47] LOG:  received fast shutdown request
postgres_1           | waiting for server to shut down....2021-07-17 07:31:39.615 UTC [47] LOG:  aborting any active transactions
postgres_1           | 2021-07-17 07:31:39.616 UTC [47] LOG:  background worker "logical replication launcher" (PID 54) exited with exit code 1
postgres_1           | 2021-07-17 07:31:39.616 UTC [49] LOG:  shutting down
postgres_1           | 2021-07-17 07:31:39.644 UTC [47] LOG:  database system is shut down
postgres_1           |  done
postgres_1           | server stopped
postgres_1           | 
postgres_1           | PostgreSQL init process complete; ready for start up.
postgres_1           | 
postgres_1           | 2021-07-17 07:31:39.741 UTC [1] LOG:  starting PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
postgres_1           | 2021-07-17 07:31:39.741 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
postgres_1           | 2021-07-17 07:31:39.741 UTC [1] LOG:  listening on IPv6 address "::", port 5432
postgres_1           | 2021-07-17 07:31:39.748 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres_1           | 2021-07-17 07:31:39.756 UTC [75] LOG:  database system was shut down at 2021-07-17 07:31:39 UTC
postgres_1           | 2021-07-17 07:31:39.781 UTC [1] LOG:  database system is ready to accept connections
postgres_1           | 2021-07-17 07:33:49.955 UTC [79] LOG:  using stale statistics instead of current ones because stats collector is not responding
postgres_1           | 2021-07-17 07:34:00.040 UTC [79] LOG:  using stale statistics instead of current ones because stats collector is not responding
postgres_1           | 2021-07-17 07:34:00.049 UTC [235] LOG:  using stale statistics instead of current ones because stats collector is not responding
postgres_1           | 2021-07-17 07:34:10.141 UTC [79] LOG:  using stale statistics instead of current ones because stats collector is not responding

When I edit the postgres service to make it reachable from the host (the ports option), I can see that it really is up:

~/dwn $ pg_isready -h localhost -p 5432
localhost:5432 - accepting connections

The network created by docker-compose looks like this:

[
    {
        "Name": "dwn_default",
        "Id": "8c4e4ab1629cd7d2cb5d532e28b0837a11bc3516ba094248294e5d734a69dc11",
        "Created": "2021-07-17T10:15:50.694208715+02:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "2c6dd1bcd0d81740ab17ff7816acd983ff053be2a8f886ef281b3e5ec1ec642b": {
                "Name": "dwn_airflow-init_1",
                "EndpointID": "945c9bd23ffb52bdee7ae9fdf32f48be623ac73cd60a5b248f919fce6aede366",
                "MacAddress": "02:42:ac:13:00:04",
                "IPv4Address": "172.19.0.4/16",
                "IPv6Address": ""
            },
            "3a79a194d97e491c75e573fa78492c9d4f73efd4d868e709c20eb23c9a0ff2a6": {
                "Name": "dwn_postgres_1",
                "EndpointID": "b3245b8ab82edc78b205485cd39c368881d7c7b2bc29f325fd3f6f6d8605d9c1",
                "MacAddress": "02:42:ac:13:00:03",
                "IPv4Address": "172.19.0.3/16",
                "IPv6Address": ""
            },
            "dd023f1d42be72d967c5045b7be29deca88caf99377e7d144c51f2212059cefa": {
                "Name": "dwn_redis_1",
                "EndpointID": "f85a6cd841028efb7fab17e40f814b0d9de300e90f9506df373d973695a38d97",
                "MacAddress": "02:42:ac:13:00:02",
                "IPv4Address": "172.19.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "default",
            "com.docker.compose.project": "dwn",
            "com.docker.compose.version": "1.29.2"
        }
    }
]
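
The inspect output above can also be checked mechanically. The sketch below assumes the JSON comes from something like `docker network inspect dwn_default`; the helper names `container_addresses` and `all_in_subnet` are hypothetical, and `SAMPLE` is a trimmed copy of the output above with shortened container IDs. It extracts each container's IPv4 address and confirms that they all sit inside the network's subnet.

```python
import ipaddress
import json

# Trimmed copy of the inspect output above (container IDs shortened).
SAMPLE = """
[{"Name": "dwn_default",
  "IPAM": {"Config": [{"Subnet": "172.19.0.0/16", "Gateway": "172.19.0.1"}]},
  "Containers": {
    "2c6dd1bcd0d8": {"Name": "dwn_airflow-init_1", "IPv4Address": "172.19.0.4/16"},
    "3a79a194d97e": {"Name": "dwn_postgres_1", "IPv4Address": "172.19.0.3/16"},
    "dd023f1d42be": {"Name": "dwn_redis_1", "IPv4Address": "172.19.0.2/16"}}}]
"""


def container_addresses(inspect_json: str) -> dict:
    """Map container name -> IPv4 address for the first network in the JSON."""
    network = json.loads(inspect_json)[0]
    return {
        c["Name"]: ipaddress.ip_interface(c["IPv4Address"]).ip
        for c in network["Containers"].values()
    }


def all_in_subnet(inspect_json: str) -> bool:
    """True if every container address lies in the first IPAM subnet."""
    network = json.loads(inspect_json)[0]
    subnet = ipaddress.ip_network(network["IPAM"]["Config"][0]["Subnet"])
    return all(ip in subnet for ip in container_addresses(inspect_json).values())


for name, ip in sorted(container_addresses(SAMPLE).items()):
    print(name, ip)
print("all in subnet:", all_in_subnet(SAMPLE))
```

For the output above, all three containers are in 172.19.0.0/16, so the addressing itself looks consistent. Note that the sketch raises a ValueError if a container has an empty IPv4Address.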

@jarek-potiuk suggested that I check the IPv6 configuration. It still doesn't work, but this time I got some errors. Here is what I did:

I created /etc/docker/daemon.json with the following content:

{
  "ipv6": true,
  "fixed-cidr-v6": "2001:db8:1::/64"
}

This resulted in the following error (after restarting the daemon):

could not find an available, non-overlapping IPv6 address pool among the defaults to assign to the network

This error can be fixed by setting network_mode: bridge for every service in the compose file, and now my services have IPv6 addresses:

[
    {
        "Name": "bridge",
        "Id": "092767c3c4137429a7caaa85a1b87c7cb977c4f02055624fa84c4d586ed9758f",
        "Created": "2021-07-17T14:42:08.353393246+02:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                },
                {
                    "Subnet": "2001:db8:1::/64",
                    "Gateway": "2001:db8:1::1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "964c9edadb8f7eb757cd7f1296c2af154ab407ef4d9872f8e613f61d64d6a443": {
                "Name": "dwn_postgres_1",
                "EndpointID": "a19bd83ff487611e78074eddafbca18e545edcf9ddc9d7851d3b6d68b7962419",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": "2001:db8:1::242:ac11:2/64"
            },
            "b45ca546c1539f5f0f1d76423bd4f071efed2e3d6e118b8811e3fd28164fab5a": {
                "Name": "dwn_airflow-init_1",
                "EndpointID": "3a2fc42dfda6a534b6840971f4b11af9c78aac2253a036f46721ed6e5659f7b9",
                "MacAddress": "02:42:ac:11:00:04",
                "IPv4Address": "172.17.0.4/16",
                "IPv6Address": "2001:db8:1::242:ac11:4/64"
            },
            "f140d9c90c24fca254e34aec549b559ec5f82bc8b14537e7249192e604110d53": {
                "Name": "dwn_redis_1",
                "EndpointID": "1c26f7afa8ada58626b67e7446347e1c4d540513df72784addcf334f99fd53d1",
                "MacAddress": "02:42:ac:11:00:03",
                "IPv4Address": "172.17.0.3/16",
                "IPv6Address": "2001:db8:1::242:ac11:3/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

But there is another problem - name resolution stopped working:

~/dwn $ docker-compose up airflow-init
dwn_postgres_1 is up-to-date
dwn_redis_1 is up-to-date
Starting dwn_airflow-init_1 ... done
Attaching to dwn_airflow-init_1
airflow-init_1       | BACKEND=postgresql+psycopg2
airflow-init_1       | DB_HOST=postgres
airflow-init_1       | DB_PORT=5432
airflow-init_1       | ....................
airflow-init_1       | ERROR! Maximum number of retries (20) reached.
airflow-init_1       | 
airflow-init_1       | Last check result:
airflow-init_1       | $ run_nc 'postgres' '5432'
airflow-init_1       | Traceback (most recent call last):
airflow-init_1       |   File "<string>", line 1, in <module>
airflow-init_1       | socket.gaierror: [Errno -3] Temporary failure in name resolution
airflow-init_1       | Can't parse  as an IP address
airflow-init_1       | 
dwn_airflow-init_1 exited with code 1

This is actually documented behaviour, but access by IP address doesn't work either:

~/dwn $ docker exec -i -t dwn_airflow-init_1 sh -c 'echo "PING" | nc -v 172.17.0.3 6379'
172.17.0.3: inverse host lookup failed: Host name lookup failure
^C
~/dwn $ echo "PING" | ncat -v localhost 6379
Ncat: Version 7.91 ( https://nmap.org/ncat )
Ncat: Connected to ::1:6379.
+PONG
Ncat: 5 bytes sent, 7 bytes received in 0.01 seconds.

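I also found that disabling IPv6 at the daemon level does not disable it inside the containers, so I tried disabling it in the postgres container. That worked as expected: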

~/dwn $ docker exec -i -t dwn_postgres_1 cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1

But the network is still unreachable.

I'm out of ideas now.

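A very detailed analysis. It's good to see someone take that many steps to dig into a problem.

Everything in your setup and logs looks fine, so I don't think this is a docker-compose problem; it must be something in your environment.

I did notice one thing, though. I'm not 100% sure, but it might be the cause.

Your postgres server listens on both IPv4 and IPv6, but your docker-compose network only shows IPv4 addresses.

My assumption is that although you have IPv6 enabled for your Docker engine, IPv6 is in fact disabled (or misconfigured).

What happens then is that when the postgres address is resolved over IPv6, the lookup hangs while retrieving the address through the misconfigured DNS, hence the timeout.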
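
One way to test this hypothesis is to look at which address families the name postgres actually resolves to from inside the container. This is only a sketch built on Python's standard `socket.getaddrinfo`; the helper name `address_families` is made up for illustration.

```python
import socket


def address_families(host: str, port: int) -> list:
    """Return the sorted names of the address families that host resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return sorted({info[0].name for info in infos})


# A literal IPv4 address needs no DNS lookup and yields a single family.
print(address_families("127.0.0.1", 5432))
```

If `address_families("postgres", 5432)` run in the airflow-init container lists only AF_INET, the IPv6 theory becomes less likely. If the call itself hangs or raises `socket.gaierror`, the problem is in name resolution rather than in the connection.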

You can try setting ipv6 to false in /etc/docker/daemon.json (https://docs.docker.com/config/daemon/ipv6/) and restarting the daemon:

{
  "ipv6": false
}

Well, I figured it out myself.

TL;DR: PEBKAC - the user misconfigured the firewall and forgot that he had told the kernel to drop forwarded packets.


Let's start from the beginning: docker-compose up airflow-init prints only this and then waits for something:

~/dwn $ docker-compose up airflow-init
51ad8448b197_dwn_redis_1 is up-to-date
70409dec742c_dwn_postgres_1 is up-to-date
Starting dwn_airflow-init_1 ... done
Attaching to dwn_airflow-init_1
airflow-init_1       | BACKEND=postgresql+psycopg2
airflow-init_1       | DB_HOST=postgres
airflow-init_1       | DB_PORT=5432

Maybe the host postgres points somewhere strange:

~/dwn $ docker exec -i -t dwn_airflow-init_1 host postgres
postgres has address 172.20.0.2

Not really, it looks like any other Docker IP, but this netcat call still hangs:

 nc -zvvn 172.20.0.2 5432

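This means the postgres service does not respond to the airflow-init container at all. However, postgres does respond to requests from the host system. So there is no route between postgres and airflow-init, even though they are on the same network. Maybe the kernel drops forwarded packets?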

~ # sysctl net/ipv4/conf/all/forwarding
net.ipv4.conf.all.forwarding = 1
~ # sysctl net/ipv6/conf/all/forwarding
net.ipv6.conf.all.forwarding = 1

Forwarding is enabled. Maybe the firewall drops them?

~ # iptables -S FORWARD
-P FORWARD ACCEPT
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o br-b4a6c0b51ae7 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o br-b4a6c0b51ae7 -j DOCKER
-A FORWARD -i br-b4a6c0b51ae7 ! -o br-b4a6c0b51ae7 -j ACCEPT
-A FORWARD -i br-b4a6c0b51ae7 -o br-b4a6c0b51ae7 -j ACCEPT
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT

It looks like they can pass. Maybe Docker is broken somehow? Reinstalled, rebooted, still the same.

Maybe something is wrong with iptables?

~ # pacman -S iptables
resolving dependencies...
looking for conflicting packages...
:: iptables and iptables-nft are in conflict. Remove iptables-nft? [y/N]

Oh. I have nftables installed. That's odd. How is my firewall actually managed?

~ # systemctl status iptables nftables
○ iptables.service - IPv4 Packet Filtering Framework
     Loaded: loaded (/usr/lib/systemd/system/iptables.service; disabled; vendor preset: disabled)
     Active: inactive (dead)

● nftables.service - Netfilter Tables
     Loaded: loaded (/usr/lib/systemd/system/nftables.service; enabled; vendor preset: disabled)
     Active: active (exited) since Mon 2021-07-19 10:30:41 CEST; 6h ago
       Docs: man:nft(8)
    Process: 824 ExecStart=/usr/bin/nft -f /etc/nftables.conf (code=exited, status=0/SUCCESS)
   Main PID: 824 (code=exited, status=0/SUCCESS)
        CPU: 9ms

And... what does the forward chain look like?

~ # nft list chain inet filter forward                               
table inet filter {
    chain forward {
        type filter hook forward priority filter; policy accept;
        drop
    }
}

Hmm...

Now, what happens if I delete that chain:

~ # nft delete chain inet filter forward

Suddenly, airflow started printing a lot in the second terminal. Goal achieved:

airflow-init_1       | Admin user airflow created
airflow-init_1       | 2.1.2
dwn_airflow-init_1 exited with code 0
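
Why did iptables -S look fine? The rules it printed and the native nftables table inet filter are separate rule sets that both attach to the forward hook, and as far as I understand nftables, a packet accepted by one base chain is still evaluated by the other base chains on the same hook, while a drop is final. A rough way to spot this kind of rule is to scan the output of nft list ruleset for forward-hook chains that drop everything. The sketch below is a simplified line-based scan, not a real nftables parser; the function name `forward_drop_chains` is hypothetical, and `SAMPLE` is the chain shown above.

```python
import re

# The chain listed above, as printed by nft.
SAMPLE = """
table inet filter {
    chain forward {
        type filter hook forward priority filter; policy accept;
        drop
    }
}
"""


def forward_drop_chains(ruleset: str) -> list:
    """Return (table, chain, reason) for forward-hook chains that drop everything."""
    findings = []
    table = chain = None
    on_forward_hook = False
    for raw in ruleset.splitlines():
        line = raw.strip()
        m = re.match(r"table\s+(\S+\s+\S+)\s*\{", line)
        if m:
            table = m.group(1)
            continue
        m = re.match(r"chain\s+(\S+)\s*\{", line)
        if m:
            chain, on_forward_hook = m.group(1), False
            continue
        if chain is None:
            continue
        if line == "}":
            chain = None  # end of the current chain
            continue
        if re.search(r"\bhook\s+forward\b", line):
            on_forward_hook = True
            if re.search(r"\bpolicy\s+drop\b", line):
                findings.append((table, chain, "policy drop"))
        elif on_forward_hook and line == "drop":
            # A bare "drop" statement matches every packet.
            findings.append((table, chain, "unconditional drop rule"))
    return findings


print(forward_drop_chains(SAMPLE))
```

Also note that nft delete chain only changes the running ruleset. Since nftables.service loads /etc/nftables.conf when it starts (see the ExecStart line above), the chain will presumably come back after a reboot unless that file is edited as well.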