Permissions error trying to run Prometheus on AWS EKS (Fargate only) with EFS

Question

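I have an EKS cluster that is Fargate only. I really don't want to have to manage instances myself. I'd like to deploy Prometheus to it, which requires a persistent volume. As of two months ago this should be possible with EFS (a managed NFS share). I feel that I am almost there, but I cannot figure out what the current issue is.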

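What I have done:

Set up an EKS Fargate cluster and a suitable Fargate profile
Set up an EFS with an appropriate security group
Installed the CSI driver as per the AWS walkthrough

All good so far.

I have set up the persistent volumes (which, as I understand it, must be done statically) with: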

kubectl apply -f pvc/

where

tree pvc/
pvc/
├── two_pvc.yml
└── ten_pvc.yml

and

cat pvc/*

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-two
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-ten
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234

Then

helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
    --namespace prometheus \
    --set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"

What happens?

The prometheus alertmanager comes up fine with its PVC. So do the other pods for this deployment, but the prometheus server goes into CrashLoopBackOff with

invalid capacity 0 on filesystem

Diagnostics

kubectl get pv -A
NAME                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                               STORAGECLASS   REASON   AGE
efs-pv-ten                    8Gi        RWO            Retain           Bound      prometheus/myrelease-helm-02-prometheus-server         efs-sc                  11m
efs-pv-two                    2Gi        RWO            Retain           Bound      prometheus/myrelease-helm-02-prometheus-alertmanager   efs-sc                  11m

and

kubectl get pvc -A
NAMESPACE    NAME                                     STATUS   VOLUME       CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus   myrelease-helm-02-prometheus-alertmanager   Bound    efs-pv-two   2Gi        RWO            efs-sc         12m
prometheus   myrelease-helm-02-prometheus-server         Bound    efs-pv-ten   8Gi        RWO            efs-sc         12m

describe pod only shows 'error'.

Finally, this (from a colleague):

level=info ts=2020-10-09T15:17:08.898Z caller=main.go:346 msg="Starting Prometheus" version="(version=2.21.0, branch=HEAD, revision=e83ef207b6c2398919b69cd87d2693cfc2fb4127)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:347 build_context="(go=go1.15.2, user=root@a4d9bea8479e, date=20200911-11:35:02)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:348 host_details="(Linux 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 myrelease-helm-02-prometheus-server-85765f9895-vxrkn (none))"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:349 fd_limits="(soft=1024, hard=4096)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:350 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2020-10-09T15:17:08.901Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/data/queries.active err="open /data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7fffeb6e85ee, 0x5, 0x14, 0x30ca080, 0xc000d43620, 0x30ca080)
    /app/promql/query_logger.go:117 +0x4cf
main.main()
    /app/cmd/prometheus/main.go:377 +0x510c

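Other than noting that there is a permissions issue, I am perplexed: I know that the storage 'works' and is accessible, since the other pod in the deployment seems happy with it, but this one is not.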

Answer 1

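Working now; writing it up here for the common good. Thanks to /u/EmiiKhaos on reddit for the suggestions on where to look.

Problem:

The EFS share is root:root only, and Prometheus forbids running pods as root.

Solution:

Create an EFS access point for each pod requiring a persistent volume, to permit access for a specified user.
Specify these access points for the persistent volumes.
Apply a suitable security context to run the pods as the matching user.

Method:

Create 2 EFS access points, something like: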

{
    "Name": "prometheuserver",
    "AccessPointId": "fsap-<hex01>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 500,
        "Gid": 500,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheuserver",
        "CreationInfo": {
            "OwnerUid": 500,
            "OwnerGid": 500,
            "Permissions": "0755"
        }
    }
},
{
    "Name": "prometheusalertmanager",
    "AccessPointId": "fsap-<hex02>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 501,
        "Gid": 501,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheusalertmanager",
        "CreationInfo": {
            "OwnerUid": 501,
            "OwnerGid": 501,
            "Permissions": "0755"
        }
    }
}
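
For reference, access points like the two above could be created from the AWS CLI. The following is a minimal sketch, not the exact commands used for this answer: the helper name create_access_point and the DRY_RUN switch are illustrative assumptions, and the uid, gid, path, and file system ID are copied from the JSON above. With DRY_RUN left at its default of 1 the script only prints the commands.

```shell
#!/bin/sh
# Sketch: build the `aws efs create-access-point` call for one access point.
# create_access_point and DRY_RUN are hypothetical names, not part of the AWS CLI.
FS_ID="fs-ec0e1234"

create_access_point() {
    name="$1"; uid="$2"; gid="$3"
    # Reuse the positional parameters to hold the full command line.
    set -- aws efs create-access-point \
        --file-system-id "$FS_ID" \
        --posix-user "Uid=$uid,Gid=$gid,SecondaryGids=2000" \
        --root-directory "Path=/$name,CreationInfo={OwnerUid=$uid,OwnerGid=$gid,Permissions=0755}" \
        --tags "Key=Name,Value=$name"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"      # print the command instead of calling AWS
    else
        "$@"           # actually create the access point
    fi
}

create_access_point prometheuserver 500 500
create_access_point prometheusalertmanager 501 501
```

Run with DRY_RUN=0 to create the access points; the AccessPointId in each response is what goes into the persistent volumes.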

Update my persistent volumes:

kubectl apply -f pvc/

to something like:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusalertmanager
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex02>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusserver
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex01>
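
The two persistent volumes differ only in name, size, and access point, so they could also be generated rather than written by hand. This is a sketch under that assumption; render_pv is a hypothetical helper, and the access point ID passed to it is the placeholder from above, to be replaced with the real one.

```shell
#!/bin/sh
# Sketch: render one statically provisioned EFS PersistentVolume.
# render_pv is a hypothetical helper; field values mirror the manifests above.
FS_ID="fs-ec0e1234"

render_pv() {
    name="$1"; size="$2"; access_point="$3"
    cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: $name
spec:
  capacity:
    storage: $size
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    # <file system id>::<access point id> mounts through the access point
    volumeHandle: ${FS_ID}::${access_point}
EOF
}

# "fsap-<hex01>" is the placeholder used above; substitute the real ID.
render_pv prometheusserver 8Gi "fsap-<hex01>"
```

The output can be piped to kubectl apply -f - instead of being saved under pvc/.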

Re-install Prometheus as before:

helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
    --namespace prometheus \
    --set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"

Take an educated guess from

kubectl describe pod myrelease-helm-02-prometheus-server -n prometheus

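and

kubectl describe pod myrelease-helm-02-prometheus-alertmanager -n prometheus

at which containers need to be specified when setting the security context. Then apply a security context to run the pods with the appropriate uid:gid, e.g. with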

kubectl apply -f setpermissions/

where

cat setpermissions/*

gives

apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-alertmanager
spec:
  securityContext:
    runAsUser: 501
    runAsGroup: 501
    fsGroup: 501
  volumes:
    - name: prometheusalertmanager
  containers:
    - name: prometheusalertmanager
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 501
        allowPrivilegeEscalation: false
apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-server
spec:
  securityContext:
    runAsUser: 500
    runAsGroup: 500
    fsGroup: 500
  volumes:
    - name: prometheusserver
  containers:
    - name: prometheusserver
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 500
        allowPrivilegeEscalation: false
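
A possible alternative to separate Pod manifests is to set the same uid:gid through Helm values, so the chart renders the security context itself. This is only a sketch and rests on an assumption: that the chart version in use exposes server.securityContext and alertmanager.securityContext, which is worth confirming with helm show values prometheus-community/prometheus. The file name efs-values.yaml is hypothetical.

```shell
#!/bin/sh
# Sketch: put the uid:gid of each access point into a Helm values file.
# ASSUMPTION: the chart exposes server.securityContext and
# alertmanager.securityContext; check `helm show values` for your version.
VALUES_FILE="${VALUES_FILE:-efs-values.yaml}"   # hypothetical file name

cat > "$VALUES_FILE" <<'EOF'
server:
  persistentVolume:
    storageClass: efs-sc
  securityContext:
    runAsUser: 500      # matches the prometheuserver access point
    runAsGroup: 500
    fsGroup: 500
alertmanager:
  persistentVolume:
    storageClass: efs-sc
  securityContext:
    runAsUser: 501      # matches the prometheusalertmanager access point
    runAsGroup: 501
    fsGroup: 501
EOF

echo "wrote $VALUES_FILE"
```

The file would then be passed to the same helm upgrade --install command with -f efs-values.yaml in place of the --set flags.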

