Permissions error trying to run Prometheus on AWS EKS (Fargate only) with EFS
I have a Fargate-only EKS cluster; I really don't want to manage instances myself. I want to deploy Prometheus to it, which requires a persistent volume. As of two months ago this should be possible with EFS (managed NFS shares). I feel like I'm nearly there, but I can't work out what the current problem is.
What I've done:
- Set up an EKS Fargate cluster with a suitable Fargate profile
- Set up EFS with appropriate security groups
- Installed the CSI driver and verified EFS according to the AWS walkthrough

So far, so good.
I set up the persistent volumes (which, as far as I can tell, must be provisioned statically):
kubectl apply -f pvc/
where
tree pvc/
pvc/
├── two_pvc.yml
└── ten_pvc.yml
and
cat pvc/*
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-two
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-ten
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234
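(These PVs reference the `efs-sc` storage class from the AWS walkthrough; for completeness, a minimal version of that class looks something like this:)

```yaml
# Minimal StorageClass for statically provisioned EFS volumes,
# as used in the AWS EFS CSI driver walkthrough.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
```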
Then:
helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
--namespace prometheus \
--set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"
What happens?
The PVC for the Prometheus alertmanager works fine, as do the other pods in this deployment, but the Prometheus server goes into CrashLoopBackOff with:
invalid capacity 0 on filesystem
Diagnostics:
kubectl get pv -A
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
efs-pv-ten 8Gi RWO Retain Bound prometheus/myrelease-helm-02-prometheus-server efs-sc 11m
efs-pv-two 2Gi RWO Retain Bound prometheus/myrelease-helm-02-prometheus-alertmanager efs-sc 11m
and
kubectl get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus myrelease-helm-02-prometheus-alertmanager Bound efs-pv-two 2Gi RWO efs-sc 12m
prometheus myrelease-helm-02-prometheus-server Bound efs-pv-ten 8Gi RWO efs-sc 12m
describe pod just shows 'error'.
Finally, this (from a colleague):
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:346 msg="Starting Prometheus" version="(version=2.21.0, branch=HEAD, revision=e83ef207b6c2398919b69cd87d2693cfc2fb4127)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:347 build_context="(go=go1.15.2, user=root@a4d9bea8479e, date=20200911-11:35:02)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:348 host_details="(Linux 4.14.193-149.317.amzn2.x86_64 #1 SMP Thu Sep 3 19:04:44 UTC 2020 x86_64 myrelease-helm-02-prometheus-server-85765f9895-vxrkn (none))"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:349 fd_limits="(soft=1024, hard=4096)"
level=info ts=2020-10-09T15:17:08.898Z caller=main.go:350 vm_limits="(soft=unlimited, hard=unlimited)"
level=error ts=2020-10-09T15:17:08.901Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/data/queries.active err="open /data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker(0x7fffeb6e85ee, 0x5, 0x14, 0x30ca080, 0xc000d43620, 0x30ca080)
/app/promql/query_logger.go:117 +0x4cf
main.main()
/app/cmd/prometheus/main.go:377 +0x510c
Beyond the fact that there's clearly some permissions problem, I'm confused: I know the storage 'works' and is accessible, since another pod in the deployment seems perfectly happy with it, yet this one isn't.
Working now; written up here for the common good. Thanks to /u/EmiiKhaos on reddit for the suggestion of where to look.
Problem:
The EFS share is root:root only, and Prometheus refuses to run its pods as root.
Solution:
- Create an EFS access point for each pod that needs a persistent volume, allowing access by a specified user
- Specify those access points in the persistent volumes
- Apply a suitable security context so the pods run as the matching user
Method:
Create two EFS access points, e.g.:
{
    "Name": "prometheuserver",
    "AccessPointId": "fsap-<hex01>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 500,
        "Gid": 500,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheuserver",
        "CreationInfo": {
            "OwnerUid": 500,
            "OwnerGid": 500,
            "Permissions": "0755"
        }
    }
},
{
    "Name": "prometheusalertmanager",
    "AccessPointId": "fsap-<hex02>",
    "FileSystemId": "fs-ec0e1234",
    "PosixUser": {
        "Uid": 501,
        "Gid": 501,
        "SecondaryGids": [
            2000
        ]
    },
    "RootDirectory": {
        "Path": "/prometheusalertmanager",
        "CreationInfo": {
            "OwnerUid": 501,
            "OwnerGid": 501,
            "Permissions": "0755"
        }
    }
}
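For reference, access points like these can be created with the AWS CLI. A sketch, assuming the filesystem ID and uid/gid values from the JSON above (the access point ID is generated for you; the "Name" shown in describe output comes from a Name tag):

```shell
# Create an access point that maps all NFS access to uid/gid 500,
# rooted at /prometheuserver (created as 0755, owned 500:500, if absent).
aws efs create-access-point \
  --file-system-id fs-ec0e1234 \
  --posix-user 'Uid=500,Gid=500,SecondaryGids=2000' \
  --root-directory 'Path=/prometheuserver,CreationInfo={OwnerUid=500,OwnerGid=500,Permissions=0755}' \
  --tags Key=Name,Value=prometheuserver
```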
Updated my persistent volumes:
kubectl apply -f pvc/
with something like:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusalertmanager
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex02>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheusserver
spec:
  capacity:
    storage: 8Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-ec0e1234::fsap-<hex01>
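Note that a PV's volume source is immutable once created, which is presumably why these are new PVs rather than in-place edits of efs-pv-two and efs-pv-ten. Once nothing is bound to the originals, they can be cleaned up with something like:

```shell
# Remove the original PVs; reclaim policy Retain means the
# underlying EFS data itself is untouched.
kubectl delete pv efs-pv-two efs-pv-ten
```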
Re-installed Prometheus as before:
helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
--namespace prometheus \
--set alertmanager.persistentVolume.storageClass="efs-sc",server.persistentVolume.storageClass="efs-sc"
Made an educated guess, based on
kubectl describe pod myrelease-helm-02-prometheus-server -n prometheus
and
kubectl describe pod myrelease-helm-02-prometheus-alertmanager -n prometheus
as to which container to specify when setting the security context, then applied a security context with the appropriate uid:gid to the running pods, e.g. with
kubectl apply -f setpermissions/
where
cat setpermissions/*
gives
apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-alertmanager
spec:
  securityContext:
    runAsUser: 501
    runAsGroup: 501
    fsGroup: 501
  volumes:
    - name: prometheusalertmanager
      persistentVolumeClaim:
        claimName: myrelease-helm-02-prometheus-alertmanager
  containers:
    - name: prometheusalertmanager
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 501
        allowPrivilegeEscalation: false

apiVersion: v1
kind: Pod
metadata:
  name: myrelease-helm-02-prometheus-server
spec:
  securityContext:
    runAsUser: 500
    runAsGroup: 500
    fsGroup: 500
  volumes:
    - name: prometheusserver
      persistentVolumeClaim:
        claimName: myrelease-helm-02-prometheus-server
  containers:
    - name: prometheusserver
      image: jimmidyson/configmap-reload:v0.4.0
      securityContext:
        runAsUser: 500
        allowPrivilegeEscalation: false
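As an aside, the same effect can probably be achieved at install time rather than by patching pods afterwards, since the prometheus-community chart exposes securityContext through its values. A sketch; the exact value keys are assumed from the chart's values.yaml:

```shell
# Set the pod security contexts via Helm so they survive re-deploys.
helm upgrade --install myrelease-helm-02 prometheus-community/prometheus \
  --namespace prometheus \
  --set server.persistentVolume.storageClass="efs-sc" \
  --set server.securityContext.runAsUser=500 \
  --set server.securityContext.runAsGroup=500 \
  --set server.securityContext.fsGroup=500 \
  --set alertmanager.persistentVolume.storageClass="efs-sc" \
  --set alertmanager.securityContext.runAsUser=501 \
  --set alertmanager.securityContext.runAsGroup=501 \
  --set alertmanager.securityContext.fsGroup=501
```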