How to define local persistent volumes in a Kubernetes StatefulSet?

In my Kubernetes cluster I want to define a StatefulSet that uses a local persistent volume on each node. My Kubernetes cluster has three worker nodes.

My StatefulSet looks like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    spec:
     ....
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myset
              topologyKey: kubernetes.io/hostname
      containers:
     ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
        - "ReadWriteOnce"
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 10Gi

I want to achieve that each pod, running on a separate node, uses a local data volume.

I defined a StorageClass object:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

and the following PersistentVolume:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: datadir
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /var/lib/my-data/
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1

But of course this does not work, because the nodeAffinity I defined only contains the hostname of my first node, worker-node-1. As a result I only see one PV. The PVC and the pod on the corresponding node start fine, but on the other two nodes I have no PV. How can I define that a local PersistentVolume is created for each worker node?

I also tried to define a nodeAffinity with 3 values:

  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1
          - worker-node-2
          - worker-node-3

But this did not work either.

I would suggest not using nodeAffinity in the PV definition, but rather a podAntiAffinity rule in the StatefulSet definition to deploy your application, so that no two instances end up on the same host.

So you would have a StatefulSet definition similar to this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myset
spec:
  replicas: 3
  ...
  template:
    metadata:
      labels:
        sts: myset
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: sts
                operator: In
                values:
                - myset
            topologyKey: "kubernetes.io/hostname"
      containers:
     ....
        volumeMounts:
        - name: datadir
          mountPath: /data
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir

Reference: An example of a pod that uses pod affinity
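
With this anti-affinity rule in place you can quickly confirm that the replicas really end up on different hosts; for example (the label sts=myset comes from the template above):

kubectl get pods -l sts=myset -o wide   # the NODE column should show a different node for each replica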

I fear that the PersistentVolume I define is the problem. This object will create exactly one PV, so only one of my pods finds the corresponding PV and can be scheduled.

Yes, you are right. By creating a PersistentVolume object you create exactly one PersistentVolume. No more, no less. If you define 3 separate PVs that can be used, one on each of your 3 nodes, you shouldn't encounter any problem.

Assuming you have 3 worker nodes, you need to create 3 separate PersistentVolumes, each with its own nodeAffinity. You don't need to define any node affinity in your StatefulSet, as it is already handled at the PersistentVolume level and should be defined only there.

As you can read in the local volume documentation:

Compared to hostPath volumes, local volumes are used in a durable and portable manner without manually scheduling pods to nodes. The system is aware of the volume's node constraints by looking at the node affinity on the PersistentVolume.

Remember: the PVC -> PV mapping is always 1:1. You cannot bind one PVC to 3 different PVs, nor the other way around.
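
You can see this 1:1 binding directly on the cluster; for example (these are plain kubectl queries, nothing specific to this setup is assumed):

kubectl get pv    # the CLAIM column shows the single PVC each PV is bound to
kubectl get pvc   # the VOLUME column shows the single PV each claim is bound to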

So my only solution is to switch from local PVs to hostPath volumes, which is working fine.

Yes, it can be done with hostPath, but I wouldn't call that the only and the best solution. Local volumes have several advantages over hostPath volumes that make them worth considering. But as I mentioned above, in your use case you need to create 3 separate PVs manually. You have already created one PV, so creating two more shouldn't be a big deal. This is the way to go.

I want to achieve that each pod, running on a separate node, uses a local data volume.

It can be achieved with local volumes, but in that case, instead of using a single PVC in your StatefulSet definition, as shown in the following fragment of your configuration:

  volumes:
  - name: datadir
    persistentVolumeClaim:
      claimName: datadir

you just need to use volumeClaimTemplates, as in this example, which may look as follows:

  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

As you can see, the PVCs won't "look" for a PV with any particular name, so you can name them whatever you want. They will "look" for a PV that belongs to a particular StorageClass and, in this specific case, supports the "ReadWriteOnce" accessMode. The StatefulSet controller creates one such PVC per replica from the template, so every pod gets its own claim.

The scheduler will try to find a suitable node on which your stateful pod can be scheduled. If another pod has already been scheduled on, let's say, worker-1, and the only PV on that node belonging to our local-storage storage class is no longer available, the scheduler will try to find another node that satisfies the storage requirements. So again: no node affinity / pod anti-affinity rules are needed in your StatefulSet definition.
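
Because the StorageClass uses volumeBindingMode: WaitForFirstConsumer, each claim stays Pending until its pod is scheduled and only then binds to a local PV on that node. You can watch this happen; the claim name below is derived from the naming pattern <template-name>-<statefulset-name>-<ordinal>, i.e. datadir-myset-0 for the template and StatefulSet names used in the question:

kubectl get pvc -w                      # claims switch from Pending to Bound once their pod is scheduled
kubectl describe pvc datadir-myset-0    # the events explain why a claim is still Pending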

But I need some mechanism so that a PV is created for each node and assigned to the pods created by the StatefulSet. But this did not work - I always have only one PV.

To facilitate volume management and automate the whole process to some extent, take a look at the Local Persistence Volume Static Provisioner. As its name suggests, it doesn't support dynamic provisioning (as we have on various cloud platforms), which means you are still responsible for creating the underlying storage, but the whole volume lifecycle can be handled automatically.
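
As a rough illustration, the provisioner is driven by a ConfigMap that maps a StorageClass to a directory that is scanned for volumes on every node. The sketch below is only an assumption of how such a config could look for this use case (the class name local-storage is taken from the question, the discovery path /mnt/local-disks is made up), and the exact schema may differ between provisioner versions, so verify it against the project's README:

apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:                 # PVs discovered under hostDir are created in this StorageClass
      hostDir: /mnt/local-disks    # directory on each node that the provisioner scans
      mountDir: /mnt/local-disks   # where that directory is mounted inside the provisioner pod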

To make the whole theoretical explanation a bit more practical, here is a working example which you can quickly test yourself. Make sure the /var/tmp/test directory is created on every node, or adjust the example below to your needs:

The StatefulSet components (the example from here, slightly modified):

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 1Gi

The StorageClass definition:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer

And finally the PVs. You need to create 3 versions of the following yaml manifest, each with a different name, e.g. example-pv-1, example-pv-2 and example-pv-3, and the corresponding node name.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv-1 ###  change it
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/tmp/test ###  you can adjust shared directory on the node 
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-1 ###  change this value by setting your node name

So for the 3 worker nodes there are 3 different PVs.
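
If writing three nearly identical manifests by hand feels tedious, a small shell loop can template them. This is just a sketch under the assumption that your nodes are really named worker-node-1, worker-node-2 and worker-node-3 and that /var/tmp/test exists on each of them:

# creates example-pv-1/2/3, each pinned to the matching worker node
for i in 1 2 3; do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv-${i}
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/tmp/test
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-node-${i}
EOF
done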