Chaostoolkit experiment failing when run as a Job from within a k8s cluster
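I'm using chaostoolkit and can run chaos experiments successfully from the command line. However, when I try to run the same experiment as a Job in k8s, it throws a 'connection refused' error. What I find strange is that sometimes the steady-state hypothesis step runs successfully and returns 200 OK while the terminate-pod action then fails, but many times it fails in the hypothesis step itself (before the terminate-pod action). By the way, I'm doing this in Google Cloud.
In some runs, both the hypothesis before the action and the pod termination succeed, but the hypothesis after the action (the termination) gets a 'connection refused' error.
Any help or tips would be appreciated.
Here is the error message: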
[2022-02-03 07:24:54 DEBUG] [caching:35] Cached 2 activities
[2022-02-03 07:24:54 INFO] [experiment:54] Validating the experiment's syntax
[2022-02-03 07:24:54 DEBUG] [configuration:47] Loading configuration...
[2022-02-03 07:24:54 DEBUG] [secret:74] Loading secrets...
[2022-02-03 07:24:54 DEBUG] [secret:89] Secrets loaded
[2022-02-03 07:25:12 INFO] [experiment:103] Experiment looks valid
[2022-02-03 07:25:12 DEBUG] [caching:42] Clearing activities cache
[2022-02-03 07:25:12 DEBUG] [caching:25] Building activity cache...
[2022-02-03 07:25:12 DEBUG] [caching:35] Cached 2 activities
[2022-02-03 07:25:12 INFO] [experiment:182] Running experiment: What happens if we terminate an instance of the application?
[2022-02-03 07:25:12 DEBUG] [configuration:47] Loading configuration...
[2022-02-03 07:25:12 DEBUG] [secret:74] Loading secrets...
[2022-02-03 07:25:12 DEBUG] [secret:89] Secrets loaded
[2022-02-03 07:25:12 DEBUG] [__init__:39] Initializing controls
[2022-02-03 07:25:12 DEBUG] [__init__:355] No controls to apply on 'experiment'
[2022-02-03 07:25:12 INFO] [hypothesis:184] Steady state hypothesis: The app is healthy
[2022-02-03 07:25:12 DEBUG] [__init__:355] No controls to apply on 'hypothesis'
[2022-02-03 07:25:12 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 07:25:12 INFO] [activity:160] Probe: app-responds-to-requests
[2022-02-03 07:25:12 DEBUG] [activity:233] Activity failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 156, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
Here is the configuration I supply to the Job:
health-http.yaml: |
  version: 1.0.0
  title: What happens if we terminate an instance of the application?
  description: If an instance of the application is terminated, the applications as a whole should still be operational.
  tags:
  - k8s
  - pod
  steady-state-hypothesis:
    title: The app is healthy
    probes:
    - name: app-responds-to-requests
      type: probe
      tolerance: 200
      provider:
        type: http
        timeout: 10
        verify_tls: false
        url: http://newapp
        headers:
          Host: newapp.example.com
  method:
  - type: action
    name: terminate-app-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=newapp
        rand: true
        ns: default
    pauses:
      after: 2
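I was able to ssh into a dummy nginx pod and run 'curl newapp', and it returns the correct response, so the service is definitely alive and working. The service account I created has get, list, and delete permissions on pods, among other permissions.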
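Because the failure is intermittent, a single curl may not reveal it. The manual check can be repeated in a loop to see how often the connection is refused. The following is a minimal sketch, assuming it is run from a pod in the same namespace as the service; the names probe_once and probe_many are illustrative and not part of chaostoolkit, and it uses only the Python standard library rather than the HTTP client the probe itself uses.

```python
import time
import urllib.error
import urllib.request

# Ignore any proxy settings from the environment so the request goes
# straight to the service, as an in-cluster probe would.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_once(url, host_header=None, timeout=10):
    """Return the HTTP status code, 'refused', or an 'error: ...' string."""
    request = urllib.request.Request(url)
    if host_header:
        # Mirrors the Host header set in the experiment's probe.
        request.add_header("Host", host_header)
    try:
        with _opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status.
        return exc.code
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        return f"error: {exc.reason}"
    except OSError as exc:
        # For example a read timeout after the connection was established.
        return f"error: {exc}"


def probe_many(url, attempts=20, delay=0.5, host_header=None, timeout=10):
    """Probe the URL repeatedly and count how often each outcome occurs."""
    tally = {}
    for _ in range(attempts):
        outcome = probe_once(url, host_header=host_header, timeout=timeout)
        tally[outcome] = tally.get(outcome, 0) + 1
        time.sleep(delay)
    return tally


# Example call from inside the cluster (values taken from the experiment):
# print(probe_many("http://newapp", host_header="newapp.example.com"))
```

A tally that mixes 200 and 'refused' would suggest that only some of the backends behind the service accept connections on the target port.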
Here is the experiment manifest:
apiVersion: batch/v1
kind: Job
metadata:
  name: newapp-chaos
spec:
  activeDeadlineSeconds: 600
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: newapp
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      serviceAccountName: newapp-chaos
      restartPolicy: Never
      containers:
      - name: chaostoolkit
        image: vfarcic/chaostoolkit:1.4.1-2
        args:
        - --verbose
        - run
        - /experiment/health-http.yaml
        env:
        - name: CHAOSTOOLKIT_IN_POD
          value: "true"
        volumeMounts:
        - name: config
          mountPath: /experiment
          readOnly: true
        resources:
          limits:
            cpu: 20m
            memory: 64Mi
          requests:
            cpu: 20m
            memory: 64Mi
      volumes:
      - name: config
        configMap:
          name: newapp-config
Here is my app manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: newapp-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: newapp
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: newapp
        version: v2
    spec:
      containers:
      - image: rstarmer/hostname:v2
        imagePullPolicy: Always
        name: newapp
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: newapp
  name: newapp
spec:
  #externalTrafficPolicy: Cluster
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: newapp
  sessionAffinity: None
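One detail in these manifests may be worth checking: the Job's pod template carries the label app: newapp, which is the same label used by the Service selector and by the label_selector passed to terminate_pods. The sketch below illustrates how an equality-based label selector matches pods, using the labels copied from the manifests. The matches helper and the pod names are illustrative only, and whether this overlap contributed to the errors is not confirmed anywhere in this post.

```python
def matches(selector, labels):
    """Equality-based selector: every key/value pair in the selector
    must be present, with the same value, in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels copied from the pod templates in the manifests.
pods = {
    "newapp-v2 (Deployment pod)": {"app": "newapp", "version": "v2"},
    "newapp-chaos (Job pod)": {"app": "newapp"},
}

# Selector used by the Service and, as app=newapp, by terminate_pods.
service_selector = {"app": "newapp"}

selected = [name for name, labels in pods.items()
            if matches(service_selector, labels)]
print(selected)
```

Both entries are selected here, so, assuming everything is deployed in the same namespace, the chaos pod itself would be a candidate backend for the newapp Service and a candidate target for terminate_pods. Giving the Job's pod template a different label would rule this out.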
Here is the output from a run where the termination also succeeded but the error occurred afterwards:
[2022-02-03 09:43:22 INFO] [hypothesis:184] Steady state hypothesis: The app is healthy
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'hypothesis'
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 09:43:22 INFO] [activity:160] Probe: app-responds-to-requests
[2022-02-03 09:43:22 DEBUG] [activity:179] => succeeded with '{'status': 200, 'headers': {'Server': 'nginx/1.15.4', 'Date': 'Thu, 03 Feb 2022 09:43:22 GMT', 'Content-Type': 'text/html', 'Content-Length': '208', 'Last-Modified': 'Thu, 03 Feb 2022 07:21:47 GMT', 'Connection': 'keep-alive', 'ETag': '"61fb828b-d0"', 'Accept-Ranges': 'bytes'}, 'body': "<HTML>\n<HEAD>\n<TITLE>This page is on newapp-v2-866f8798cd-8s424 and is version v2</TITLE>\n</HEAD><BODY>\n<H1>THIS IS HOST newapp-v2-866f8798cd-8s424</H1>\n<H2>And we're running version: v2</H2>\n</BODY>\n</HTML>\n"}'
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 09:43:22 DEBUG] [hypothesis:212] allowed tolerance is 200
[2022-02-03 09:43:22 INFO] [hypothesis:222] Steady state hypothesis is met!
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'hypothesis'
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'method'
[2022-02-03 09:43:22 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 09:43:22 INFO] [activity:160] Action: terminate-app-pod
[2022-02-03 09:43:22 DEBUG] [python:34] Activity 'terminate-app-pod' loaded from '/usr/local/lib/python3.8/site-packages/chaosk8s/pod/actions.py'
[2022-02-03 09:43:23 DEBUG] [actions:193] Found 3 pods labelled 'app=newapp' in ns default
[2022-02-03 09:43:23 DEBUG] [activity:181] => succeeded without any result value
[2022-02-03 09:43:23 INFO] [activity:197] Pausing after activity for 2s...
[2022-02-03 09:43:25 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 09:43:25 DEBUG] [__init__:355] No controls to apply on 'method'
[2022-02-03 09:43:25 INFO] [hypothesis:184] Steady state hypothesis: The app is healthy
[2022-02-03 09:43:25 DEBUG] [__init__:355] No controls to apply on 'hypothesis'
[2022-02-03 09:43:25 DEBUG] [__init__:355] No controls to apply on 'activity'
[2022-02-03 09:43:25 INFO] [activity:160] Probe: app-responds-to-requests
[2022-02-03 09:43:25 DEBUG] [activity:233] Activity failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 156, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
After struggling with this for more than a day, I tried reinstalling Istio and it started working fine. It must have been a problem with Istio.