OpenShift allows you to run containerised workloads alongside virtual machines. One of the biggest benefits of containers is the ability to split large projects into small containers (microservices), which lets you develop and manage them independently of each other. One of the most important aspects of managing a service is scaling it according to the load, to keep its users happy. You can try to predict the load and provide enough resources to handle it, or you can rely on automated scaling. Manual scaling can be challenging and often leads to under- or over-capacity: a situation where you provide too few or too many resources to handle the load. Both situations can cost you reputation, budget or both. This is where automatic scaling comes to help! 🙂
OpenShift provides the following built-in autoscaling solutions:
- HorizontalPodAutoscaler (HPA) – adds or removes pods based on simple CPU/memory usage metrics
- VerticalPodAutoscaler (VPA) – updates the resource limits and requests according to historic and current CPU and memory usage
- Custom Metrics Autoscaler Operator – increases or decreases the number of pods based on custom metrics (other than just memory or CPU)
In this post I will focus on the first one, the HorizontalPodAutoscaler (HPA).
Metrics in OpenShift
Out of the box, OpenShift collects CPU and memory usage metrics from running workloads. You can easily view them with the oc adm top pods or oc describe PodMetrics commands, for instance:
$ oc adm top pods
NAME CPU(cores) MEMORY(bytes)
myapp-95bb75667-hq7fk 230m 15Mi
$ oc describe PodMetrics myapp-95bb75667-hq7fk
(...)
Containers:
Name: myapp
Usage:
Cpu: 230m
Memory: 15768Ki
(...)
Additionally you can see simple graphs in OpenShift WebUI:
Creating “MyApp” test workload
To have some fun with HPA, let’s create an example application. To make things easier I created the myapp image from the following simple Containerfile:
FROM fedora:38
RUN dnf install -y stress-ng pv
CMD ["/usr/bin/sleep", "infinity"]
$ podman build --arch x86_64 -t default-route-openshift-image-registry.apps.ocp4.openshift.one:443/rafal-hpa/myapp:v1.0 .
STEP 1/3: FROM fedora:38
STEP 2/3: RUN dnf install -y stress-ng pv
Fedora 38 - x86_64 4.5 MB/s | 83 MB 00:18
Fedora 38 openh264 (From Cisco) - x86_64 1.3 kB/s | 2.5 kB 00:01
Fedora Modular 38 - x86_64 1.3 MB/s | 2.8 MB 00:02
Fedora 38 - x86_64 - Updates 2.1 MB/s | 24 MB 00:11
Fedora Modular 38 - x86_64 - Updates 1.2 MB/s | 2.1 MB 00:01
Last metadata expiration check: 0:00:02 ago on Thu Jun 15 09:31:49 2023.
Dependencies resolved.
================================================================================
Package Architecture Version Repository Size
================================================================================
Installing:
pv x86_64 1.6.20-6.fc38 fedora 66 k
stress-ng x86_64 0.15.06-1.fc38 fedora 2.4 M
Installing dependencies:
Judy x86_64 1.0.5-31.fc38 fedora 132 k
libbsd x86_64 0.11.7-4.fc38 fedora 112 k
libmd x86_64 1.0.4-3.fc38 fedora 39 k
lksctp-tools x86_64 1.0.19-3.fc38 fedora 92 k
Transaction Summary
================================================================================
Install 6 Packages
Total download size: 2.8 M
Installed size: 10 M
Downloading Packages:
(1/6): libbsd-0.11.7-4.fc38.x86_64.rpm 634 kB/s | 112 kB 00:00
(2/6): Judy-1.0.5-31.fc38.x86_64.rpm 414 kB/s | 132 kB 00:00
(3/6): libmd-1.0.4-3.fc38.x86_64.rpm 117 kB/s | 39 kB 00:00
(4/6): lksctp-tools-1.0.19-3.fc38.x86_64.rpm 612 kB/s | 92 kB 00:00
(5/6): pv-1.6.20-6.fc38.x86_64.rpm 2.1 MB/s | 66 kB 00:00
(6/6): stress-ng-0.15.06-1.fc38.x86_64.rpm 4.5 MB/s | 2.4 MB 00:00
--------------------------------------------------------------------------------
Total 1.8 MB/s | 2.8 MB 00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : lksctp-tools-1.0.19-3.fc38.x86_64 1/6
Installing : libmd-1.0.4-3.fc38.x86_64 2/6
Installing : libbsd-0.11.7-4.fc38.x86_64 3/6
Installing : Judy-1.0.5-31.fc38.x86_64 4/6
Installing : stress-ng-0.15.06-1.fc38.x86_64 5/6
Installing : pv-1.6.20-6.fc38.x86_64 6/6
Running scriptlet: pv-1.6.20-6.fc38.x86_64 6/6
Verifying : Judy-1.0.5-31.fc38.x86_64 1/6
Verifying : libbsd-0.11.7-4.fc38.x86_64 2/6
Verifying : libmd-1.0.4-3.fc38.x86_64 3/6
Verifying : lksctp-tools-1.0.19-3.fc38.x86_64 4/6
Verifying : pv-1.6.20-6.fc38.x86_64 5/6
Verifying : stress-ng-0.15.06-1.fc38.x86_64 6/6
Installed:
Judy-1.0.5-31.fc38.x86_64 libbsd-0.11.7-4.fc38.x86_64
libmd-1.0.4-3.fc38.x86_64 lksctp-tools-1.0.19-3.fc38.x86_64
pv-1.6.20-6.fc38.x86_64 stress-ng-0.15.06-1.fc38.x86_64
Complete!
--> cb48d092d119
STEP 3/3: CMD ["/usr/bin/sleep", "infinity"]
COMMIT myapp:v1.0
--> af37513c7c54
Successfully tagged localhost/myapp:v1.0
af37513c7c542a624b74f578b0ec3a54ac63b3e9b01e27019e8a67e718d2eb07
Push it to my OpenShift’s internal registry:
$ podman push default-route-openshift-image-registry.apps.ocp4.openshift.one:443/rafal-hpa/myapp:v1.0
Getting image source signatures
Copying blob sha256:fb7b7e1a70dd14da904e1a241e8ed152ed9cc7153c9bd16f95db33e77891da6b
Copying blob sha256:dda8af7b00b7fe3b1c22f7b09aace1a2d0a32018905f3beaaa53f45ad97a3646
Copying config sha256:af37513c7c542a624b74f578b0ec3a54ac63b3e9b01e27019e8a67e718d2eb07
Writing manifest to image destination
Storing signatures
So now I can create a Deployment from it:
$ oc new-app --image default-route-openshift-image-registry.apps.ocp4.openshift.one/rafal-hpa/myapp:v1.0 --name myapp --insecure-registry=true
--> Found container image af37513 (2 hours old) from default-route-openshift-image-registry.apps.ocp4.openshift.one for "default-route-openshift-image-registry.apps.ocp4.openshift.one/rafal-hpa/myapp:v1.0"
* An image stream tag will be created as "myapp:v1.0" that will track this image
--> Creating resources ...
deployment.apps "myapp" created
--> Success
Run 'oc status' to view your app.
At this stage I have a single container running, controlled by the Deployment:
$ oc get pods,deployments
NAME READY STATUS RESTARTS AGE
pod/myapp-77dbd8bc94-v2fjc 1/1 Running 0 94s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/myapp 1/1 1 1 95s
Since it is very small and simple, it does not use any significant amount of resources for now:
$ oc adm top pods
NAME CPU(cores) MEMORY(bytes)
myapp-77dbd8bc94-v2fjc 0m 0Mi
$ oc describe podmetrics myapp-77dbd8bc94-v2fjc
Name: myapp-77dbd8bc94-v2fjc
Namespace: rafal-hpa
Containers:
Name: myapp
Usage:
Cpu: 0
Memory: 328Ki
Kind: PodMetrics
Events: <none>
Requests and Limits
For each compute resource, a container may specify a resource Request and Limit.
Requests are used during scheduling to find a suitable compute node that can provide the requested amount of resources (CPU, memory). They are also used by the HorizontalPodAutoscaler to calculate current resource usage relative to the request, expressed as a percentage. A container can go above the values described by its Requests, but the extra resources are not guaranteed, so it may not be able to get more CPU or memory on its current node.
Limits, on the other hand, specify the maximum amount of resources (CPU, memory) that a container may consume. A container won’t be able to use more than its Limit specifies.
If you configure Limits but omit Requests, the Request will be automatically set to the Limit value.
You have to set Requests if you want the HorizontalPodAutoscaler to scale your application based on the percentage of resource usage. They are not required, though, if you use an exact value to describe memory or CPU usage, such as 500m (millicores) or 256Mi (mebibytes).
A more detailed definition of Requests and Limits can be found in this document: Resource requests and overcommitment
For my example MyApp I will set requests to 500m of CPU and 128Mi of memory, and limits to 1000m of CPU and 256Mi of memory. These settings mean that the scheduler will try to find a node with at least 500m of CPU and 128Mi of memory available; the same values are also taken into account when an HPA is configured to use a percentage (%) of the requested values. Additionally, my example MyApp will be capped at 1000m of CPU (one core) and 256Mi of memory.
$ oc set resources deployment myapp --requests=cpu=500m,memory=128Mi --limits=cpu=1000m,memory=256Mi
deployment.apps/myapp resource requirements updated
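To make the relation between requests and percentage targets concrete, here is a minimal Python sketch that converts the quantities used above into plain numbers and works out what a 50% target (the value used later in this post) means in absolute terms. The helper names are hypothetical and only the suffixes that appear in this post (m, Ki, Mi, Gi) are handled; it is an illustration, not how OpenShift parses quantities internally.

```python
# Hypothetical helpers for illustration; only the suffixes used in this post are handled.

def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '1' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory_mib(quantity: str) -> float:
    """Convert a memory quantity with a Ki, Mi or Gi suffix into MiB."""
    factors = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    return float(quantity[:-2]) * factors[quantity[-2:]]


def threshold(request: float, percent: int) -> float:
    """Absolute usage per pod that corresponds to a percentage of the request."""
    return request * percent / 100


cpu_request = parse_cpu_millicores("500m")
mem_request = parse_memory_mib("128Mi")

print(threshold(cpu_request, 50))  # 250.0 (millicores)
print(threshold(mem_request, 50))  # 64.0 (MiB)
```

So with these requests, a 50% target corresponds to an average of 250m of CPU or 64Mi of memory per pod.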
Autoscaling based on CPU usage
The OpenShift CLI (oc) comes with a handy command that sets up CPU-based horizontal pod autoscaling straight from the command line (please note that it can currently only configure CPU-based HPA; no option for memory is available at the time of writing, with oc CLI version 4.13.4):
oc autoscale (-f FILENAME | TYPE NAME | TYPE/NAME) [--min=MINPODS] --max=MAXPODS [--cpu-percent=CPU] [options]
Bearing in mind that in the previous step I configured a CPU request of 500m, I want to scale out my deployment when the average CPU usage across all running pods goes above 50% of that, so 250m. I also want to ensure that at least two replicas of my application are running for availability reasons, and that it won’t go above 10 replicas.
$ oc autoscale deployment myapp --min=2 --max=10 --cpu-percent=50
horizontalpodautoscaler.autoscaling/myapp autoscaled
Immediately after running the command above, OpenShift starts an additional copy of my pod to satisfy the requirement of at least 2 running replicas:
$ oc get pods
NAME READY STATUS RESTARTS AGE
myapp-6677fd6f55-pdq44 1/1 Running 0 27s
myapp-66d77bbf56-5g9lj 1/1 Running 0 20m
After a short while it will also start monitoring CPU usage and report it in the HPA object:
$ oc get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 0%/50% 2 10 2 2m45s
$ oc describe hpa
Name: myapp
Namespace: rafal-hpa
Labels: <none>
Annotations: <none>
CreationTimestamp: Wed, 21 Jun 2023 14:58:20 +0200
Reference: Deployment/myapp
Metrics: ( current / target )
resource cpu on pods (as a percentage of request): 0% (0) / 50%
Min replicas: 2
Max replicas: 10
Deployment pods: 2 current / 2 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ScaleDownStabilized recent recommendations were higher than current one, applying the highest recent recommendation
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 2m32s horizontal-pod-autoscaler New size: 2; reason: Current number of replicas below Spec.MinReplicas
Since the pod I run just sleeps (remember CMD ["/usr/bin/sleep", "infinity"] from the Containerfile? What a life! 🙂 ) it reports 0% usage. Let’s put some load on it to wake up the HPA (please note the change of directory to /tmp, because stress-ng needs write permissions to the current directory):
$ oc rsh myapp-6677fd6f55-pdq44
sh-5.2$ cd /tmp
sh-5.2$ stress-ng -c 1
stress-ng: info: [19] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [19] dispatching hogs: 1 cpu
In another terminal, run oc get hpa -w to watch HPA status changes:
$ oc get hpa -w
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 0%/50% 2 10 2 44m
myapp Deployment/myapp 60%/50% 2 10 2 45m
myapp Deployment/myapp 60%/50% 2 10 3 45m
myapp Deployment/myapp 40%/50% 2 10 5 45m
myapp Deployment/myapp 46%/50% 2 10 5 45m
By default, the HPA scale-down policy waits 300 seconds before pods are removed. This avoids unnecessary ping-pong of adding and removing pods just because the load fluctuates a bit. For demo purposes I changed this default and set it to 15 seconds, so I don’t have to wait too long to see the HPA removing pods once the load decreases.
$ oc patch hpa myapp -p '{"spec": {"behavior": {"scaleDown": {"stabilizationWindowSeconds": 15 }}}}'
horizontalpodautoscaler.autoscaling/myapp patched
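The effect of the stabilisation window can be illustrated with a small Python model. It is a simplified sketch and not the controller's actual code: the class name and the timestamps are made up for illustration, and it only captures the idea reported by oc describe hpa earlier ("applying the highest recent recommendation"), namely that the HPA acts on the highest replica recommendation seen within the window.

```python
from collections import deque


class ScaleDownStabilizer:
    """Toy model: act on the highest recommendation seen within the window."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp in seconds, recommended replicas)

    def stabilize(self, now: int, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Forget recommendations that are older than the window.
        while self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


for window in (300, 15):
    stabilizer = ScaleDownStabilizer(window)
    # (seconds since start, raw recommendation): load stops after the first sample
    for now, raw in [(0, 5), (60, 2), (400, 2)]:
        print(window, now, raw, stabilizer.stabilize(now, raw))
```

With the 300 second window the recommendation made at 60 seconds is still held at 5 replicas and only drops to 2 once the old sample has aged out; with the 15 second window it drops to 2 already at 60 seconds.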
I cancelled the stress-ng process started earlier, so the load goes down and the HPA removes all the extra pods, keeping just 2 of them as requested.
myapp Deployment/myapp 48%/50% 2 10 5 46m
myapp Deployment/myapp 35%/50% 2 10 5 47m
myapp Deployment/myapp 0%/50% 2 10 2 47m
This concludes the CPU autoscaling exercise. Please remember that you can also track metrics and events using the WebUI or the oc get events -w command, among others.
Autoscaling based on Memory usage
Autoscaling based on memory works in a similar fashion to CPU-based autoscaling; however, the oc CLI tool does not provide an option to set it up straight from the command line. Therefore my approach here is to first generate an autoscaler with the minimum and maximum number of pods and then edit it to add memory-based scaling. For instance:
$ oc autoscale deployment myapp --min=2 --max=10 -o yaml --dry-run=client
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  creationTimestamp: null
  name: myapp
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
status:
  currentReplicas: 0
  desiredReplicas: 0
and change it by setting apiVersion to autoscaling/v2 (the spec.metrics field is not part of autoscaling/v1) and adding a spec.metrics section as below:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  creationTimestamp: null
  name: myapp
spec:
  maxReplicas: 10
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  metrics:
  - resource:
      name: memory
      target:
        averageUtilization: 50
        type: Utilization
    type: Resource
The above HPA example will trigger a scale-out of the myapp Deployment if the average memory utilisation across all pods managed by the deployment goes above 50% of the requested memory. Please remember to set memory requests accordingly. For this example I set them the same way as before:
$ oc set resources deployment myapp --requests=cpu=500m,memory=128Mi --limits=cpu=1000m,memory=256Mi
deployment.apps/myapp resource requirements updated
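If you prefer to generate the memory-based HPA definition rather than edit it by hand, a minimal Python sketch could look like the one below. The function name memory_hpa is hypothetical; the field names follow the manifest shown earlier, and the sketch uses apiVersion autoscaling/v2, since spec.metrics belongs to the v2 API. It prints JSON instead of YAML only to stay within the Python standard library.

```python
import json


def memory_hpa(name: str, min_replicas: int, max_replicas: int,
               average_utilization: int) -> dict:
    """Build a memory-based HPA definition for a Deployment with the same name."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "memory",
                        "target": {
                            "type": "Utilization",
                            # Percentage of the memory request, averaged over pods
                            "averageUtilization": average_utilization,
                        },
                    },
                }
            ],
        },
    }


print(json.dumps(memory_hpa("myapp", 2, 10, 50), indent=2))
```

Assuming the output is saved to a file such as hpa.json (a hypothetical name), it could then be applied with oc apply -f hpa.json.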
Let’s give it a try and allocate 200Mi of memory on one of the pods within myapp Deployment for 5 minutes:
$ oc get pods
NAME READY STATUS RESTARTS AGE
myapp-6677fd6f55-5622v 1/1 Running 0 7m
myapp-6677fd6f55-hxsrn 1/1 Running 0 7m
$ oc rsh myapp-6677fd6f55-5622v
sh-5.2$ cat <( </dev/zero head -c 200m) <(sleep 300) | tail
If you’re still watching the HPA, you should see it notice the increase in memory usage and scale out the Deployment by adding pods, either up to the upper limit or until average memory usage drops below 50% of the request. Then, after 300 seconds (5 minutes), when memory usage drops, it automatically scales the Deployment back in by reducing the number of pods (replicas).
$ oc get hpa myapp -w
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
myapp Deployment/myapp 0%/50% 2 10 2 10m
myapp Deployment/myapp 35%/50% 2 10 2 11m
myapp Deployment/myapp 79%/50% 2 10 2 11m
myapp Deployment/myapp 79%/50% 2 10 4 12m
myapp Deployment/myapp 53%/50% 2 10 4 12m
myapp Deployment/myapp 40%/50% 2 10 4 13m
myapp Deployment/myapp 40%/50% 2 10 4 14m
myapp Deployment/myapp 0%/50% 2 10 4 15m
myapp Deployment/myapp 0%/50% 2 10 2 15m
How does HorizontalPodAutoscaler calculate % of resource usage?
Until now we’ve been using percentage-based resource usage against CPU and memory requests, but I haven’t yet explained how it is calculated. Here is the rule:
$TOTAL_USED / $REQUESTED / $NUM_OF_PODS = used_resources%
For instance, in the last example I configured a memory request of 128Mi and 2 replicas, and then allocated 200Mi of memory, therefore:
200 / 128 / 2 = 0.78
and that gave us 79% memory usage (including the memory used by the “sleeping” pods). The same rule applies to CPU usage calculations.
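The rule above can be written as a short Python function. This is only a sketch of the arithmetic described in this post, with hypothetical names, and it counts just the 200Mi allocated by hand, so it lands slightly below the 79% that the HPA reported.

```python
def utilization_percent(total_used: float, requested_per_pod: float,
                        num_pods: int) -> float:
    """Average usage across pods, expressed as a percentage of the request."""
    return total_used / requested_per_pod / num_pods * 100


# 200Mi allocated in total, 128Mi requested per pod, 2 pods running
print(round(utilization_percent(200, 128, 2), 1))  # 78.1
```

The remaining difference comes from the small amount of memory the idle pods use on their own.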
How does HorizontalPodAutoscaler calculate number of required pods?
To calculate how many pods should be running to meet the HPA configuration criteria, the following rule is used:
new_number_of_pods = ceil( current_number_of_pods * ( currentMetricValue / desiredMetricValue ) )
where currentMetricValue, when averageUtilization is used, is calculated as the average resource usage across all the pods. To put it into an example:
Two pods running, 200Mi of memory in use, a request of 128Mi and a threshold of 50% give us:
- current_number_of_pods: 2
- currentMetricValue: 200 / 2 = 100
- desiredMetricValue: 128 * 0.5 = 64
- new_number_of_pods: ceil( 2 * ( 100 / 64 ) ) = ceil( 3.125 ) = 4
Therefore to address the HPA configuration requirement of 50% of requested memory being used on average, the HPA should scale-out the deployment to 4 total pods.
Once it is scaled out it looks as follows:
- current_number_of_pods: 4
- currentMetricValue: 200 / 4 = 50
- desiredMetricValue: 128 * 0.5 = 64
- new_number_of_pods: ceil( 4 * ( 50 / 64 ) ) = ceil( 3.125 ) = 4
So there is no need to scale out or scale in, since the number of pods is right to keep average usage at or below 50% of 128Mi.
If the memory usage drops at some point to, let’s say, 100Mi:
- current_number_of_pods: 4
- currentMetricValue: 100 / 4 = 25
- desiredMetricValue: 128 * 0.5 = 64
- new_number_of_pods: ceil( 4 * ( 25 / 64 ) ) = ceil( 1.5625 ) = 2
Therefore the deployment can be scaled back in to 2 pods.
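The three scenarios above can be replayed with a few lines of Python. This is a sketch of the rule as described in this post, with hypothetical function and parameter names; the clamp to the 2 to 10 replica range mirrors the --min and --max values used earlier, and other behaviour of the real controller (such as the stabilisation window mentioned before) is deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     desired_metric: float, min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """ceil(current * current/desired), kept within the min/max replica range."""
    raw = math.ceil(current_replicas * (current_metric / desired_metric))
    return max(min_replicas, min(max_replicas, raw))


target = 128 * 0.5  # 50% of the 128Mi request, so 64Mi per pod

# (running pods, total memory in use in Mi)
for replicas, total_used in [(2, 200), (4, 200), (4, 100)]:
    average = total_used / replicas
    print(replicas, total_used, desired_replicas(replicas, average, target))
# prints:
# 2 200 4
# 4 200 4
# 4 100 2
```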