This guide provides a step-by-step process to diagnose and resolve the high memory usage that causes NextGen Gateway pods to crash in a Kubernetes environment. It includes commands to check pod status, identify memory-related issues, and implement solutions to stabilize the pod.
Verify Memory Usage When a Pod Crashes Due to a Memory Issue
To verify memory usage in Kubernetes pods, make sure the metrics server is enabled in the Kubernetes cluster. The kubectl top command retrieves point-in-time snapshots of resource utilization for pods and nodes in your cluster.
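If kubectl top fails with an error such as "Metrics API not available", the metrics server is not running. As a quick check, assuming the standard deployment name metrics-server in the kube-system namespace:

$ kubectl get deployment metrics-server -n kube-system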
Use the following command to verify pod memory usage:
$ kubectl top pods
NAME           CPU(cores)   MEMORY(bytes)
nextgen-gw-0   48m          1375Mi
Use the following command to verify node memory usage:
$ kubectl top nodes
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
nextgen-gateway   189m         9%     3969Mi          49%
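Pod-level figures aggregate all containers in the pod. To see which container inside the gateway pod is consuming the memory, add the --containers flag to kubectl top (pod name taken from the example above):

$ kubectl top pod nextgen-gw-0 --containers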
NextGen Gateway Pod Crashes Due to High Memory Usage
The NextGen Gateway pod in a Kubernetes cluster crashes due to high memory usage.
Possible Causes
When a pod exceeds its allocated memory, the Kubernetes system automatically kills the process to protect the node’s stability, resulting in an “OOMKilled” (Out of Memory Killed) error. This is particularly critical for the NextGen Gateway, as it may affect the stability and monitoring capabilities of the OpsRamp platform.
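Before working through the full diagnosis below, you can query each container's last termination state directly. A minimal check using kubectl's JSONPath output, with the pod name assumed from the examples above:

$ kubectl get pod nextgen-gw-0 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'

A container that was OOM-killed reports the reason OOMKilled and exit code 137.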
Troubleshooting Steps
Follow these steps to diagnose and fix memory issues for the NextGen Gateway pod:
- Check the status of Kubernetes objects to determine whether the pods are running, for example with kubectl get pods.
- Use the following command to gather detailed information about the pod. This will include the status, restart count, and the reason for any previous restarts.
kubectl describe pod <pod_name>

For example:
$ kubectl describe pod nextgen-gw-0
- Look for memory-related termination reasons, such as OOMKilled, in the pod's container statuses and events.
Sample output:
vprobe:
    Container ID:   containerd://40c8585cf88dc7d0dd4e43560dc631ef559b0c92e6d5d429719a384aaea77777
    Image:          us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe:17.0.0
    Image ID:       us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe@sha256:8de1a98c3c14307fa4882c7e7422a1a4e4d507d2bbc454b53f905062b665e9d2
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 29 Jan 2024 12:01:30 +0530
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 29 Jan 2024 12:00:42 +0530
      Finished:     Mon, 29 Jan 2024 12:01:29 +0530
    Ready:          True
    Restart Count:  1
- Confirm the memory issue from the exit code.
- An exit code of 137 means the container was killed with SIGKILL (137 = 128 + 9); combined with the reason OOMKilled, it confirms the pod is crashing due to a memory issue.
- Fix the memory issue:
- Decrease the load on the NextGen Gateway by limiting the number of metrics it processes.
- Adjust the memory limits for the NextGen Gateway accordingly, as shown in the sketch after this list.
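How you raise the memory limit depends on how the gateway was deployed. Below is a minimal sketch using kubectl set resources, assuming the pod nextgen-gw-0 is managed by a StatefulSet named nextgen-gw (inferred from the pod name; confirm with kubectl get statefulsets) and that the memory-heavy container is vprobe, as in the sample output above. The 2Gi limit is illustrative; size it to your observed usage. If the gateway is managed by Helm, make the equivalent change in your values file instead, or the next upgrade may revert it.

$ kubectl get statefulsets
# Raise the memory request/limit on the vprobe container (values are illustrative)
$ kubectl set resources statefulset nextgen-gw --containers=vprobe --requests=memory=1Gi --limits=memory=2Gi

After the change rolls out, watch the pod restart and confirm that the restart count stops increasing and that kubectl top pods shows usage staying below the new limit:

$ kubectl get pods -w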