Introduction

NVIDIA Bright Cluster Manager offers fast deployment, and end-to-end management for heterogeneous high-performance computing (HPC) and AI server clusters at the edge in the data center, and in multi/hybrid-cloud environments. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands. It also supports CPU-based and NVIDIA GPU-accelerated systems, and enables orchestration with Kubernetes.

Monitoring Use cases

  • In case of any threshold breach or unexpected metric behavior, based on configurations the device monitoring helps to collect the metric values with respect to time and sends alerts to the intended customer team to act up immediately.
  • It helps the customer with smooth functioning of business with minimal or zero downtime in case of any infrastructure related issues occurring.

Resource Hierarchy

    NVIDIA BrightCluster Manager
    • NVIDIA BCM Head Node
    • NVIDIA BCM Virtual Node
    • NVIDIA BCM Physical Node
    • NVIDIA BCM Linux Server

Version History

Application VersionBug fixes / Enhancements
5.0.6Support adding the Root Resource UUID as a custom attribute for Nvidia Bright Cluster Manager app.
5.0.5Activity Log, Get Latest Metrics and Debugging Changes for Nvidia Bright Cluster Manager.
5.0.4Fix provided related to component level threshold alerting.
5.0.3Fixed component level threshold alert enabling and disabling.
5.0.2Resource Display Order changes and Sub-Category changes.
Click here to view the earlier version updates
Application VersionBug fixes / Enhancements
5.0.1Curated DashBoard, cache flush changes.
5.0.0Added new Native Type NVIDIA BCM Linux Server and Metrics.
4.0.0Added support for nfs-server metrics on NVIDIA BCM Head Node.
3.0.0Added support for nfs-mount metrics on NVIDIA BCM Physical Node
2.0.0Supported new metric "nvidia_bcm_cluster_smartHdaTemp".
1.0.0Initial Support.