Overview
Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and directly deploy them into a production-ready hosted environment.
SageMaker provides:
- An integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you do not have to manage servers.
- Common machine learning algorithms that are optimized to run efficiently against extremely large datasets in a distributed environment.
With native support for bring-your-own-algorithms and frameworks, Amazon SageMaker offers flexible distributed training options that adjust to your specific workflows. Deploy a model into a secure and scalable environment by launching it with a single click from the Amazon SageMaker console. Training and hosting are billed by minutes of usage, with no minimum fees and no upfront commitments.
Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth helps you build high-quality training datasets by using human workers, together with machine learning, to create labeled datasets.
See GroundTruth metrics.
Amazon SageMaker Training
An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. The training algorithm writes the values of these metrics to logs, which Amazon SageMaker monitors and sends to Amazon CloudWatch in real-time.
See Training metrics.
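As an illustration of how metric values can be pulled out of training logs, the following is a minimal sketch that applies name/regex pairs to a log line. SageMaker lets you declare metric definitions in this name/regex form for custom algorithms, but the log format, the metric names, and the `extract_metrics` helper below are hypothetical and only model the idea locally.

```python
import re

# Hypothetical metric definitions: each pairs a metric name with a regex
# whose first capture group holds the numeric value.
METRIC_DEFINITIONS = [
    {"Name": "train:error", "Regex": r"train_error=([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_accuracy=([0-9.]+)"},
]


def extract_metrics(log_line, definitions=METRIC_DEFINITIONS):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Assumed log line format, for illustration only.
sample_log = "epoch=3 train_error=0.127 val_accuracy=0.914"
metrics = extract_metrics(sample_log)
```

A line that matches none of the patterns yields an empty dictionary, so unrelated log output is ignored.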
Amazon SageMaker Endpoint
An Amazon SageMaker endpoint is created from the endpoint configuration specified in the request. Amazon SageMaker uses the endpoint to provision resources and deploy models.
Amazon SageMaker Transform Job
Use batch transform when you need to do the following:
- Preprocess datasets to remove noise or bias that interferes with training or inference from your dataset.
- Get inferences from large datasets.
- Run inference when you do not need a persistent endpoint.
- Associate input records with inferences to help interpret results.
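To make the last item concrete, here is a minimal local sketch of pairing each input record with its inference. It is not how SageMaker implements the join; the JSON Lines layout, the field names, and the `join_inputs_with_inferences` helper are assumptions made for illustration.

```python
import json


def join_inputs_with_inferences(input_lines, inference_lines):
    """Attach each inference to the input record at the same position."""
    if len(input_lines) != len(inference_lines):
        raise ValueError("inputs and inferences must have the same length")
    joined = []
    for raw_input, raw_inference in zip(input_lines, inference_lines):
        record = json.loads(raw_input)
        # Hypothetical key under which the inference is stored.
        record["inference"] = json.loads(raw_inference)
        joined.append(record)
    return joined


# Assumed example data: one JSON object per line.
inputs = ['{"id": 1, "text": "good"}', '{"id": 2, "text": "bad"}']
inferences = ['{"score": 0.93}', '{"score": 0.08}']
joined = join_inputs_with_inferences(inputs, inferences)
```

Keeping the original fields next to the inference makes it easier to trace a prediction back to the record that produced it.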
Setup
To set up the integration:
- Select SageMaker GroundTruth in AWS Integration Discovery Profile to discover AWS SageMaker GroundTruth.
- Select SageMaker Training in AWS Integration Discovery Profile to discover AWS SageMaker Training Job.
- Select SageMaker EndPoint in AWS Integration Discovery Profile to discover AWS SageMaker Endpoint.
- Select SageMaker Transform Job in AWS Integration Discovery Profile to discover AWS SageMaker Transform Job.
Event support
CloudTrail event support
- Supported (SageMaker GroundTruth, Training, Endpoint, Transform Job)
- Configurable in OpsRamp AWS Integration Discovery Profile.
CloudWatch alarm support
- Not Supported
Supported metrics
GroundTruth metrics
OpsRamp Metric | Description | Metric Display Name | Unit | Aggregation Type |
---|---|---|---|---|
aws_sagemaker_labelingjobs_ActiveWorkers | Number of workers on a private work team performing a labeling job. | ActiveWorkers | Count | Sum |
aws_sagemaker_labelingjobs_JobsSucceeded | Number of labeling jobs that succeeded. Use the Sum aggregation to get the total number of labeling jobs that succeeded. | JobsSucceeded | None | Sum |
aws_sagemaker_labelingjobs_DatasetObjectsAutoAnnotated | Number of dataset objects auto-annotated in a labeling job. | DatasetObjectsAutoAnnotated | Count | Max |
aws_sagemaker_labelingjobs_DatasetObjectsHumanAnnotated | Number of dataset objects annotated by a human in a labeling job. | DatasetObjectsHumanAnnotated | Count | Max |
aws_sagemaker_labelingjobs_DatasetObjectsLabelingFailed | Number of dataset objects that failed labeling in a labeling job. | DatasetObjectsLabelingFailed | Count | Max |
aws_sagemaker_labelingjobs_TotalDatasetObjectsLabeled | Number of dataset objects labeled successfully in a labeling job. | TotalDatasetObjectsLabeled | Count | Max |
aws_sagemaker_labelingjobs_JobsStopped | Number of labeling jobs that were stopped. | JobsStopped | Count | Sum |
Training metrics
OpsRamp Metric | Description | Metric Display Name | Unit | Aggregation Type |
---|---|---|---|---|
aws_sagemaker_trainingjobs_CPUUtilization | Percentage of CPU units used by the containers on an instance. | CPUUtilization | Percent | Average |
aws_sagemaker_trainingjobs_MemoryUtilization | Percentage of memory used by the containers on an instance. | MemoryUtilization | Percent | Average |
aws_sagemaker_trainingjobs_GPUUtilization | Percentage of GPU units used by the containers on an instance. | GPUUtilization | Percent | Average |
aws_sagemaker_trainingjobs_GPUMemoryUtilization | Percentage of GPU memory used by the containers on an instance. | GPUMemoryUtilization | Percent | Average |
aws_sagemaker_trainingjobs_DiskUtilization | Percentage of disk space used by the containers on an instance. | DiskUtilization | Percent | Average |
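
The Aggregation Type column in the tables above names how the datapoints in a collection interval are reduced to one value. The sketch below only illustrates the three types that appear in this document (Average, Sum, Max); the `aggregate` helper and the sample values are hypothetical, and the way OpsRamp performs the aggregation internally is not described here.

```python
from statistics import mean

# The aggregation types listed in the metric tables, mapped to reducers.
AGGREGATORS = {"Average": mean, "Sum": sum, "Max": max}


def aggregate(datapoints, aggregation_type):
    """Reduce a non-empty list of numeric datapoints to a single value."""
    if not datapoints:
        raise ValueError("at least one datapoint is required")
    if aggregation_type not in AGGREGATORS:
        raise ValueError(f"unsupported aggregation type: {aggregation_type}")
    return AGGREGATORS[aggregation_type](datapoints)


# Assumed CPUUtilization samples (percent) from one interval.
cpu_samples = [40.0, 60.0, 50.0]
cpu_average = aggregate(cpu_samples, "Average")
```

For example, a Percent metric such as CPUUtilization is averaged over the interval, while a count of stopped labeling jobs is summed.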