AWS Parallel Computing Service now supports Slurm 25.11

Amazon Web Services has updated its Parallel Computing Service (AWS PCS) to support Slurm version 25.11, adding several capabilities for managing high-performance computing (HPC) workloads. The update adds a Prometheus-compatible OpenMetrics endpoint that provides real-time visibility into jobs, nodes, and scheduling activity, so organizations can integrate HPC monitoring with their existing observability tools. The release also includes expedited re-queue, which automatically reschedules jobs affected by node failures at the highest priority, helping workloads recover more quickly from infrastructure issues.

The update also expands logging: AWS PCS can now send Slurm database daemon (slurmdbd) and REST API daemon (slurmrestd) logs to Amazon CloudWatch Logs, Amazon S3, or Amazon Data Firehose. These logs are intended to help administrators diagnose accounting issues and debug API integrations. AWS has also separated scheduler audit logs from operational logs, creating a dedicated log type that gives organizations independent control over ingestion and storage costs.

AWS PCS is a managed service for running and scaling HPC workloads on AWS using the Slurm workload manager. It aims to reduce the operational burden of maintaining HPC clusters by providing managed updates and built-in observability features, so researchers and engineers can focus on their computational work rather than infrastructure management.
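
For readers unfamiliar with what a Prometheus-compatible endpoint returns, the sketch below parses a small text-format metrics payload using only the Python standard library. It is an illustration under stated assumptions, not the AWS PCS or Slurm implementation: the metric names (`example_jobs_pending`, `example_nodes`), the label `state`, and the values are invented for this example, and the release summary does not list the metrics actually exposed. The parser handles only simple lines and ignores edge cases such as escaped quotes in label values.

```python
import re

# Hypothetical payload: metric names, labels, and values are invented for
# illustration and are NOT taken from AWS PCS or Slurm documentation.
SAMPLE = """\
# HELP example_jobs_pending Number of pending jobs (hypothetical metric)
# TYPE example_jobs_pending gauge
example_jobs_pending 12
# TYPE example_nodes gauge
example_nodes{state="idle"} 4
example_nodes{state="allocated"} 28
# EOF
"""

# metric_name, optional {label="value",...} block, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from metrics text.

    Simplified: skips comment/metadata lines and does not handle escaped
    quotes or closing braces inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP/TYPE metadata, and the EOF marker
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice, an existing Prometheus server or compatible agent would scrape the endpoint directly; a hand-rolled parser like this is mainly useful for quick checks or for understanding the data shape.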

Why It Matters

This update addresses pain points in HPC cluster management, particularly job recovery and observability. The expedited re-queue feature helps minimize the impact of hardware failures on long-running computational workloads, while the OpenMetrics integration lets organizations use familiar monitoring tools for HPC environments. The expanded logging makes troubleshooting easier in complex distributed computing environments, which matters for organizations running mission-critical scientific and engineering workloads.

Note

This summary is generated using AI analysis of the original press release. Always refer to the original source for complete details.