

This article describes the process of building a High Performance Computing (HPC) cluster using AWS ParallelCluster Service (PCS), from creating a custom Amazon Machine Image (AMI) to running jobs on a fully operational SLURM-based cluster.
The primary objective was to deploy an HPC environment that met specific requirements, including the use of Python 3.10, preinstalled dependencies, automated infrastructure provisioning, and scalable job scheduling. To achieve this, a custom AMI was built using Packer, followed by the deployment of a PCS cluster using Terraform. The entire workflow was automated with Terraform, GitHub Actions, and shell scripts.
After deployment, the cluster was validated by executing several SLURM commands to ensure that compute nodes, scheduling, and job execution were functioning correctly.
AWS ParallelCluster Service (PCS) is a managed service that enables high-performance computing (HPC) on AWS. It’s designed for running parallel workloads such as simulations, ML training, or large-scale data analysis.
Using PCS, you can deploy and manage HPC clusters without manually configuring compute nodes, networking or schedulers. It integrates seamlessly with services like Amazon S3 and AWS Batch, supporting complex workloads efficiently.
Traditionally, deploying and managing HPC clusters required deep expertise in cluster configuration, job scheduling, and infrastructure management.
AWS PCS abstracts much of this complexity by offering:
- Fully managed cluster orchestration
- Integration with the SLURM scheduler
- Elastic scaling based on job demand
- Infrastructure as Code (IaC) support with Terraform and CloudFormation
This makes AWS PCS an excellent choice for researchers, data scientists, and DevOps engineers who want to focus on workloads rather than infrastructure.
PCS Cluster Provisioning with Terraform
The PCS cluster was provisioned using Terraform with the awscc provider. The infrastructure is defined as code and includes the cluster, the login and compute node groups, and the job queue.
Note: The awscc provider was required because PCS resources are not yet supported by the standard AWS provider. It uses the AWS Cloud Control API to manage newer services like PCS.
📢 Official Announcement — Terraform Support for AWS ParallelCluster Service (March 2025)
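Using awscc typically means declaring it alongside the standard AWS provider. A minimal provider configuration might look like this (the version constraints and region are illustrative, not prescriptive):

```hcl
terraform {
  required_providers {
    # Standard provider, still used for resources like security groups
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
    # Cloud Control API provider, required for PCS resources
    awscc = {
      source  = "hashicorp/awscc"
      version = ">= 1.0"
    }
  }
}

provider "awscc" {
  region = "us-east-1" # example region
}
```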
A PCS setup typically includes three main components:
Cluster: Defines general settings, scheduler configuration, and networking.
Node Groups: Specifies login and compute nodes.
Queue: Manages job scheduling and execution.
Below is a simplified example of how a PCS cluster can be created using Terraform and the awscc provider:
Example: PCS Cluster Definition
# Create PCS Cluster using the awscc provider
resource "awscc_pcs_cluster" "this" {
  count = var.enable_cluster ? 1 : 0
  name  = var.cluster_name
  size  = var.cluster_size

  scheduler = {
    type    = "SLURM"
    version = var.cluster_slurm_version
  }

  networking = {
    security_group_ids = [aws_security_group.this.id]
    subnet_ids         = var.subnet_ids
  }
}
Key parameters such as cluster_name, subnet_ids, and cluster_slurm_version are fully parameterized, allowing the same configuration to be reused across multiple environments.
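The matching variable declarations might look like the sketch below. The names come from the resource definition above; the types, defaults, and example values are illustrative assumptions, not the article's exact configuration:

```hcl
variable "enable_cluster" {
  description = "Whether to create the PCS cluster"
  type        = bool
  default     = true
}

variable "cluster_name" {
  description = "Name of the PCS cluster"
  type        = string
}

variable "cluster_size" {
  description = "PCS cluster size tier (e.g. SMALL, MEDIUM, LARGE)"
  type        = string
  default     = "SMALL"
}

variable "cluster_slurm_version" {
  description = "SLURM version to run on the cluster (e.g. \"23.11\")"
  type        = string
}

variable "subnet_ids" {
  description = "Subnets used for the cluster's networking configuration"
  type        = list(string)
}
```

Keeping these in a separate variables file lets each environment supply its own tfvars file while sharing the same resource definitions.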
Running an HPC cluster can become costly, as each compute node is backed by Amazon EC2 instances that incur charges while running.
To control costs, a Terraform variable was introduced to enable or disable PCS cluster deployment:
# Control cluster deployment
enable_cluster = true    # create the cluster
# enable_cluster = false # skip deployment
This approach allows the cluster to be deployed only when needed. For example, during development or testing phases, the cluster can be disabled to avoid unnecessary expenses.
Custom AMIs play a critical role in HPC environments by allowing required software and dependencies to be preinstalled on compute nodes. This significantly reduces bootstrap time and ensures consistency across nodes.
The typical workflow is as follows:
1. Start from a base AMI (e.g., Ubuntu or Amazon Linux).
2. Install required system packages and runtime dependencies.
3. Build the AMI using Packer.
4. Use the custom AMI when provisioning PCS compute nodes.
Example Packer Template
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "{{user `aws_region`}}",
    "source_ami": "{{user `source_ami`}}",
    "instance_type": "{{user `instance_type`}}",
    "ssh_username": "{{user `ssh_username`}}",
    "ami_name": "{{user `ami_name_prefix`}}-{{user `distribution`}}-{{user `architecture`}}-{{isotime `2006.01.02-15.04`}}",
    "ami_description": "{{user `ami_description`}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo yum update -y",
      "sudo yum install -y gcc make python3"
    ]
  }]
}
This template can be extended to include additional dependencies as required by specific workloads.
SLURM was selected as the job scheduler within PCS due to its wide adoption and flexibility in HPC environments. It handles job queueing, scheduling, and node allocation, allowing efficient sharing of compute resources.
During the setup, Python 3.10 was required; however, the default Amazon Linux 2 AMI provides Python 3.7. To address this, a custom AMI was built using Packer based on the Ubuntu Server 22.04 LTS (arm64) Marketplace image.
The custom AMI included:
- Python 3.10 preinstalled
- Common HPC build tools (gcc, make, python3-pip)
- Additional Python libraries required by the workloads
Using a custom AMI ensured compatibility with the application stack and minimized configuration time when new compute nodes were launched.
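Because the base image was Ubuntu 22.04 LTS (whose default python3 is already 3.10), the provisioner section of a Packer template for this AMI would use apt rather than the yum commands shown earlier. A sketch (the exact package list is an assumption to be adjusted per workload):

```json
"provisioners": [{
  "type": "shell",
  "inline": [
    "sudo apt-get update",
    "sudo apt-get install -y gcc make python3.10 python3-pip",
    "python3 --version"
  ]
}]
```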
Once the PCS cluster was deployed, basic SLURM commands were executed to validate the environment:
# Display cluster and node status
sinfo
# Submit a test job (SLURM batch scripts must start with a shebang)
printf '#!/bin/bash\necho Hello from SLURM\n' > test.sh
sbatch test.sh
# View the job queue
squeue
These tests confirmed that the scheduler was active, compute nodes were available, and jobs were executed successfully across the cluster.
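Beyond a one-line smoke test, real jobs usually declare their resource needs through #SBATCH directives at the top of the script. Below is a minimal sketch; the partition name "compute" is a placeholder that would need to match the queue configured in PCS:

```shell
# Write a small SLURM batch script with explicit resource requests.
# "compute" is a hypothetical partition name -- match it to your PCS queue.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=hello_%j.out
echo "Hello from $(hostname)"
EOF
chmod +x hello_job.sh
# Submit with: sbatch hello_job.sh, then monitor with squeue
```

The --output pattern writes each run's output to a file named after the job ID (%j), which keeps logs from concurrent runs separate.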
AWS ParallelCluster Service provides a powerful and managed approach to running HPC workloads on AWS. By combining PCS with Terraform and the awscc provider, Infrastructure as Code principles can be fully applied to modern HPC deployments.
SLURM offers a familiar and flexible scheduling system, while custom AMIs built with Packer enable deep customization and reproducibility. Together, these components form a scalable and maintainable HPC platform well-suited for research workloads, AI/ML training, and large-scale data processing.