Key Responsibilities
- Monitor cloud infrastructure and services to ensure availability, performance, and reliability.
- Manage cloud-related incidents, service requests, and problem resolution following ITIL processes.
- Perform root-cause analysis and support incident remediation and post-incident reviews.
- Support deployment, configuration, and maintenance of cloud resources on AWS, Azure, or GCP.
- Monitor and optimize cloud resource usage, costs, and performance.
- Assist with patching, upgrades, backups, and disaster recovery activities.
- Ensure cloud environments comply with security policies, governance standards, and best practices.
- Maintain operational documentation, runbooks, and standard operating procedures (SOPs).
- Collaborate with DevOps and engineering teams to support CI/CD pipelines and production releases.
- Track and report on operational metrics, SLA compliance, and system health using dashboards.
- Support automation initiatives to improve operational efficiency.
- Provide 24/7 on-call or rotational support as required.
Qualifications
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.
- Hands-on experience with at least one cloud platform: AWS, Microsoft Azure, or Google Cloud Platform.
- Knowledge of cloud services such as compute, storage, networking, and IAM.
- Experience with monitoring and logging tools (CloudWatch, Azure Monitor, Prometheus, ELK).
- Understanding of ITIL processes, including incident, change, and problem management.
- Familiarity with scripting and automation using Python, PowerShell, or Bash.
- Experience with ticketing tools such as ServiceNow, Jira Service Management, or Remedy.
- Basic knowledge of networking concepts (VPCs, subnets, firewalls, load balancers).
- Strong analytical, troubleshooting, and communication skills.