10 Best Practices for Cloud Infrastructure Optimization

Cloud Infrastructure Optimization

To avoid an unwelcome surprise where your cloud bill comes in thousands of dollars higher than expected while running critical applications such as EBS, ERP, or SAP on a public cloud like AWS, Azure, or Oracle Cloud Infrastructure (OCI), you have to track usage metrics such as utilization, capacity, availability, and performance. For example, to identify spending waste, it is fundamental to look at how much of a resource we’re actually using and compare that utilization with its provisioned capacity.

Furthermore, if we take action to reduce spend by shrinking the infrastructure footprint, we want to make sure we’re not impacting availability and performance. This is where cloud rightsizing comes into play.

Cloud Infrastructure Optimization: Best Practices

  1. Turn off idle instances: You can build a simple, scalable serverless mechanism to automatically shut down or delete Compute virtual machine (VM) instances that are marked as idle.
  2. Keep CPU usage at 55-65%: Do not oversize CPUs; start with what is required now and grow if needed. The best performance is typically obtained at 55-65% average CPU utilization, with spikes up to 95%.
  3. Remember that time of day is a key factor: During the workday, traffic gradually ramps up, requiring the resources of one to three additional machines, varying by day and time of day. Use horizontal scaling accordingly.
  4. Practice elastic computing: Instantaneous, pay-as-you-go access to VMs allows you to design an elastic computing environment that matches your needs.
  5. Understand what can and can’t be rightsized: Depending on the billing plan, there may be components that cannot be rightsized; addressing them will necessitate a change to the appropriate cost model.
  6. Leverage intelligent automation / auto scaling: Use an automated approach to increase or decrease the compute, memory, or networking resources allocated as traffic spikes and usage patterns demand.
  7. Review your cloud regularly, with critical evaluation of spikes: Whether cloud apps can handle scaling demands and unpredictable traffic spikes can make or break applications.
  8. CPU is not the only metric to monitor: Continuously observe metrics and the policies and rules that govern them. Beyond CPU, you can monitor memory and network bandwidth as well.
  9. Engage experts who can do this work with low-cost resources: Cloud specialist partners can be engaged to execute this task and continuously recommend best practices that optimize cost versus performance.
  10. Share regular reports with LOB stakeholders on performance vs. cost: While lines of business (LOBs) are where these costs are apportioned, it is important to keep them informed about the performance of their applications, which will eventually build more mindshare for future projects.
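Practices 1 and 8 can be sketched together: an idle check should combine more than one metric before an instance is shut down. The following is a minimal illustration in Python; the `Instance` fields, thresholds, and fleet data are hypothetical, and a real implementation would pull these metrics from your cloud provider's monitoring API before invoking its stop/terminate call.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    avg_cpu_pct: float    # average CPU utilization over the observation window
    network_kbps: float   # average network throughput over the same window

def is_idle(inst: Instance, cpu_threshold: float = 5.0,
            net_threshold: float = 10.0) -> bool:
    """Mark an instance idle only when BOTH CPU and network sit below
    their thresholds, so a low-CPU but network-busy box is not stopped."""
    return inst.avg_cpu_pct < cpu_threshold and inst.network_kbps < net_threshold

def instances_to_stop(fleet: list[Instance]) -> list[str]:
    """Return the names of instances a scheduled job could shut down."""
    return [inst.name for inst in fleet if is_idle(inst)]

fleet = [
    Instance("web-1", avg_cpu_pct=62.0, network_kbps=900.0),   # busy, keep
    Instance("batch-old", avg_cpu_pct=1.2, network_kbps=0.5),  # idle, stop
]
print(instances_to_stop(fleet))  # ['batch-old']
```

A scheduled serverless function running this kind of check against tagged instances is one common way to implement practice 1 without standing up dedicated tooling.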

LOB and operational needs should be a key consideration when following these best practices for optimizing cloud instances. It is important to critically assess each instance’s performance characteristics, risks, and application viability, along with standardizing policies for reviews and tracking.

Leverage ITC’s Oracle Cloud certified experts for accurate infrastructure visibility, to baseline application workloads and throughput before migration and assist in the initial sizing. We help you gain clear visibility into workload patterns on the cloud, such as what “normal” looks like and the height and frequency of peaks.

Pre- and post-migration (whether with load testing or with real customers in production), resource utilization, delivered experience, and throughput are collected to adjust the architecture you need on OCI. Timelines can be configured to focus on a specific test cycle or peak period.
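A baseline like the one described above boils down to summarizing a utilization time series into “normal,” peak height, and peak frequency. Here is a minimal sketch of that summary; the 1.5×-median peak rule and the sample data are illustrative assumptions, not a description of ITC’s tooling.

```python
import statistics

def baseline(samples: list[float]) -> dict:
    """Summarize a utilization time series: what 'normal' looks like,
    how high the peaks go, and how often they occur."""
    p50 = statistics.median(samples)          # "normal" level
    peak = max(samples)                       # peak height
    # Illustrative rule: count a sample as a peak when it exceeds 1.5x median
    peak_count = sum(1 for s in samples if s > 1.5 * p50)
    return {"normal_pct": p50, "peak_pct": peak, "peak_samples": peak_count}

cpu = [40, 42, 38, 41, 90, 39, 43, 95, 40, 41]   # hypothetical CPU % samples
print(baseline(cpu))  # {'normal_pct': 41.0, 'peak_pct': 95, 'peak_samples': 2}
```

Comparing this summary before and after migration makes it easy to see whether the new sizing still absorbs the same peaks.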

5 Key Considerations for Cloud Optimization

  1. Rightsizing for peaks: Determine your workload demand by considering the peaks that occur during your observation period and not the average utilization. You don’t want to end up in a situation when your resized instances are not able to handle peak workload anymore. If you experience peaks that are much higher than the average, consider serving such peaks by scaling out and distributing workload across multiple smaller resources.
  2. Assessing constraints: As you change your allocation size, the new size may be subject to constraints that you need to be aware of. Select only allocation sizes that are compatible with your workload requirements. For example, compute instance flavors are provided with a number of CPUs and an amount of RAM, but also network and storage bandwidth. If you are using only half the CPU but the entire storage bandwidth, sizing down an instance may negatively impact its storage performance. Similarly, if you’re running a 64-bit operating system, you can’t select a 32-bit instance, even if this is cheaper and can still deliver the performance you need.
  3. Mitigating availability risk: Some services, like compute instances, require a disruptive operation (a reboot) to change size. Conversely, some application PaaS services offer zero-downtime upgrades so that incoming requests aren’t dropped; Oracle Application Container Cloud, for example, can perform zero-downtime application updates. For services that require disruptive operations, factor in the availability risk and mitigate it by executing rightsizing only during maintenance windows and by limiting rightsizing activities to once a week or once a month.
  4. Mitigating performance risk: As you change your allocation size, the new size may not be able to deliver enough performance to serve your workload demand. Mitigate performance risk by inspecting application metrics from application performance management (APM) tools. Alternatively, rightsize in multiple steps and measure the performance impact at each step. Implement continuous rightsizing and be ready to size up as you detect performance issues.
  5. Starting with the top wasters: If you find a large number of rightsizing opportunities, start with the resources that have the highest costs and lowest utilization. Calculate the ratio between the two metric values, order the identified overprovisioned resources by that ratio in descending order, and tackle the list from the top down.
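The top-wasters ordering in point 5 is simple to compute once cost and utilization figures are in hand. A small sketch, with hypothetical resource names and figures:

```python
def rank_top_wasters(resources: list[dict]) -> list[str]:
    """Order overprovisioned resources by cost per utilization point,
    descending: the higher the ratio, the more waste per dollar spent."""
    ranked = sorted(
        resources,
        # Floor utilization at 0.1% so a fully idle resource doesn't divide by zero
        key=lambda r: r["cost"] / max(r["util_pct"], 0.1),
        reverse=True,
    )
    return [r["name"] for r in ranked]

resources = [
    {"name": "db-large", "cost": 1200.0, "util_pct": 8.0},   # ratio 150
    {"name": "app-vm",   "cost": 300.0,  "util_pct": 60.0},  # ratio 5
    {"name": "cache",    "cost": 500.0,  "util_pct": 4.0},   # ratio 125
]
print(rank_top_wasters(resources))  # ['db-large', 'cache', 'app-vm']
```

Working this list from the top down concentrates effort where the savings per rightsizing action are largest.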

Optimization is an efficient capacity management practice for any allocation-based cloud service. This practice is necessary to achieve savings because cloud providers ask their client organizations to choose an allocation size for the services they provision. However, as providers expand their serverless capabilities, the concern of dynamic capacity management will shift to the cloud providers themselves.

Providers that implement continuous rightsizing deliver serverless capabilities that dynamically scale services based on observed demand. Oracle Autonomous Database is an example of a cloud service for which the provider implements continuous rightsizing behind the scenes, relieving clients of this concern and unlocking cost benefits for dynamic workloads.