How The Pandemic Made Infrastructure Leaner

At CreditVidya, as at every other tech firm, cost optimization was the first thing business leaders rushed to do at the onset of the pandemic-induced economic slowdown. We achieved a cost reduction of 35% by optimizing our infrastructure on Amazon Web Services.

As a fast-growing startup working with big data, we opted for end-to-end cloud services for storage and to host our applications. Even with some optimization and efforts at operational efficiency, cloud services remain a significant part of our infra cost. Naturally, it was one of the main cost centres we tackled when Covid hit.

CreditVidya’s AWS footprint

[Figure: infrastructure usage]

Step 1: Cost analysis

Regular tagging of usage by team, environment and other key factors is a prerequisite for analyzing the cost of cloud usage. Because we already followed this best practice, we could analyse our usage patterns immediately: AWS Cost Explorer gave us a granular report, and it took only a short time to break costs down to the individual level. We found that 90% of our bill came from S3 and EC2.

So we tackled these two services by optimizing how we use them.
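As a rough illustration of how a tag-based breakdown like this can be pulled programmatically, here is a minimal sketch. It assumes the Cost Explorer `get_cost_and_usage` API via boto3; the tag key `team`, the dates, and the helper names are hypothetical and not taken from our actual setup.

```python
from collections import defaultdict
from decimal import Decimal


def build_cost_query(start, end, tag_key="team"):
    """Build keyword arguments for Cost Explorer's get_cost_and_usage.

    `tag_key` is a hypothetical cost-allocation tag name.
    """
    return {
        "TimePeriod": {"Start": start, "End": end},  # ISO dates; End is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [
            {"Type": "DIMENSION", "Key": "SERVICE"},
            {"Type": "TAG", "Key": tag_key},
        ],
    }


def cost_share_by_service(response):
    """Return each service's share of the total cost, in percent."""
    totals = defaultdict(Decimal)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]  # first key matches the first GroupBy entry
            totals[service] += Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {svc: float(amt / grand_total * 100) for svc, amt in totals.items()}


# With boto3 installed and credentials configured, the call might look like:
#   import boto3
#   query = build_cost_query("2020-04-01", "2020-05-01")  # hypothetical dates
#   response = boto3.client("ce").get_cost_and_usage(**query)
#   print(cost_share_by_service(response))
```

Grouping by service first and by tag second makes it possible to see both which service dominates the bill and which team is driving it.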

Optimizing S3 usage

After analyzing data usage patterns across all the teams, we were able to sort our data into four categories based on usage. We then created a lifecycle policy for each.

Type of data by usage | Action
Data that is not required after a definite amount of time | Purge
Data that might be used for some future purpose | Archive using S3 Glacier Deep Archive (lowest-cost option)
Specific data sets for which we do not know the usage pattern, i.e., a mix of frequent and infrequent use | Move to S3 Intelligent-Tiering (cheaper than S3 Standard)
Data that is used daily | Store in S3 Standard (default)

Doing this reduced our S3 costs by 30%.
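The four categories above could be expressed as an S3 lifecycle configuration along these lines. This is a sketch only: the prefixes, rule IDs and day counts are hypothetical placeholders, and the dictionary follows the shape expected by boto3's `put_bucket_lifecycle_configuration`.

```python
def build_lifecycle_rules(purge_days=90, archive_days=30, tiering_days=1):
    """Return a LifecycleConfiguration dict for put_bucket_lifecycle_configuration.

    All prefixes, rule IDs and day counts are hypothetical placeholders.
    """
    return {
        "Rules": [
            {
                # Data not required after a definite amount of time: purge
                "ID": "purge-temporary-data",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": purge_days},
            },
            {
                # Data that might be needed later: archive at the lowest cost
                "ID": "archive-for-future-use",
                "Filter": {"Prefix": "archive/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_days, "StorageClass": "DEEP_ARCHIVE"}
                ],
            },
            {
                # Unknown access pattern: let S3 move objects between tiers
                "ID": "unknown-access-pattern",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": tiering_days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            },
            # Data used daily needs no rule: it stays in S3 Standard by default.
        ]
    }


# With boto3 installed and credentials configured, applying it might look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=build_lifecycle_rules(),
#   )
```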

Optimizing EC2 use

Utilization of Reserved Instances (RIs)

RIs come with a long-term commitment, which is why it is important to use them efficiently. We found that we were underutilizing ours. So we reallocated them to the applications that could make the best use of them, ensuring they were 100% utilized.

Converting to a Savings plan

We had been using Standard and Convertible Reserved Instances for our applications. With the help of the AWS Support team, we reviewed AWS Savings Plans (Compute Savings Plans and EC2 Instance Savings Plans). We realized that these plans were better suited to CreditVidya, and switched to them. The plans now apply to our usage automatically every hour, so we no longer need to spend time adjusting reservations to the needs of each application. I strongly recommend that everyone with flexible compute needs check out these plans.

Using spot instances

Using spot instances allowed us to take advantage of unused EC2 capacity. What's more, they can cost up to 90% less than on-demand prices - a great way to reduce the cost of computing.

We had explored spot instances earlier as well, but we were hesitant to opt for spot instances for two reasons.

Reason 1: If the spot instance is terminated due to price issues, how do you ensure the maximum availability of the application?

Spot instances are fairly easy to manage with AWS Auto Scaling groups (ASGs) and a launch template, where you can specify the allocation strategy. We used the capacity-optimized allocation strategy because it leads to fewer interruptions. The trade-off is that the spot instances it picks can cost more than those chosen by a lowest-price strategy.

Note: Don't stick to a single instance type. Configure multiple weighted instance types: when one type is interrupted, the ASG can launch an instance of another type, which keeps availability high.
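A mixed-instances policy of this kind might be sketched as below. The launch template name, the instance types and the on-demand numbers are hypothetical; the dictionary follows the `MixedInstancesPolicy` shape accepted by boto3's `create_auto_scaling_group`.

```python
def build_mixed_instances_policy(launch_template_name, instance_types,
                                 on_demand_base=1, on_demand_percentage=50):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group.

    The on-demand defaults are illustrative assumptions, not recommendations.
    """
    if len(instance_types) < 2:
        # A single type defeats the purpose: there is nothing to fall back to.
        raise ValueError("specify at least two instance types")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type; the API expects the weight as a string.
            "Overrides": [
                {"InstanceType": t, "WeightedCapacity": "1"} for t in instance_types
            ],
        },
        "InstancesDistribution": {
            # Keep a small on-demand floor, fill the rest with a spot/on-demand mix.
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    }


# With boto3 installed and credentials configured, usage might look like:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(
#       AutoScalingGroupName="example-asg",        # hypothetical
#       MinSize=2, MaxSize=10,
#       VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # hypothetical subnets
#       MixedInstancesPolicy=build_mixed_instances_policy(
#           "example-template", ["m5.large", "m5a.large", "m4.large"]),
#   )
```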

Reason 2: How do you minimize the error rate when a spot instance terminates?

You can use AWS spot interruption notices to handle spot interruptions effectively. When Amazon EC2 is going to interrupt your Spot Instance, it emits an event two minutes beforehand, which can be detected by Amazon CloudWatch Events (now Amazon EventBridge). In our case, rather than handling each notice ourselves, we reduced the error rate by combining spot and on-demand instances in our ASGs.
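For teams that do want to react to the notice, a handler could look roughly like the sketch below. The event pattern and the `detail` fields follow the documented "EC2 Spot Instance Interruption Warning" event; the handler name and the returned dictionary are hypothetical, and the draining step is left as a comment.

```python
# Event pattern for a CloudWatch Events / EventBridge rule.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def parse_spot_interruption(event):
    """Return (instance_id, action) for an interruption warning, else None."""
    if event.get("source") != "aws.ec2":
        return None
    if event.get("detail-type") != "EC2 Spot Instance Interruption Warning":
        return None
    detail = event.get("detail", {})
    return detail.get("instance-id"), detail.get("instance-action")


def handler(event, context=None):
    """Hypothetical Lambda-style entry point for the rule above."""
    parsed = parse_spot_interruption(event)
    if parsed is None:
        return {"handled": False}
    instance_id, action = parsed
    # A real handler might deregister the instance from its load balancer
    # here so that in-flight requests can drain within the two minutes.
    return {"handled": True, "instance_id": instance_id, "action": action}
```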

Snapshot & Volume

Storing every historical snapshot is wasteful and expensive. We used AWS Trusted Advisor to catalog our resources and identify old snapshots that were no longer required, then deleted them to reduce storage. I recommend using AWS Trusted Advisor to review your resources regularly.

Most of our instances use an EBS volume as the root volume. Since we didn't have a good policy for choosing the EBS size when launching new instances, we had accumulated a lot of unused EBS capacity. We therefore determined the right EBS size for each application and recreated all instances accordingly. To increase efficiency and decrease cost further, we recently migrated from the gp2 to the gp3 EBS volume type.
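We relied on Trusted Advisor for this, but the same idea can be approximated with a simple age filter. The sketch below is a different, simplified approach: it assumes snapshot records in the shape returned by boto3's `describe_snapshots`, and the 90-day threshold is a hypothetical cut-off.

```python
from datetime import datetime, timedelta, timezone


def find_old_snapshots(snapshots, max_age_days=90, now=None):
    """Return IDs of snapshots started more than `max_age_days` ago.

    `snapshots` is a list of dicts with "SnapshotId" and a timezone-aware
    "StartTime", as in the describe_snapshots response.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]


# With boto3 installed and credentials configured, usage might look like:
#   import boto3
#   ec2 = boto3.client("ec2")
#   snaps = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]
#   for snapshot_id in find_old_snapshots(snaps):
#       print("candidate for deletion:", snapshot_id)
#       # ec2.delete_snapshot(SnapshotId=snapshot_id)  # only after review
```

Printing candidates before deleting anything leaves room for a manual review, since a snapshot's age alone does not prove it is unneeded.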
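One possible way to plan the gp2 to gp3 change and spot unattached volumes is sketched below. It assumes volume records in the shape returned by boto3's `describe_volumes` and uses `modify_volume` for the type change; the helper names are hypothetical and this is not necessarily how our migration was carried out.

```python
def plan_gp3_migration(volumes):
    """Return modify_volume keyword arguments for every gp2 volume."""
    return [
        {"VolumeId": v["VolumeId"], "VolumeType": "gp3"}
        for v in volumes
        if v["VolumeType"] == "gp2"
    ]


def find_unattached_volumes(volumes):
    """Return (volume_id, size_gib) for volumes not attached to any instance."""
    return [
        (v["VolumeId"], v["Size"])
        for v in volumes
        if v["State"] == "available"  # "in-use" means attached
    ]


# With boto3 installed and credentials configured, usage might look like:
#   import boto3
#   ec2 = boto3.client("ec2")
#   volumes = ec2.describe_volumes()["Volumes"]
#   for kwargs in plan_gp3_migration(volumes):
#       ec2.modify_volume(**kwargs)
```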

Infrastructure as code with ECS

Instead of manually managing our infrastructure and configuring a CI/CD pipeline with GitHub, we configured a CloudFormation template to automate the entire process of managing and configuring the infrastructure. This made management of infra easier in more than one way:

  • Simplifying the process of creating replicas via seamless deployment of multiple environments (Dev, Test, UAT, and Production) simultaneously
  • Easy integration with CI/CD pipelines
  • Simplified maintenance of multiple configurations for different AWS services
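To illustrate the multi-environment point, here is a minimal sketch of one template reused across environments. The template contents (a single tagged S3 bucket), the application name and the environment names are hypothetical placeholders, not our actual stack; the request dictionaries follow the shape of boto3's CloudFormation `create_stack`.

```python
import json

ENVIRONMENTS = ["dev", "test", "uat", "prod"]


def build_template():
    """Return a minimal CloudFormation template parameterised by environment."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "Environment": {"Type": "String", "AllowedValues": ENVIRONMENTS}
        },
        "Resources": {
            # Placeholder resource; a real template would define the whole stack.
            "ArtifactBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "Tags": [
                        {"Key": "environment", "Value": {"Ref": "Environment"}}
                    ]
                },
            }
        },
    }


def build_stack_requests(app_name):
    """Return one create_stack request per environment, sharing one template."""
    body = json.dumps(build_template())
    return [
        {
            "StackName": f"{app_name}-{env}",
            "TemplateBody": body,
            "Parameters": [
                {"ParameterKey": "Environment", "ParameterValue": env}
            ],
        }
        for env in ENVIRONMENTS
    ]


# With boto3 installed and credentials configured, usage might look like:
#   import boto3
#   cfn = boto3.client("cloudformation")
#   for request in build_stack_requests("example-app"):  # hypothetical name
#       cfn.create_stack(**request)
```

Because every environment is created from the same template, the environments differ only in their parameters, and the resulting resources are tagged by environment from the start.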

Through these initiatives, we reduced our costs by approximately 35%. The pandemic pushed us to optimize cost, put better governance on our infra, and engineer solutions that fundamentally changed how we manage it.

Summing it up

  • Always. Tag. Usage.
  • (Re)assess usage priorities based on business plans and forecasts.
  • Generate periodic usage reports and/or review usage based on your revised priorities.
  • Review plans periodically.