APIs are the core mechanism for decoupling front ends from back ends and for decomposing monolithic infrastructures into composable enterprises, in the spirit of what's known as digital transformation. They're currently the most significant enablers of innovation, mobility, and the Internet of Things (IoT). APIs let teams focus on their core value proposition while allowing customers to achieve bigger goals by connecting to data and functionality with the tools they prefer to use. But to deliver on these myriad benefits and objectives, teams must design APIs with scale in mind. Too often, the pressure to ship high-performing APIs that keep up with the business ecosystem leads development teams to build APIs that end up restricting business growth.
APIs built without scale as a consideration suffer from poor usability, limited availability, security vulnerabilities, and more.
Here at Raygun, we recognize that a high-quality API is pivotal to our business growth, and we cite scalability as a critical success factor. When we need to grow to meet customer demand, we must handle billions of data points comfortably and with minimal disruption. Our particular case is ingestion heavy: we receive billions of requests a day, and any delay in our API's response would mean a potentially bad experience for our customers.
The issues we cover here apply to both reading and writing data through APIs. We'll discuss how Raygun's development team manages the infrastructure and maintenance of our APIs to enable growth.
Why is a scalable API so important to software businesses?
Raygun drives better API-driven customer engagement through our use of SDKs. One of the goals of our SDK architecture is to be lightweight, with little to no impact on the customer's application performance. While our API itself is a relatively straightforward endpoint, we found that providing SDKs makes API access easier and less error prone, and lets us include niceties such as holding on to data when connectivity is interrupted and sending it later once the connection is restored.
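To make that concrete, here's a minimal sketch of that kind of SDK nicety. It isn't Raygun's SDK code; the endpoint URL, buffer file, and function names are illustrative. It simply buffers failed sends on disk and replays them when connectivity returns.

```python
import json
import time
import urllib.request
import urllib.error
from pathlib import Path

# Illustrative sketch only -- not Raygun's SDK. Failed payloads are buffered
# locally and retried before the next send, so data isn't lost offline.
API_URL = "https://api.example.com/entries"   # hypothetical endpoint
BUFFER_FILE = Path("offline_buffer.jsonl")

def _post(payload: dict) -> bool:
    """Try to deliver one payload; return True on success."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

def send(payload: dict) -> None:
    """Send a payload, flushing any previously buffered ones first."""
    # Replay anything that failed while we were offline.
    if BUFFER_FILE.exists():
        lines = [l for l in BUFFER_FILE.read_text().splitlines() if l]
        still_pending = [l for l in lines if not _post(json.loads(l))]
        BUFFER_FILE.write_text("".join(l + "\n" for l in still_pending))

    # Send the new payload; buffer it locally if delivery fails.
    if not _post(payload):
        with BUFFER_FILE.open("a") as f:
            f.write(json.dumps(payload) + "\n")

send({"message": "NullReferenceException", "timestamp": time.time()})
```

The point of the design is that the customer's application never blocks on our availability: a failed send costs one file append, and the data arrives once connectivity is back.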
As we accept large volumes of data from customers, managing our API effectively is critical to our business. If we can't receive data at volume, then our product would be useless. On average, we get thousands of requests per second to our API, with spikes into the hundreds of thousands per second, so we need to be able to handle a wide-ranging load.
Our product development is not all about data handling, however. A great UI and useful features are what customers want on the front end, and none of that is possible without a robust API behind it.
As Uri Sarid, CTO of MuleSoft, articulates so well, "Much like a great UI is designed for optimal user experience, a great API is designed for optimal consumer experience."
For our survival as a company, offering large customers superior data management and a great experience on the front end is mission critical, and we must be able to scale to meet larger customers' needs.
So how do we do it? At Raygun, we look at two main areas when building a scalable API: infrastructure and maintenance.
Build in layers of security
To mitigate risk, Raygun uses several layers of security for our APIs. All calls are made with a customer's API key and authentication credentials.
A simple first layer is offering a "regenerate authentication credentials" option: if you regenerate your credentials, the original credentials are no longer valid.
This is essential to protecting your system because it prevents anyone with malicious intent from gaining access to your account. For example, if a developer accidentally checks your credentials into a public repository, you can regenerate them and the exposed key will no longer be valid.
After authenticating your credentials, we generate a time-based token for subsequent API calls, which expires after 15 minutes.
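Raygun hasn't published the exact token format, so the following is only a minimal sketch of a time-limited token, assuming an HMAC-signed payload with an embedded expiry; the signing scheme, secret, and function names are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time

# Sketch of a time-limited API token. The 15-minute expiry mirrors the text
# above, but the signing scheme is an assumption, not Raygun's implementation.
SERVER_SECRET = b"server-side-signing-key"      # hypothetical secret
TOKEN_TTL_SECONDS = 15 * 60

def issue_token(api_key: str) -> str:
    """Issue a signed token that expires TOKEN_TTL_SECONDS from now."""
    claims = {"api_key": api_key, "exp": int(time.time()) + TOKEN_TTL_SECONDS}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SERVER_SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def validate_token(token: str) -> bool:
    """Reject tokens with a bad signature or a past expiry time."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SERVER_SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(body.encode()))
    return claims["exp"] > time.time()

token = issue_token("customer-api-key")
assert validate_token(token)
```

Because the expiry lives inside the signed payload, a leaked token is only useful for a few minutes, and the server never needs to store session state to enforce that.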
Raygun also employs an independent third party to run penetration tests (sometimes called pen tests) against the service every quarter, alongside automated security tests that run continually. As attackers become more sophisticated, you must keep investing in security.
Lastly, we undertake security training with our software team and ensure that pull requests are reviewed, with an eye toward security concerns, before they're merged.
Hosting on multiple servers
When scaling an API, an important step is to run the same code on every server that handles API requests.
Depending on how you've scaled your systems so far, remember that when anyone makes an API call, there's no guarantee about which machine will serve it: requests are bounced to different servers, so every server needs to be able to handle any request.
Raygun uses autoscaling groups to handle volume. An autoscaling group contains a collection of EC2 instances that share similar characteristics and are treated as a logical grouping for the purposes of instance scaling and management. We also rely on a reasonably sized "warm pool" of servers (ones already running and ready to receive requests) for sudden traffic spikes, which lets us keep providing a great customer experience even at busy times.
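As a rough illustration of the idea (not Raygun's actual configuration), here's how an EC2 Auto Scaling group with a target-tracking policy might be created with boto3; the group name, launch template, subnets, and sizes are all placeholders.

```python
import boto3

# Sketch: an Auto Scaling group plus a target-tracking scaling policy.
# All names and numbers below are placeholder values for illustration.
autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="api-ingestion-asg",
    LaunchTemplate={"LaunchTemplateName": "api-node", "Version": "$Latest"},
    MinSize=4,                      # baseline of servers always ready for requests
    MaxSize=40,                     # headroom for sudden traffic spikes
    DesiredCapacity=6,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # spread across availability zones
)

# Scale out or in automatically to hold average CPU near the target value.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-ingestion-asg",
    PolicyName="track-cpu",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```

Target tracking is what makes "scaling at the right time" automatic: the group grows and shrinks with load instead of waiting for someone to notice a spike.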
Use the right load balancer to autoscale
Using the correct load balancer for your system is very important for autoscaling your API.
The right load balancer increases your application's capacity and reliability by sharing the workload evenly across the pool of servers behind it. An ineffective load balancer does the opposite, and you may not even notice when a server falls over or when a critical error keeps recurring.
At Raygun, we use AWS load balancing, which is an effective way to build load balancing into our infrastructure so we can launch servers on demand.
A high-traffic application like Reddit (which scaled its infrastructure to 1 billion page views per month) uses a mix of load-balancing tools, such as HAProxy and Nginx, to direct traffic. In the HighScalability.com article "Reddit: Lessons Learned from Mistakes Made Scaling to 1 Billion Pageviews a Month," Jeremy Edberg explains how Reddit uses HAProxy for load balancing and Nginx to terminate SSL and serve static content, which enables Reddit to manage billions of data points effectively.
The infrastructure of your API will depend on many factors, but we've found the approach above very effective.
Horizontal scale
Horizontally scaling your API means adding more servers, rather than adding more powerful hardware to a single server (scaling vertically).
Horizontal scaling is the model used by companies like Facebook and Google, and it's also the model Raygun uses. We strongly recommend you do the same.
The main reason for horizontal scaling is that it lets our systems adjust to load dynamically by automatically provisioning (or deprovisioning) nodes, rather than making one system larger. If we lose a single node, the entire system doesn't collapse. More importantly, building our systems horizontally allows Raygun to scale at the right time.
Using horizontal scaling, the Raygun team found we could conserve server resources and add capacity to our environment only as new customers come on board. Be aware, though, of the fine line between improving the customer experience and biting off more than you can chew: as you add new customers, your load increases.
After horizontally scaling, if you find there's still a bottleneck, you can cache data to improve performance.
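As a sketch of that idea, here's a minimal in-process TTL cache wrapped around a hypothetical expensive read; in production a shared cache such as Redis would more likely sit behind the same pattern, but the shape is the same.

```python
import time
from functools import wraps

# Minimal in-process TTL cache for hot read paths -- a sketch of the idea,
# not Raygun's caching layer.
def ttl_cache(ttl_seconds: float):
    def decorator(fn):
        store = {}  # key -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]           # still fresh: skip the expensive call
            value = fn(*args)
            store[args] = (now + ttl_seconds, value)
            return value

        return wrapper
    return decorator

@ttl_cache(ttl_seconds=30)
def load_dashboard_summary(customer_id: str) -> dict:
    # Hypothetical expensive query that the cache shields from repeat traffic.
    return {"customer": customer_id, "errors_today": 42}
```

A short TTL keeps the data acceptably fresh while absorbing the repeated reads that tend to dominate bursty traffic.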
This is where we use our own tool, Raygun Crash Reporting, as a guiding light for understanding the capacity of our API. We pair Real User Monitoring with Crash Reporting so we can truly understand our software's performance.
Queue up everything for better API performance
To get the best performance from your API, do as little work as possible in the request path: hand the rest off to a queue (a way of exchanging work between systems with a built-in buffer for spikes in activity) and keep in-request processing to a minimum.
At Raygun, for example, as data comes in we perform basic validation and then pass the payload to a queue, which moves the work from one system to another in a scalable and redundant way. Backend workers (separate processes) then pick the next work item off the queue.
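Raygun uses RabbitMQ for this (see below), but the following is only a generic sketch with the pika client; the queue name and processing step are hypothetical. It shows the shape of the hand-off: validate, publish to a durable queue, and let workers acknowledge items as they finish.

```python
import json
import pika

# Generic RabbitMQ sketch -- not Raygun's code. The API process does only
# basic validation, then hands the payload to a durable queue; separate
# worker processes drain the queue at their own pace.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="crash_reports", durable=True)

def enqueue(payload: dict) -> None:
    """Called from the API request path after basic validation."""
    channel.basic_publish(
        exchange="",
        routing_key="crash_reports",
        body=json.dumps(payload),
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )

def process_report(report: dict) -> None:
    # Placeholder for the real work (storage, alerting, and so on).
    print("processed", report.get("message"))

def worker(ch, method, properties, body) -> None:
    """Runs in a separate worker process; does the heavy lifting."""
    process_report(json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Worker side: take one unacknowledged message at a time.
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="crash_reports", on_message_callback=worker)
# channel.start_consuming()  # blocks; run in the worker process
```

Persistent messages plus per-message acknowledgements mean a worker crash only delays an item rather than losing it, and adding capacity is as simple as starting more consumers.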
As we add more workers, we're careful not to reduce performance. How effective are our workers? We measure them, so we know well ahead of time whether we need more. Our consistent model is that every worker task emits analytics that are reported into a StatsD endpoint (DataDog, in our case). This lets operations monitor the health of individual workers and build dashboards showing overall system health.
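As an illustration of that pattern, a worker might report counters and timings like this; the metric names are made up, and the generic Python statsd client stands in for Raygun's actual instrumentation (DogStatsD speaks the same protocol).

```python
import time
import statsd  # generic StatsD client; DogStatsD accepts the same protocol

# Hypothetical worker instrumentation: every work item reports a counter and
# a timing so operations can chart per-worker throughput and latency.
stats = statsd.StatsClient("localhost", 8125, prefix="workers.crash_reports")

def do_the_work(item: dict) -> None:
    # Placeholder for the real processing step.
    pass

def handle_work_item(item: dict) -> None:
    start = time.monotonic()
    try:
        do_the_work(item)
        stats.incr("processed")
    except Exception:
        stats.incr("failed")
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        stats.timing("duration_ms", elapsed_ms)
```

With counters and timings per task, a dashboard can show both how fast the queue is draining and which individual worker types are falling behind.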
Through consistent monitoring, we understand the capacity of our API. Your API's traffic will likely look similar to ours: it comes in bursts and follows a business-hours pattern.
Raygun uses RabbitMQ for queuing and DataDog to monitor the capacity of our workers.