Using air quality data in the most professional, business-friendly way depends on consistent, stable systems.
When businesses decide to integrate real-time air quality data into their products and apps, the intention and expectation is for it to be available and functioning consistently without hiccups. Their customers rely on accurate and relevant data to reduce their exposure to harmful air pollution, and without this, these end users can’t make the best decisions possible for their health. At BreezoMeter, we deliver exactly this reliability, and now include an SLA as part of our contract with our customers.
Let’s start with the basics.
1. What is an SLA?
A Service Level Agreement, or SLA, is a commitment we make to our clients, promising to meet certain criteria, including that the services and data are up and running.
2. What elements are included in BreezoMeter’s SLA?
Uptime relates to the percentage of time that the service is available, meaning that when a user asks the system for a response, they get one. Depending on the data package your company has chosen, BreezoMeter offers XX% uptime. Your users will get a response at least XX percent of the time.
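As a rough illustration of the uptime definition above, here is a minimal sketch that treats uptime as the share of requests that received a response. The function names and the request-based interpretation are assumptions made for illustration, not BreezoMeter's actual measurement method, and the target value is passed in rather than assumed.

```python
def uptime_percent(total_requests: int, failed_requests: int) -> float:
    """Percentage of requests that received a response (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return 100.0 * (total_requests - failed_requests) / total_requests


def meets_uptime_sla(total_requests: int, failed_requests: int,
                     target_percent: float) -> bool:
    """True when measured uptime is at or above the contracted target."""
    return uptime_percent(total_requests, failed_requests) >= target_percent
```

For example, `meets_uptime_sla(1000, 5, 99.0)` compares a measured 99.5% against a 99% target taken from the customer's plan.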
Additionally, there is a latency SLA, which refers to the amount of time it takes for the system to return a response to the user. This number is on the order of milliseconds. Latency is defined per type of API call (real-time vs. historical API calls, for example) and is measured across all requests.
Graph representing average response time in milliseconds.
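To make the per-call-type latency idea concrete, the following sketch averages response times for each type of API call and reports the types whose average exceeds a threshold. The call-type labels and threshold values are hypothetical; the real SLA figures depend on the contract.

```python
from collections import defaultdict
from statistics import mean


def average_latency_ms(samples):
    """Average latency per call type.

    samples: iterable of (call_type, latency_ms) pairs covering all requests.
    """
    by_type = defaultdict(list)
    for call_type, latency_ms in samples:
        by_type[call_type].append(latency_ms)
    return {call_type: mean(values) for call_type, values in by_type.items()}


def latency_breaches(samples, thresholds_ms):
    """Call types whose average latency is above their (assumed) threshold."""
    averages = average_latency_ms(samples)
    return {
        call_type: avg
        for call_type, avg in averages.items()
        if call_type in thresholds_ms and avg > thresholds_ms[call_type]
    }
```

An average is used here only because the graph above reports average response time; a percentile-based measure would be an equally reasonable choice.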
BreezoMeter commits to data freshness as well. We run our algorithms every hour and calculate air quality data regardless of how new the data coming from each station is. Once a station has failed to report data for a certain amount of time, our algorithms exclude its stale data and fill in using the other layers of data available.
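The staleness rule can be sketched as a simple age check on each station's last report. The three-hour cutoff below is a placeholder chosen for illustration, since the actual cutoff is not stated here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: the real value is not specified in this post.
MAX_STATION_AGE = timedelta(hours=3)


def is_stale(last_report: datetime, now: datetime,
             max_age: timedelta = MAX_STATION_AGE) -> bool:
    """True when a station's latest report is older than the allowed age."""
    return now - last_report > max_age


def fresh_stations(last_reports: dict, now: datetime,
                   max_age: timedelta = MAX_STATION_AGE) -> list:
    """IDs of stations whose data is still fresh enough to use."""
    return [
        station_id
        for station_id, last_report in last_reports.items()
        if not is_stale(last_report, now, max_age)
    ]
```

Stations left out of the returned list would be the ones whose readings are dropped in favor of other data layers.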
Accuracy is an important part of air quality data, and we are continuously checking our accuracy. To learn more about how we continuously monitor accuracy, stay tuned to our blog, where we will have a whole post about our Continuous Accuracy Testing methods.
3. How do we stand by our SLA?
We monitor our commitments both in real time and in a monthly aggregation. BreezoMeter’s system includes a lot of moving parts, from the backend that processes all of the data and runs our proprietary algorithms, to the front-end that serves that data to our customers via our API. All of these parts need to be monitored for the system to remain stable and reliable.
Different metrics from all parts of our system (like uptime, current latency, freshness, etc.) are collected and stored.
The current status of the system is available graphically, so that any problems are easy to quickly identify visually.
Any problems that occur also trigger alerts through an alert management system. Alerts can fire before a problem has reached the SLA threshold, so that we can address it before it does. The goal is to avoid breaching any SLA threshold. To accomplish this, there are three levels of alerts: Warning, a non-intrusive alert notifying us that something may go wrong soon; Error, a more intrusive alert that something is wrong and needs to be handled soon; and Fatal, a critical problem (any SLA breach, like high latency, downtime, etc.). Fatal is a very intrusive alert that also calls our on-call team to handle the problem as soon as possible.
We also use external uptime tools to monitor our availability from outside our own infrastructure.
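One way to picture the three alert levels is as thresholds placed below the SLA limit, so that Warning and Error fire before a breach and Fatal fires on the breach itself. The sketch below applies this to latency; the 70% and 90% ratios are assumptions for illustration, not the thresholds BreezoMeter uses.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"  # something may go wrong soon
    ERROR = "error"      # something is wrong and needs handling soon
    FATAL = "fatal"      # SLA breached; on-call team is called


def classify_latency(latency_ms: float, sla_ms: float,
                     warning_ratio: float = 0.7,
                     error_ratio: float = 0.9) -> AlertLevel:
    """Map a latency reading to an alert level (ratios are hypothetical)."""
    if latency_ms > sla_ms:
        return AlertLevel.FATAL
    if latency_ms >= error_ratio * sla_ms:
        return AlertLevel.ERROR
    if latency_ms >= warning_ratio * sla_ms:
        return AlertLevel.WARNING
    return AlertLevel.OK
```

Checking the most severe condition first keeps each reading mapped to exactly one level.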
Long-term data collection requires different capabilities from those needed for real-time monitoring. For this longer-term picture, we collect all of our SLA commitment data into Google’s BigQuery so we can analyze all the data against our commitments, and see where we can further improve.
There are also alerts on long-term commitment breaches, so we are notified of any monthly breach of the SLA.
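The monthly check can be thought of as grouping stored records by calendar month and comparing each month's result with the commitment. The sketch below does this in plain Python as a stand-in for a warehouse query; the record layout and function names are assumptions, and it does not reflect the actual BigQuery schema.

```python
from collections import defaultdict


def monthly_uptime(daily_records):
    """Uptime percentage per month.

    daily_records: iterable of (date_iso, total_requests, failed_requests),
    where date_iso is a 'YYYY-MM-DD' string (assumed record layout).
    """
    totals = defaultdict(lambda: [0, 0])
    for date_iso, total, failed in daily_records:
        month = date_iso[:7]  # 'YYYY-MM'
        totals[month][0] += total
        totals[month][1] += failed
    return {
        month: 100.0 * (total - failed) / total
        for month, (total, failed) in totals.items()
        if total > 0
    }


def monthly_breaches(daily_records, target_percent):
    """Sorted list of months whose uptime fell below the target."""
    uptimes = monthly_uptime(daily_records)
    return sorted(m for m, pct in uptimes.items() if pct < target_percent)
```

Any month returned by `monthly_breaches` would be a candidate for the long-term breach alert described above.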
4. BreezoMeter’s SLA Journey for Air Quality Data
When BreezoMeter was starting out, there was a lack of visibility of what was going on in the system. The ability to be notified about problems as they occurred in real-time was missing.
Our system is built from a lot of different services and tools, from the back-end to the front-end, and as in all software, problems occur from time to time. Since we had limited visibility into what was going on in the different parts of the system, we often noticed problems only after they occurred, with no prior warning. Even after noticing a problem, finding its source often meant scanning the long logs of numerous services, a slow and painful process that meant a problem could last longer than necessary before it was resolved.
The first thing we had to do to address this lack of visibility was to start sending metrics from the different processes of the system. For this task, we started working with Prometheus as our time series database, and Grafana to visualize it. Initially, it was just about implementing some high-level metrics, like the starts and ends of various processes. With these basic visualizations, we were able to identify the places where issues occurred much faster, and scan only the logs of the specific service that caused the problem. This alone significantly reduced the amount of time required to debug a problem. Gradually we started using more and more metrics, measuring the performance of different services and different behaviors. This let us monitor our system health at a finer granularity, and helped us identify problems and improve our system performance. We were able to define thresholds that fire alerts immediately on any predefined issue, which means we can catch and handle problems at much earlier stages, before they can even reach the client or their end user.
5. Q&A with the BreezoMeter Customer Success Team
Q: What is it about BreezoMeter’s SLA that is most attractive to our customers?
A: Many say that the most sought-after elements relate back to the real-time air quality data that we provide, the minimal API latency, and our spotless 99+% uptime.
Q: What does this mean practically for our customers, in terms of their users' experience?
A: Latency: Customers’ real-time API requests are served quickly, so that client-side applications provide a great end-user experience.
Uptime: Our SaaS is robust and available 99% of the month (or more). Some customers are entitled to 90-95%, depending on their plan. Companies that integrate air quality data expect a stable solution, so uptime is a parameter that is important across the board.
This is extremely important to the pharma and medical device / digital health verticals, for example, which rely on our data and trust us with clinical trials, patient notifications, etc.
Are you ready to integrate air quality data into your product or technology, and need a stable, consistent and reliable partner?
Uri Hellerman is a developer at BreezoMeter, on the research and development team.