What is observability and why is it important?
The idea of splitting an application into a set of small, interconnected services, known as microservices architecture, has enabled engineers to develop and release software faster than they ever could with monolithic architectures. This is great; however, it does not come without certain challenges.
Managing application performance proved to be one of those challenges. With today’s complex, distributed environments, it is extremely difficult to grasp the complete picture of all the services and the communication between them.
This is exactly where observability comes into play.
It helps us understand a system from the outside, without knowing each and every gear turning on the inside. All of this happens by extracting and analyzing telemetry data.
But what is telemetry data?
In distributed systems, telemetry data can be split into three major categories, commonly referred to as the three pillars of observability:
- Metrics – “Is there a problem?”
Metrics are numerical representations of data extracted from the application or one of its components over time. They can then be presented in the form of graphs, giving a holistic view of the health and performance of the system or some of its components.
- Traces – “Where is the problem?”
Traces represent the lifecycle of a request that has been made somewhere in the code. Although logs and metrics can be adequate for understanding the behavior and performance of the system, they do not usually provide helpful information for understanding the entire journey of a specific request or action as it moves through the distributed environment.
By analyzing a trace, a developer can better pinpoint bottlenecks, and resolve issues much faster or identify areas that need optimization and improvements.
- Logs – “What is the problem?”
Logs are text records of events that happened during the execution of some code within an application. They come in the form of plain text, structured, or unstructured entries. A log entry can then be examined by a developer to troubleshoot the code or confirm the execution of a certain block of code.
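To make the logs pillar concrete, here is a minimal sketch of an application emitting structured (JSON) log entries with Python's standard `logging` module; the service name, `log_event` helper, and fields are hypothetical.

```python
import json
import logging

# Emit each log line as plain JSON so downstream tools can parse it.
logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payment-service")

def log_event(event: str, **fields) -> str:
    """Serialize an event plus its context as a JSON log line and emit it."""
    line = json.dumps({"event": event, **fields})
    log.info(line)
    return line

entry = log_event("order_charged", order_id="A-1001", amount_usd=19.99)
```

A structured entry like this can be filtered and queried by field, which is much harder with free-form plain-text logs.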
Telemetry data is critical for every DevOps engineer to better understand these complex systems and their applications' behavior.
So what is OpenTelemetry all about?
Observability begins with instrumenting the application. That is, all of the code has to emit telemetry data. OpenTelemetry facilitates this by offering means to automatically or manually instrument your application under a unified framework.
Before OpenTelemetry, there was no such centralized and flexible observability solution. Engineers were stuck with specific vendors and back-end platforms. Collecting telemetry data from multiple sources in a large organization often involves different stacks and back-ends, each tailored to a particular team's needs.
This, however, makes it very difficult to get a single view of the performance of the entire system. With OpenTelemetry you can standardize the format of the collected data from every source and easily send it to various, or even multiple, back-end platforms. This gives you holistic insight into the entire system and also solves the vendor lock-in problem: OpenTelemetry is vendor-agnostic, so developers can switch back-end platforms at any point in time without the hassle of re-instrumenting their code. The automatic instrumentation capabilities are constantly growing thanks to the active community behind the project, and the list of language-specific integrations keeps expanding and maturing, with the most popular frameworks already covered. OpenTelemetry can automatically capture relevant telemetry data, in addition to handling context propagation to carry execution-scoped values across API boundaries.
Even though OpenTelemetry is a relatively new open-source project, it is already the second most active CNCF project, right after Kubernetes, and is maturing at a very rapid pace, setting its path to mass adoption by the industry.
How does OpenTelemetry work?
OpenTelemetry provides a framework that receives, processes, and exports telemetry data to a back-end of choice which will store and visualize the collected data. There are many moving parts when it comes to OpenTelemetry data collection. Let’s see the main components and how they work together.
APIs and SDKs:
APIs are used by developers to instrument their code. There are already APIs available across the most popular languages, such as Python, Java, Go, and so on. All of those APIs share the OpenTelemetry standard and will work with any OpenTelemetry-compatible back-end platform without the need to re-instrument the code. The APIs are divided into four parts:
- Tracer API which enables the generation of spans. A span is a named and timed operation that represents a contiguous segment of work. Multiple spans, grouped together, form a trace.
- Metrics API which provides various metric instruments, such as counters and observers.
- Context API which can be used by the developers to enrich the traces and spans with context to enable the propagation mechanism.
- Baggage API, which propagates a set of user-defined key/value pairs that can be used to annotate telemetry, adding information to metrics, traces, and logs.
The SDKs are also language-specific and support the same languages as the APIs. They collect the data and pass it forward to the in-process (within the application) processing and exporting phases. The processing phase deals with computing the gathered telemetry data, which is then handed to the in-process exporter. The in-process exporter translates the received data into the appropriate format, which can then be transported to the OpenTelemetry Collector or directed toward one of the supported back-end platforms.
The OpenTelemetry Collector is an optional proxy that receives, processes, and exports telemetry data. While the in-process exporter can transport the collected data directly toward the backend platform, the Collector gives the developer much greater flexibility, such as enabling various sampling strategies that might not be otherwise supported or exporting the data to multiple backends at once. The collector can also handle things like retries, batching, encryption, or even sensitive data filtering.
The Collector consists of three components: receivers, processors, and exporters.
- Receivers are used to obtain telemetry data and get it into the collector. The collector supports various receivers, covering many of the popular formats like Jaeger or Prometheus. It also supports the default OpenTelemetry Protocol (OTLP), a vendor-agnostic protocol designed to be flexible, reliable, and efficient.
- Processors allow us to manipulate the collected data by formatting, filtering, or otherwise enriching it before it gets transported to the export phase. The processors also support batching to compress the data better and reduce the number of outgoing connections needed for exporting it later.
- Exporters can transmit data to one or multiple back-end platforms or destinations such as console output or dumping data into a file. The main task of an exporter is to transform the data into the appropriate format and send it to the configured endpoints. This enables us to switch back-ends at any time without having to re-instrument the code.
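The receiver, processor, and exporter components above are wired together in the Collector's configuration file. A minimal sketch of a traces pipeline might look like this (the back-end endpoint is hypothetical):

```yaml
receivers:
  otlp:                # accept data in the OTLP format
    protocols:
      grpc:
      http:

processors:
  batch:               # batch data to compress it and reduce outgoing connections

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # hypothetical back-end

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Adding a second entry under `exporters` and listing it in the pipeline is all it takes to ship the same data to multiple back-ends at once.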
While the Collector is completely optional, and sending the telemetry data directly to the back-end platform of choice is supported, skipping it is not recommended for large-scale projects. The Collector can be deployed separately from the application, offloading the responsibility of managing telemetry data from the application and making the whole system much more efficient, not to mention that bypassing it forfeits all the neat possibilities for data manipulation that the Collector offers.
Why Cloud is Crucial to Leveraging Telemetry
Currently, not all telemetry data is captured due to storage space constraints. Saving all the data would be unfeasible, and most systems have the capacity to store only a few days or weeks of telemetry. To alleviate this issue, the data is often sliced into time series, but the storage burden remains significant, necessitating the eventual deletion of older data to make room for new, more pertinent information.
This is why advanced analytics services are commonly hosted in public cloud platforms. The immense storage and computing capacity, coupled with machine learning, provide the technological underpinnings necessary to collect, store, and process vast amounts of telemetry. By collecting telemetry from various points along the data path, a comprehensive data set can be obtained, leading to valuable insights for organizations. With these insights, patterns and relationships between seemingly disparate data points can be discovered, leading to improved business performance and customer experience. Therefore, it is essential to emit as much telemetry as possible from various points along the data path, enabling the system to search for patterns and relationships that can uncover actionable insights.
All in all, OpenTelemetry is a great way to standardize and make your observability solution much more sustainable, efficient, and flexible in the long run. The large community is making significant progress towards maturing the project and it is already being adopted by all sorts of large-scale organizations.