A service mesh is an infrastructure layer that facilitates communication between application components. Meshes provide functionality including service discovery, load balancing, and observability.
Service meshes are generally found in distributed systems composed of microservices. The mesh provides a way for different services to exchange data. It handles only internal traffic and hands off to an API gateway or edge node to serve users. Popular meshes include Linkerd and Istio.
What’s Inside The Mesh?
The basic concept of a service mesh is quite straightforward. The layer encapsulates network proxies that route traffic into individual services. A control layer manages network calls between the services and orchestrates the use of the proxies.
The mesh’s fundamental purpose is to make it easier and more reliable for services to communicate. Let’s take a simple pair of services: an API and a separate authentication system. The API will need to interface with the authentication service when it receives requests.
With a service mesh, you gain reliable communication between services using known identifiers. Your API could send network calls to the auth-service hostname and be confident it’ll still resolve correctly, even if the auth service gets moved to a different datacentre location. The service mesh acts to transparently route requests to the appropriate service.
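The idea of a stable name that keeps resolving after a move can be sketched in a few lines of Python. This is a toy in-memory lookup, not how Linkerd or Istio implement discovery; the ServiceRegistry class and the addresses are hypothetical and exist only for illustration.

```python
class ServiceRegistry:
    """Toy registry mapping stable service names to their current address."""

    def __init__(self):
        self._endpoints = {}

    def register(self, name, address):
        # Called whenever a service starts up or moves to a new location.
        self._endpoints[name] = address

    def resolve(self, name):
        # Callers only ever use the stable name, never a hard-coded address.
        try:
            return self._endpoints[name]
        except KeyError:
            raise LookupError(f"unknown service: {name}") from None


registry = ServiceRegistry()
registry.register("auth-service", "10.0.1.15:8080")
before_move = registry.resolve("auth-service")

# The auth service moves to another datacentre; the API's code does not change.
registry.register("auth-service", "10.8.3.2:8080")
after_move = registry.resolve("auth-service")
```

The API keeps asking for auth-service in both cases; only the registry entry changes.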
Using a service mesh makes it easier to move services around. Hard-coded network communication logic can quickly become constraining. A mesh gives you more flexibility to distribute your workloads in the future.
How Are Service Meshes Implemented?
Most modern service meshes are formed from two components: a control plane and a data plane. The core functionality belongs to the data plane, which concerns itself with the data flowing through the system.
Each request to the mesh will be processed by the data plane. It’ll discover the proper service to use, perform any logging actions, and enable conditional behaviors to filter and redirect traffic.
Data plane proxies usually work at the application layer, so they can apply routing logic. The data plane inspects configuration rules to determine the final endpoint based on HTTP headers, request parameters, and other values from the request.
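As a rough illustration of rule-based routing, the sketch below picks a destination from a request's path and headers. The Route class, the pick_destination function, the x-beta-user header, and the service names are all assumptions made for this example; real meshes express such rules in their own configuration formats.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    destination: str
    headers: dict = field(default_factory=dict)  # header values that must match
    path_prefix: str = "/"


def pick_destination(routes, path, headers, default):
    """Return the destination of the first route whose conditions all match."""
    for route in routes:
        headers_match = all(headers.get(k) == v for k, v in route.headers.items())
        if headers_match and path.startswith(route.path_prefix):
            return route.destination
    return default


# Rules are evaluated in order, so the more specific rule comes first.
routes = [
    Route("auth-service-canary", headers={"x-beta-user": "true"}, path_prefix="/login"),
    Route("auth-service", path_prefix="/login"),
]

canary = pick_destination(routes, "/login", {"x-beta-user": "true"}, default="api")
stable = pick_destination(routes, "/login", {}, default="api")
fallback = pick_destination(routes, "/orders", {}, default="api")
```

Here a beta user's login request goes to the canary deployment, other login requests go to the stable service, and everything else falls through to the default.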
While the data plane buzzes with activity, the control plane is simpler by comparison. It oversees the proxies in the data plane and provides an API so you can interact with them.
Istio’s architecture shows how the components fit together. Each service gets its own proxy, and ingress traffic flows through the proxies. They may communicate with the control plane to discover other services. Proxies are commonly called “sidecars” as they run alongside the service they represent.
Other Service Mesh Functions
Service meshes usually provide additional functionality beyond basic service discovery. You’ll typically find load balancing, request-level authentication, and support for data encryption. Encrypted traffic can pass through the mesh but get handed to your services in decrypted form.
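Load balancing is the easiest of these to show in miniature. The sketch below uses round-robin selection, which is only one of several possible strategies; the RoundRobinBalancer class and the replica addresses are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hands out endpoints in a fixed rotation, one per request."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = itertools.cycle(list(endpoints))

    def next_endpoint(self):
        return next(self._cycle)


# Three replicas of the same service sit behind one logical name.
balancer = RoundRobinBalancer(["10.0.1.15:8080", "10.0.1.16:8080", "10.0.1.17:8080"])
picks = [balancer.next_endpoint() for _ in range(4)]
```

The fourth request wraps around to the first replica again. A real mesh would also account for endpoint health and load, which this sketch ignores.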
Your mesh will usually offer safeguards against transient network outages. Automatic retries, failovers, and circuit breakers make communication between services more resilient. The end user’s less likely to see a serious error state. These important capabilities would be difficult and costly to implement by hand.
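To give a sense of what the mesh saves you from writing, here is a minimal sketch of retries combined with a circuit breaker. It is deliberately simplified: the CircuitBreaker class, its threshold, and the flaky_auth_call function are assumptions for illustration, and production breakers also include a timed "half-open" recovery state that is omitted here.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to send a request."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, func, retries=2):
        # Fail fast while the circuit is open instead of hitting a dead service.
        if self.is_open:
            raise CircuitOpenError("circuit open; request not sent")
        last_error = None
        for _ in range(retries + 1):
            try:
                result = func()
            except ConnectionError as exc:
                last_error = exc
                self.consecutive_failures += 1
                if self.is_open:
                    break  # stop retrying once the threshold is reached
            else:
                self.consecutive_failures = 0  # any success resets the count
                return result
        raise last_error


attempts = {"count": 0}


def flaky_auth_call():
    # Hypothetical call that fails once, then succeeds.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("transient network failure")
    return "token-ok"


breaker = CircuitBreaker(failure_threshold=3)
result = breaker.call(flaky_auth_call)
```

The caller sees a successful result even though the first attempt failed. If the failures continue until the threshold is reached, later calls are rejected immediately rather than being sent to a service that is known to be down.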
Similarly, service meshes incorporate performance optimizations that help to minimize overheads. The mesh might retain connections to services for later reuse, reducing the latency on the next request.
Mesh layers also provide security protections. You can implement mesh-level access control policies that reject traffic before it reaches your services. If you want a global API rate limiter, making it part of the mesh configuration will apply it to all your services – present and future.
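A global rate limiter is often built on a token bucket. The sketch below is one common approach, not the algorithm of any particular mesh; the TokenBucket class and its parameters are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allows short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = 0.0

    def allow(self, now):
        # Top up the bucket for the time elapsed since the last request.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected before it reaches any service


limiter = TokenBucket(capacity=2, refill_per_second=1)
decisions = [
    limiter.allow(now=0.0),
    limiter.allow(now=0.0),
    limiter.allow(now=0.0),  # bucket is empty at this point
    limiter.allow(now=1.0),  # one token has been refilled
]
```

Because the check sits in front of every service, a newly deployed service would be covered without any code of its own.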
What About the Drawbacks?
The biggest challenge around service meshes is the extra complexity they bring. While they aim to simplify microservice networking, they also introduce a learning curve of their own.
Developers will need to get acquainted with another layer atop their existing services. Mesh terminology, such as proxies and sidecars, adds to the already lengthy list of Kubernetes resources maintained by operations teams.
Although meshes do add their own performance enhancements, some deployments could notice an overall reduction in network throughput. The mesh adds a new layer that requests need to pass through, impacting overall efficiency. This is normally most noticeable if you stock your mesh with many routing rules.
When Should You Use a Service Mesh?
Service meshes work best when you’re fully committed to containers and microservices. They’re designed to solve the challenges of running large-scale distributed systems in production. Smaller deployments might still benefit from a mesh but could also utilize simpler networking approaches, such as Kubernetes’ built-in networking resources.
Meshes are most helpful when you’re frequently launching new services or distributing them across servers. A mesh lets developers focus on functionality, instead of the tedium of connecting different services together. As you deploy an ever-growing service fleet, you’ll spend more time handling service-to-service communication. A mesh automates most of this configuration so you don’t need to think about service discovery and routing.
The approach helps your system become more observable too. All requests flow through the mesh so you can implement infrastructure-level logging and tracing. This provides a simpler pathway to diagnosis and resolution of routing issues. Without a mesh, you might have dozens of services all relaying information directly to each other. This makes it hard to find the origin of a problem.
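The benefit of a single choke point can be sketched as a wrapper that records every service-to-service call. The traced_call function, the request_log list, and the service names are hypothetical; real meshes export this kind of data to dedicated logging and tracing systems.

```python
import time

request_log = []


def traced_call(source, destination, handler):
    """Run `handler` and record who called whom, the outcome, and the duration."""
    start = time.perf_counter()
    status = "ok"
    try:
        return handler()
    except Exception:
        status = "error"
        raise
    finally:
        # The record is written whether the call succeeded or failed.
        request_log.append({
            "source": source,
            "destination": destination,
            "status": status,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


response = traced_call("api", "auth-service", lambda: "token-ok")
```

Every call leaves a record in one place, so tracing a failure back to its origin does not require instrumenting each service separately.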
Most popular service meshes are easy to set up. Istio and Linkerd both have well-documented getting-started guides to help you deploy a mesh in your Kubernetes cluster. It’s worth experimenting even if you’re not sure your system’s at mesh scale yet. Sticking with direct service communication for too long could restrict your ability to launch new services in a timely and reliable fashion.