About a year ago, I came to the conclusion that a network of distributed services calling each other was a recipe for disaster. Here is a reminder of the advice I dispensed.
I think it is still good advice, but after researching some of the latest developments in the cloud/container/micro-services space I’ve started to relax my stance a little on how services interact with each other. Before I get to that point, let me share some history with you.
Let’s go back to 1999. Kind of like Prince, except not really.
In 1999, CORBA was a way of building distributed systems (DCOM was another one). It was basically RPC, except for objects, because objects – like cargo pants and the Star Wars prequels – were allegedly cool in 1999. Calling a remote service was as simple as sending a message to (or calling a method on) an object as if it were local (in the address space). The implementation details of the service that hosted the remote object were not a concern. The fact that there was a network between the client and server objects was not a concern from a programming model perspective. The client and the server could be written in different languages running in a heterogeneous computing environment. The contracts between the client and the server were defined in a (programming) language-agnostic language called IDL – the Interface Definition Language.
The magic that made this work was a heavyweight expensive piece of middleware called an Object Request Broker (or ORB). This all sounds wonderful. Except it wasn’t really.
The CORBA system that I was involved in suffered horrific latency and performance issues. The chief problem was that, yes, you need to pay attention to the network. It introduces latency and a huge opportunity to impact the overall reliability of the system, and so it cannot be ignored. Good OO design principles guide us to use small methods that do one thing and do it well. If you expose these small methods directly to distributed clients, you have in effect provided a fine-grained distributed interface. This is where the problems begin.
Consider something like this:
response = myRemoteObject->doSomething(aParameter);
Looks fairly innocent, but let’s look under the covers (my memory may be a little off, but I recall it looks something like this).
1. Locate the remote service – where does it live?
2. Make (or re-use) a connection to the remote service through the ORB
3. Encode the request into IIOP (i.e. serialization, or marshalling)
4. Walk down the network stack on the client side
5. On the receiving end, walk up the network stack (server-side)
6. Decode the request (de-serialization)
7. Invoke the request on the object
8. Actually doSomething()
9. Encode the response
10. Send the response back to the client
11. Walk down the network stack (server-side)
12. Walk back up the network stack (client-side)
13. Decode the response
14. Return it to the caller
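To make the overhead concrete, here is a rough sketch that totals up a per-step latency budget for the call above. The step names follow the list, but the numbers are purely illustrative assumptions on my part, not measurements:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative (made-up) per-step costs for a single remote call,
// to show how little of the total is the useful work in step 8.
class RemoteCallBudget {
    static Map<String, Double> stepCostsMs() {
        Map<String, Double> steps = new LinkedHashMap<>();
        steps.put("locate the service", 0.5);
        steps.put("connect through the ORB", 1.0);
        steps.put("encode request (IIOP)", 0.3);
        steps.put("network stacks + transit (request)", 1.4);
        steps.put("decode request", 0.3);
        steps.put("invoke on the object", 0.1);
        steps.put("doSomething() itself", 0.2);
        steps.put("encode response", 0.3);
        steps.put("network stacks + transit (response)", 1.4);
        steps.put("decode response", 0.3);
        return steps;
    }

    public static void main(String[] args) {
        Map<String, Double> steps = stepCostsMs();
        double total = steps.values().stream().mapToDouble(Double::doubleValue).sum();
        double useful = steps.get("doSomething() itself");
        System.out.printf("total %.1f ms, useful work %.1f ms (%.0f%%)%n",
                total, useful, 100 * useful / total);
    }
}
```

Even with generous assumptions, the useful work is a tiny fraction of the total – and that is the price of a single fine-grained call, paid again on every invocation.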
It’s quite easy to see where the latency is introduced. Every step is pretty expensive – just to actually do something useful (step 8). And as for resiliency – what if the service went away or became unreachable?
So what did we learn from this experience?
We learned that coarse-grained distributed interfaces work better than fine-grained ones. In other words, we learned to reduce the coupling (the number of interactions) between the distributed components of the system to reduce the opportunities for failure and latency.
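To illustrate the difference, here is a hypothetical customer service sketched both ways (the names and fields are made up for illustration):

```java
// Fine-grained: the client pays a network round trip per field.
interface FineGrainedCustomerService {
    String getName(long customerId);
    String getAddress(long customerId);
    String getPhone(long customerId);
}

// Coarse-grained: one round trip returns everything the client needs.
interface CoarseGrainedCustomerService {
    CustomerSummary getSummary(long customerId);
}

record CustomerSummary(String name, String address, String phone) {}

// A stub server-side implementation: one call instead of three.
class InMemoryCustomerService implements CoarseGrainedCustomerService {
    public CustomerSummary getSummary(long customerId) {
        return new CustomerSummary("Ada", "1 Analytical Way", "555-0100");
    }
}
```

Rendering one customer screen costs three round trips (three chances to fail, three latency hits) against the first interface, and one against the second.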
Jump forward to 2015. What’s changed? Well, IIOP has gone away, replaced by JSON+REST+HTTP. There is no ORB, so the client has to locate (or discover) the service somehow. SOAP web services came (and went). Infrastructure (thanks to the cloud) is (or should be) much more dynamic: hosts are set up and torn down often and continuously (i.e. you cannot rely on 10.0.22.10 always being there for you).
So, what is my advice now?
When my service has “something to announce” –
- use a message bus to send and forget (but what if the message bus is not available – good question!)
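One answer to that good question is a local outbox: try to publish, and park the message for retry if the bus is down. A minimal sketch, assuming a hypothetical MessageBus interface (a real system would persist the outbox durably and retry in the background):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bus abstraction – stands in for whatever broker you use.
interface MessageBus {
    void publish(String topic, String payload) throws Exception;
}

class AnnouncingService {
    private final MessageBus bus;
    private final Deque<String[]> outbox = new ArrayDeque<>(); // retry buffer

    AnnouncingService(MessageBus bus) { this.bus = bus; }

    void announce(String topic, String payload) {
        try {
            bus.publish(topic, payload);               // happy path: send and forget
        } catch (Exception busUnavailable) {
            outbox.add(new String[] {topic, payload}); // park it for a later retry
        }
    }

    int pendingRetries() { return outbox.size(); }
}
```

The announcing service stays available even while the bus is down; the messages are delayed, not lost.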
When my service needs “something from another service”
- still ask yourself – did we get the boundaries right?
- discover the service location dynamically through some discovery approach
- put a circuit breaker between you and the other service to prevent cascading failure and provide resiliency
- expect that the service you are depending on may not be available/reachable
- understand the costs of distributed calls (i.e. metrics)
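To make the circuit breaker idea concrete, here is a minimal hand-rolled sketch. Hystrix does much more (metrics, thread isolation, half-open probing), but the core mechanic looks like this; the thresholds are illustrative:

```java
import java.util.function.Supplier;

// Minimal circuit breaker: after enough consecutive failures, stop
// calling the remote service and fail fast to a fallback for a while.
class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;          // how long to stay open before retrying
    private int consecutiveFailures = 0;
    private long openedAt = -1;             // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis) {
            return fallback.get();          // open: fail fast, don't touch the network
        }
        try {
            T result = remote.get();
            consecutiveFailures = 0;        // success closes the circuit again
            openedAt = -1;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip the breaker
            }
            return fallback.get();
        }
    }
}
```

The important property is that once the breaker trips, callers stop queueing up behind a dead service – which is exactly how cascading failure is prevented.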
When my service needs to compose “many things from many different services” into a single response
- an API gateway is a good example of this
- parallelize the calls to the different backend services
- the latency of the composed response is then close to that of the slowest backend service, not the sum of all of them
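The fan-out above can be sketched with CompletableFuture. The backends here are stubs with made-up delays; the point is that the composed call takes roughly as long as the slowest backend (~120 ms here), not the sum (~210 ms):

```java
import java.util.concurrent.CompletableFuture;

class GatewayFanOut {
    // Stub backend: pretends a remote call takes the given time.
    static String slowCall(String name, long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return name;
    }

    static String compose() {
        // Kick off all three backend calls concurrently on the common pool
        CompletableFuture<String> user    = CompletableFuture.supplyAsync(() -> slowCall("user", 50));
        CompletableFuture<String> orders  = CompletableFuture.supplyAsync(() -> slowCall("orders", 120));
        CompletableFuture<String> reviews = CompletableFuture.supplyAsync(() -> slowCall("reviews", 40));
        // join() blocks for each, but they were already running in parallel
        return user.join() + "," + orders.join() + "," + reviews.join();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        String response = compose();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(response + " in ~" + elapsedMs + " ms");
    }
}
```

Had the three calls been made sequentially, the latencies would add up instead.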
There is a lot to pay attention to. So why have I relaxed my stance a little?
It’s mostly because I spent time researching Spring Cloud. They have made it trivial to build dynamic and resilient micro-services in the presence of direct point-to-point integration between services. They do this by integrating tried and tested technologies such as Hystrix (a circuit breaker implementation from Netflix), amongst others. It’s definitely worth looking into.