# Patterns for dealing with uncertainty

In programming we are often dealing with an uncertainity. We don't know,
we are not sure if something happened or not. Or we are not sure about
the result. Especially when it comes to networking/distributed systems but also in other
contexts. What I find interesting is that in many cases the techniques
used to handle such problems are very similar.

<!-- more -->

## Retries / At least once delivery

You tried to send something from computer A to computer B. It didn't work.
What do we do? One of the most often technique is to try to do it again.
Very simple isn't it?

We say _at least once delivery_ because computer B can receive our message
multiple times in case 1st time already worked but computer A wasn't sure
about it so it sent it again.

## Confirmations / Acknowledges

Sometimes it is enough that we know a message reached point B. But often
it is not enough. We also need to know that the message was successfully
recognized and processed. When it worked point B sends a confirmation
to point A.

Notice that what is just a delivery on a higher level (message reached point B)
requires a delivery and confirmation on a lower level (packets consisting the
message reached point B and acknowledge of it reached point A back).

## At most once delivery

Sometimes we prefer speed and smaller usage of resources over certainty or
reliability. In such case, we always send the message only once. Either it
will reach its destination or not. So the strategy is based on lack of
retries.

This can be used in many scenarios where the transmitted value is only valid
for a very short time anyway and next transmission will include a new
version anyway.

A mobile phone sending GPS positions every second. A computer game sending
player's positions constantly. A thermometer sending current temp. In such
cases, the logic behind those systems can probably use a previous value as
a good enough substitute of the new one, if it wasn't received. A new value
will be sent quickly anyway.

## Timeouts

We need timeouts because we cannot wait indefinitely for a confirmation of
a message. If the message was lost or the confirmation of it was lost
waiting longer won't change the situation.

When we reach the timeout we can schedule a retry or just move on depending
on the previously discussed strategies.

Of course, timeouts can cause false negatives. Our system reports a timeout
which is treated as a failure and one second later we can receive a message
saying that everything went OK. But we received it too late. In such case
sometimes we don't need to do anything (retry was already scheduled) but
sometimes we might need to compensate. We cancelled an order and now that
we received a payment confirmation after a timeout we need to also refund
the payment. That is an example of a compensation.

## Idempotence

Idempotence is a way to correctly handle duplicated messages received due
to retries and at least once delivery strategy. The idea is that when we
receive the same message multiple times it does not cause additional
important side-effects and the client is informed that the operation was
successful.

The repeated message may cause minor, unimportant side-effects. Maybe it will be
logged again, maybe some technical metrics will be increased again but
business-wise there is no visible effect.

For example, you can receive an information that a payment was successful.
So you trigger a state transition in your app, schedule order delivery,
email to customer etc. And 1 second later you receive from payment
gateway the same information that the payment was successful. You don't
send to the customer the products twice, you don't report twice that much
revenue for your startup and you don't send another email. You silently
ignore the information, but you respond with information that everything
was OK.

Sometimes idempotence can be achieved easily (state machine that can go
from state X to X without doing anything), but usually, it requires
the effort to detect such situation and handle it properly.

## Exponential backoff

There is often no point in continuing to retry at the same rate i.e.
every second. If the situation does not improve, with every retry
it is less likely that the system that we try to cooperate with will
self-heal. It's better to back off and keep trying but less and less
often. 1 second, 1 minute, 1 hour, 1 day, etc...

Also, some systems randomize retries to avoid a situation where thousands
of affected devices try to repeat something at the same time causing a
self-denial of service attack. Imagine a networking problem which causes
millions of devices running a chat application to disconnect immediately.
If they all try to reconnect instantaneously at the same time with the
same non-randomized strategy then your servers may not be able to handle it. But when
some retry after a second, another group after two seconds, and another after
three seconds, then the load on your server might be more tolerable.
Especially considering that initiating a connection can often be one
of the most expensive operation when systems try to sync their state.

## Commit log / Persistance

RAM is volatile, our application/database processes can die,
servers can be turned off. That's why for really important data
before sending it over the wire we first save it into a more safe
space. A disc.

That way in the worst case we can have a list of messages that
we wanted to send but never had the time to, or were not yet
confirmed. If something bad happens, we can re-read the list of messages
and send them again.

Almost every messaging system that is supposed to be reliable will
send messages to either a disc or replicas running on other machines
first before confirming that the message was queued.

## Sequence numbers

When we keep sending messages over the wire we can number them incrementally.
One, Two, Three, Four, Five...

When the other part receives them it can spot gaps and out of order messages
and request replies or re-order them. It can also confirm multiple messages
up to certain number with one reply (_i am at 5_) instead of confirming each
one separately (_got 1_, _got 2_, _got 3_...). Sequence numbers don't
need to be global. They can be defined per connection, session, stream, etc.

## Client side generated UUIDs

Often the content of a message does not uniquely identify it. For example, we
may receive an order for 2 iPhones. Many people could order 2 iPhones. So to
handle idempotency it might greatly help if every message/request is sent
with unique client side generated UUID. If the client repeats the message it
will use the same UUID. That makes the recipient's job of detecting
duplicates much easier.

## Correlation numbers

Correlation numbers are a mix of sequence numbers and UUIDs. Say a client
sends an order (UUID 321), and later informs about a successful payment (UUID 609
caused by UUID 321) but both messages can get lost.

When the server receives UUID 609 and sees that it is correlated to something with UUID 321
but it has not yet received 321 it knows that it cannot process 609 immediately.
It can save that information and wait for retry of 321 and only when it receives
it, the server will process 321 and then process 609.

In other words, correlation numbers can help you with retries/duplicates/out-of-order
messages which are related to the same business process.

## Reconciliation

Imagine a payment gateway. Your e-commerce system assumes that certain transactions
were successful. But that may be an incorrect or incomplete list. If your system was down
or had reliability problems it could have dropped some messages about successful payments.
If your system was down too long maybe all retries failed and you will never know
about this payment (which you probably should refund, or deliver or pay taxes of).

Unless there is a reconciliation process. It means that the payment gateway system
exposes API or downloadable file with a list of all transactions. Ideally, as immutable,
append-only list of transactions. In such case even long time later you can compare
the list of payments in your system, the list in their system and find discrepancies.

## Conflict-free replicated data types

This is a way of keeping and synchronizing data in your system in a special way.
Multiple independent nodes can even disconnect completely and when they reconnect
later and merge their data together you can be sure that all will reach the same
state. I think that if we, as humanity, ever go into stars we will need more of such structures
to effectively exchange data after disconnecting and reconnecting between multiple space
ships :)

## Summary

These techniques are either already used by databases of your choice or directly
in your applications. There are probably much more of them but these I am most
familiar with and are most popular, I think. But I am certain I missed some of
them. Let me know in comments about other techniques that you know about.
