The internet is a violent place. Cleaning ladies stumble over network cables at unexpected moments, a power outage may kill your Dancer process the next second, and hackers try to block your IP address just for the sake of it. Individual peer processes may disconnect from the network at any time, either by choice or by accident. In P2P jargon this issue is known as network volatility. The capability to deal with the dynamics of a large grid computing network is called network volatility robustness.

In a global GPSdancer network solution, the outputs of every process in the network depend on every other process. The GPSdancer requirements regarding network volatility are therefore simple: the implementation must be 100% robust against any network anomaly that may occur more often than once in a lifetime, otherwise it will never be possible to combine thousands of Dancer processes on the internet in a coherent analysis.

How does GPS Dancer cope with computers that go off-line at awkward moments? Solving this particular problem has taken more effort than any other part of the Dancer development. The experiences of the Dancer project may be useful to other peer-to-peer projects, so this article discusses them in some detail.

First attempt: the military approach

The initial GPS Dancer defence against network volatility was based on the semper fidelis principle of the US marines. Always faithful. This approach aimed at maintaining the network structure for performing the square dance algorithm under any circumstances.

Network protection was implemented in the form of "buddy triangles". Any Dancer process that was involved in the square dance algorithm would have two other Dancer processes watching its back.  If the active node went off-line unexpectedly, it was almost immediately replaced by one of its two buddies. In turn, the monitoring tasks of this other process would be taken over by yet another Dancer peer. The buddy triangle approach will be explained here for historic reasons (lessons learned), but do not worry if you cannot follow all details. The system was complex, and in the end it just did not offer the required level of robustness.

The core network in the square dance exchange has a size that is the largest power of two that is not larger than the network size. This core network can always be split into two equal subsets of core nodes and spare nodes. Figure 1 shows a fragment of a network with 47 peers in it, which implies 16 core nodes, 16 spare nodes, and 15 fold nodes.
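As a minimal illustration of this counting rule (a sketch only; the class and method names below are ours, not taken from the Dancer code base), the three node-role counts follow directly from the total network size:

```java
// Sketch: derive the node-role counts described above from the network size.
public final class NodeRoles {

    /** Largest power of two that is not larger than n (n >= 1). */
    static int largestPowerOfTwo(int n) {
        return Integer.highestOneBit(n);
    }

    public static void main(String[] args) {
        int networkSize = 47;                              // total number of Dancer peers
        int coreNetwork = largestPowerOfTwo(networkSize);  // 32 for 47 peers
        int coreNodes   = coreNetwork / 2;                 // 16 core nodes
        int spareNodes  = coreNetwork / 2;                 // 16 spare nodes
        int foldNodes   = networkSize - coreNetwork;       // 15 fold nodes
        System.out.printf("core=%d, spare=%d, fold=%d%n", coreNodes, spareNodes, foldNodes);
    }
}
```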

Buddy triangles are now formed by means of a complicated set of rules which are known to each network peer (...because they all run the same software):

  • All spare nodes in the network connect to each other to form the so-called trace of spare nodes. This is a daisy-chain of all spare nodes, in order of increasing node numbers. 
  • The trace is made into a circular network when the last spare node (number 31 in this example) connects to the first one (number 16).
  • Each odd numbered spare node has a number like \(2m+1\), and connects to two even numbered core nodes with the numbers \(2m-16\) and \(2m-14\) (...for larger networks, this may become \(2m-32\), \(2m-64\), etc.)
  • The even numbered spare nodes \(2m\) connect to two odd-numbered core nodes \(2m-17\) and \(2m-15\).
  • Each spare node \(n\) also connects to the core node \(n-16\), and to the spare nodes \(n-2\) and \(n+2\).
  • The core nodes themselves build all connections to other core nodes that are needed according to the bit-wise orthogonality of the square dance algorithm (this is not indicated in Figure 1). 

All rules like \(2m-16\) require roll-over, to take into account that there are only 16 core nodes and 16 spare nodes in this example; the sketch below spells out the resulting connections.
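To make these rules concrete, here is a small sketch (our own illustration, not Dancer source code) that lists the connections of a single spare node in the 16-core / 16-spare example, with the roll-over applied:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the connections of a spare node n in the 16-core / 16-spare example,
// following the rules listed above (all names are illustrative only).
public final class SpareNodeConnections {

    static final int CORE = 16; // number of core nodes (= number of spare nodes)

    /** Wrap a core-node index into the range 0..CORE-1 (roll-over rule). */
    static int core(int i)  { return Math.floorMod(i, CORE); }

    /** Wrap a spare-node index into the range CORE..2*CORE-1 (roll-over rule). */
    static int spare(int i) { return CORE + Math.floorMod(i - CORE, CORE); }

    static List<String> connectionsOfSpareNode(int n) {
        List<String> links = new ArrayList<>();
        // Trace: previous and next spare node in the circular daisy-chain.
        links.add("spare " + spare(n - 1));
        links.add("spare " + spare(n + 1));
        // Second-neighbour spare links n-2 and n+2.
        links.add("spare " + spare(n - 2));
        links.add("spare " + spare(n + 2));
        // Upload/download link to core node n-16.
        links.add("core "  + core(n - CORE));
        // Buddy-triangle links to two core nodes of opposite parity.
        int m = n / 2;
        if (n % 2 == 1) {                       // odd spare node 2m+1
            links.add("core " + core(2 * m - CORE));        // 2m-16
            links.add("core " + core(2 * m - CORE + 2));    // 2m-14
        } else {                                // even spare node 2m
            links.add("core " + core(2 * m - CORE - 1));    // 2m-17
            links.add("core " + core(2 * m - CORE + 1));    // 2m-15
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println("node 22 -> " + connectionsOfSpareNode(22));
        System.out.println("node 23 -> " + connectionsOfSpareNode(23));
    }
}
```

For node 22 this yields the spare neighbours 21, 23, 20 and 24 and the core nodes 6, 5 and 7, which matches the buddy triangle 7-22-24 discussed below.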

The result of the above scheme is a spectacularly intricate set of network connections that looks a bit like the Eiffel tower on the internet.

The reason for having fold nodes is that the network size is usually not a precise power of two. The fold nodes upload their contribution to a core node before the start of the square dance process, and download the final result afterwards. They do not participate in the actual square dance exchange, nor in the buddy-triangle approach, because it is possible that there are no fold nodes at all (namely, if the network size is a power of two).

The spare nodes also upload their contribution to the core node \(n-16\), and receive the final result from that same core node after completion of the entire accumulation process. Because the spare nodes do not take part in the square dance algorithm, they are now free for guard duty: they will guard the core nodes with their lives.

Figure 1: buddy triangle connections in (...part of) a network of 47 peer processes

Each core node now forms a triangular mini-network with two spare nodes, called a buddy triangle. An example is the buddy triangle 7-22-24 that is highlighted in Figure 1. The spare node connection 7-23 is used by node 23 to upload its own vector before the square dance process, and to receive the final result afterwards; it has nothing to do with the buddy triangle.

The buddy triangles are formed in such a way that the involved internet connections never coincide with internet connections that are used in the accumulation process, which is obvious from the fact that each buddy triangle only includes one core node, while the square dance network only involves connections between core nodes. Each spare node is nominally involved in two buddy triangles. Each core node is protected by one triangle.

Figure 2: heartbeat signals among the three buddies in a triangle

The three GPSdancer peers in a single buddy triangle now transmit regular heartbeat signals to each other, at 3-second intervals. The contents of an outgoing heartbeat signal reflect the observed status of the incoming heartbeat signal from the third buddy (either healthy or off-line). For instance, in Figure 2 network node A informs node C that it still receives healthy heartbeats from node B (...or not); etc.

If one of the three buddies goes off-line, its heartbeat signals to the other two buddies will of course stop. After a brief tolerance period (8 seconds) the two remaining buddies will start to report this off-line status to each other, on the only surviving side of the buddy triangle. As a result, the two surviving buddies discover almost simultaneously that the third buddy has gone off-line. This conclusion is irrevocable: the third buddy is considered dead, even if its heartbeat signals resume a millisecond later. We say that the buddy triangle has triggered, and it can only do that once. The two remaining buddies will jump into action, and this action cannot be reversed.
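The bookkeeping behind this is small. The following sketch (illustrative only, with our own class and method names; the actual sending of heartbeats is not shown) tracks the 3-second heartbeats, the 8-second tolerance period and the irrevocable trigger as seen from one buddy:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the heartbeat bookkeeping described above, as seen from buddy A:
// A hears buddy B directly, and hears about B indirectly via buddy C.
public final class BuddyTriangleMonitor {

    static final Duration HEARTBEAT_INTERVAL = Duration.ofSeconds(3); // nominal sending rate
    static final Duration TOLERANCE          = Duration.ofSeconds(8); // brief tolerance period

    private Instant lastHeardFromB = Instant.now();   // direct heartbeats from buddy B
    private boolean partnerCReportsBDead = false;     // status carried in C's heartbeats
    private boolean triggered = false;                // a triangle can only trigger once

    /** Called whenever a heartbeat from buddy B arrives. Ignored once triggered. */
    public synchronized void onHeartbeatFromB() {
        if (!triggered) lastHeardFromB = Instant.now();
    }

    /** Called whenever a heartbeat from buddy C arrives, carrying C's view of B. */
    public synchronized void onHeartbeatFromC(boolean cStillHearsB) {
        partnerCReportsBDead = !cStillHearsB;
    }

    /** Status of B that this node puts into its own outgoing heartbeats to C. */
    public synchronized boolean iStillHearB() {
        return Duration.between(lastHeardFromB, Instant.now()).compareTo(TOLERANCE) < 0;
    }

    /** True once both surviving buddies agree that B is gone; the decision is irrevocable. */
    public synchronized boolean hasTriggered() {
        if (!triggered && !iStillHearB() && partnerCReportsBDead) {
            triggered = true;   // even if B's heartbeats resume a millisecond later
        }
        return triggered;
    }
}
```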

The reason for using triangles rather than pairwise buddy monitoring connections (such as 7-23) is that the latter cannot distinguish between a lost connection and a lost remote computer. A broken internet connection can usually be repaired, while a dead remote computer cannot. The buddy triangles are designed to detect, within seconds and unambiguously, whether a certain network node has gone off-line.

The response to a triggered buddy triangle is swift, and slightly different for a dead core node or a dead spare node.

If a core node goes off-line, one of the spare nodes in its buddy triangle immediately swaps its node number with that of the core node. For instance, if node 7 in Figure 1 goes off-line, node 22 turns into number 7, and (dead) node 7 becomes 22. The new buddy 7 immediately creates all internet socket connections that it needs (as node 7) in the square dance process. It downloads the result from the penultimate successful square dance cycle (...it is now connected to another core node that also has that result!) and it continues the square dance exchange in place of the dead node.

In other words, the spare node 22 takes over the role of the dead core node 7 in all details, and instead of a dead core node 7 we now have a dead spare node 22. Semper fidelis, or what: a spare node dies heroically instead of the core node, in order to ensure that the global network of core nodes remains intact under all circumstances.

If a spare node goes off-line (for instance, after replacing a dead core node) a new buddy triangle is formed immediately. In each of the two affected buddy triangles, the remaining spare node connects to the next (live) spare node in the trace, in the direction of the dead spare node. For instance, if node 22 in Figure 1 goes off-line, node 24 will immediately connect to node 21 and vice versa.

The elegance of this scheme is that if a core node goes off-line, one single new network connection immediately creates a new buddy triangle for monitoring the new core node. In our example, node 7 goes off-line and is replaced by node 22. At the same time spare node 24 connects to node 21. This new network connection builds the new buddy triangle 24-21-7 because the two connections 24-7 (formerly 24-22) and 21-7 (formerly 21-22) already existed before the old node 7 died.

If some spare node itself goes off-line (as opposed to replacing a dead core node) more than one new network connection is needed to form a new triangle, for instance 7-21 and 21-24. This is fine, because the spare nodes are under less time pressure than the core nodes.

Looking closely at Figure 1, it follows that the spare nodes mainly need to ensure that the trace remains intact: as soon as a spare node disappears, the two spare nodes that now form the loose ends of the trace work hard to connect to each other immediately. This is done by JXTA advertisements, because it is not known in advance to which other spare node the two need to connect: there may have been multiple network repairs, and there may be multiple nodes going off-line simultaneously.

After the network repair action some spare nodes may become involved in more than their nominal two buddy triangles, but that is only a temporary problem. During the next global network update (once per hourly Dancer run), the global network map is repaired. Gaps in the node numbers are filled in, and new, nominal buddy triangles are formed as necessary.

Of course, the entire system collapses if there are not enough spare nodes left to form triangles, but in a large network that can only occur in case of a serious anomaly. In that case, we give up.

The buddy triangle approach to automatic network repair evolved gradually over a long period, and is in fact incredibly clever, if we may say so. It can distinguish very quickly between a dead network node and a dead network connection. Also note that the entire process is decentralized: the network repair transactions do not require any form of central coordination, but are handled among a few local network nodes. It's like kicking a hole in an ant hill - it will be repaired by local worker ants (spare nodes), and most of the colony will never know about it. Even more remarkably: we managed to implement all of this in stable JAVA / JXTA code, and got it working on-line. Unfortunately, the approach merely works most of the time, while we want something that works all the time:

  • During the formation process of the buddy triangles, there are brief periods (at most a few seconds) where no reliable protection exists. If any of the involved computers goes off-line at that precise moment, the system gets confused, with unpredictable consequences. There has been some experimentation with redundant buddy triangles (each core node being monitored by two independent triangles) but this all became hopelessly complicated.
  • It is possible that both buddy triangles of a single spare node trigger almost simultaneously, and the two separate repair actions would require conflicting responses from the poor spare node trapped in the middle.
  • There is always the risk that two buddies in one triangle go off-line within a few seconds of each other, which would not be detected by the third buddy (in fact, the remaining buddy would conclude that it had gone off-line itself!). It was not possible to reduce the heartbeat intervals (and reduce the risk) because remember, we are using a JXTA layer underneath Dancer, and there may be communication delays due to relay peers and firewall crossings.

Heroic self-sacrifice looks rather silly when it fails to accomplish anything. GPSdancer needed a more fool-proof solution; we found it by studying a fool.

The pirate code

Watching the mishaps of Jack Sparrow, we learned about the pirate code. Pirates are much more practical than the US military, and say:

He who falls behind is left behind

In other words, it's every man for himself. Forget about monitoring your buddies. Avoid any risk of being dragged into a fight that you cannot win. Fight only to flee. Semper fidelis to yourself. Better that one dies that might have been saved, than that you die trying to save him.

This all sounds rather heartless, but pirates have no hearts, and neither do computers in a peer-to-peer network. If something goes wrong with the internet connections of your GPSdancer process (or with your computer), that's your problem. Solve it immediately, or die.

Like the Black Pearl shedding weight to outrun the navy, Dancer threw all buddy-triangle source code overboard (...a year of hard work gone to waste) and imposed fixed time-outs on all internet communications. If you do not respond within e.g. 30 seconds after a data request, the computer on the other end of the line will assume that you are dead and will never speak to you again.

Note that you can always return a "please hold" response to an incoming data request, to indicate that you are alive and well, but that your local Dancer analysis has not yet reached the point in its own process at which it can provide the requested data. This is perfectly normal: Dancer computers run at different speeds, and there is no strict synchronization other than by hourly process starts. After receiving a "please hold" signal, the remote Dancer process will wait for a few seconds, and repeat the request. This can go on for a long time, because it is not an anomaly. Only if you do not respond at all within the time-out interval will the remote process presume that you are dead, and close the connection. If your response arrives 0.001 seconds later, that's too bad.
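In code, the pirate rule boils down to a simple request loop. The sketch below is our own illustration (the real Dancer implementation runs on JXTA sockets and is not shown here; the Status and Reply names are hypothetical): a "please hold" reply is simply retried, while complete silence beyond the time-out is treated as death of the peer.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Sketch of the pirate-code request pattern described above.
public final class PirateRequest {

    enum Status { DATA, PLEASE_HOLD, NO_RESPONSE }

    record Reply(Status status, byte[] payload) {}

    static final long TIMEOUT_MILLIS = 30_000;  // e.g. 30 seconds; the round trip (not shown) gives up after this
    static final long RETRY_MILLIS   = 5_000;   // wait a few seconds after a "please hold"

    /**
     * Ask a peer for data. The supplier stands in for one request/response
     * round trip that itself blocks for at most TIMEOUT_MILLIS.
     */
    static Optional<byte[]> requestData(Supplier<Reply> roundTrip) throws InterruptedException {
        while (true) {
            Reply reply = roundTrip.get();
            switch (reply.status()) {
                case DATA:
                    return Optional.of(reply.payload());    // normal completion
                case PLEASE_HOLD:
                    Thread.sleep(RETRY_MILLIS);             // peer is alive but not ready...
                    break;                                  // ...so simply ask again
                case NO_RESPONSE:
                default:
                    return Optional.empty();                // peer is presumed dead: never speak to it again
            }
        }
    }
}
```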

The pirate code avoids the need for buddy triangles altogether, which is a great simplification. However - the square dance results can now become inconsistent at different Dancer peers, which is unacceptable. For this reason, each square dance process must now do some basic checks after completing the exchange sequence, to find out whether the process was successful. If not, it must perform some repairs, which are explained below.

Square dance checks for global consistency

Once a process is considered dead by another core node, that core node will ignore it completely. It simply continues with the next cycle of its square dance exchange scenario, knowing of course that an expected input contribution has been missed. This looks wrong, but it is in fact brilliant: such a brilliant insight that it resulted in the only ever form of payment during the development of the GPSdancer software. A fine bottle of Rioja Gran Reserva 1985 was sent to a post-doc at the Computer Science department of Berkeley University, who came up with this counter-intuitive solution.

By ignoring unresponsive exchange partners altogether, and simply continuing with the next cycle of the square dance exchange scheme, a healthy core node in the network will always reach the end of its prescribed exchange scenario, and without excessive wait states. The chance of going down along with another node in the network is now zero, which more than justified a bottle of wine. The Dancer network had indeed become fool-proof.

Of course, if one or more of the input contributions in the square dance scenario have not been received, the local process will not end up with an accumulation of all contributions in the entire network. At least one contribution is missing (...if the failure occurred in the first exchange cycle) but possibly half the network contributions are missing (...if the last exchange contact of the scenario was dead). In fact, if more than one exchange connection is unresponsive, a process might end up with an average vector based on much less than half the global network. In the extreme case of not having any functioning exchange sockets, a Dancer process will assume that it has lost all internet connectivity, and it stops with an error. Note also that failed exchange cycles affect two nodes in the current cycle, four in the next, eight in the cycle after that, etc. Many "downstream" network nodes of a dead core node will therefore have an incomplete result, even if all their own exchange contacts are healthy.

Fortunately, after completion of its square dance scenario each process will know if its solution is correct or not, because it knows how many original input contributions have gone into the final result. This number is exchanged along with the vectors, in order to compute proper weighted averages in each successive square dance cycle. If a final result is not based on all computers in the network, there is no reason to panic. As long as one or more of its core connections are still alive, each process can cross-check its accumulation result with that of the other core nodes to which it is connected. There are now two possibilities:

  1. If (...and only if) all its connected core nodes have found the same result as its own result, a core node will conclude that this is the best possible accumulation result in the global network. The square dance process has completed, and apparently one or more computers had gone off-line before being able to upload their contribution to at least one other computer. There is nothing to do about that, and it will not harm the result in any significant way.
  2. If one or more connected core nodes have an accumulation result that is based on more contributions than its own result, the core node will download this better result, and compare it again with the results from all its connected core nodes.
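The contribution count mentioned above is what both the weighted averaging and these cross-checks rely on. As a minimal sketch (class and field names are ours, purely for illustration), one merge step of the square dance exchange could look like this:

```java
// Sketch of one square dance merge step. Each partial result carries the number
// of original contributions it is based on, so that two partial averages can be
// combined into a proper weighted mean.
public final class PartialResult {

    final double[] average;       // the accumulated (averaged) vector
    final int contributions;      // how many original inputs it represents

    PartialResult(double[] average, int contributions) {
        this.average = average;
        this.contributions = contributions;
    }

    /** Combine this partial result with the one received from an exchange partner. */
    PartialResult mergeWith(PartialResult other) {
        int total = this.contributions + other.contributions;
        double[] merged = new double[average.length];
        for (int i = 0; i < merged.length; i++) {
            merged[i] = (this.average[i] * this.contributions
                       + other.average[i] * other.contributions) / total;
        }
        return new PartialResult(merged, total);
    }
}
```

If an exchange partner is dead, its contribution simply never enters such a merge, so the count faithfully tracks how many original inputs the final vector represents.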

This cross-check logic works beautifully, thanks to the redundancy in the square dance scheme (information travels via multiple paths) and the orthogonality of the square dance connections at every single network node. It is not possible that all connected core nodes find the same result unless that is the maximum possible result in the network. The only case in which this is not true is when a section of the network has cleanly separated from the rest of the network, which would be an amazing stroke of bad luck.

The iterative cross-checking continues until all connected core nodes agree on the same result. It is easy to see that this simple logic leads to a global spreading-out of the best possible result in the network among all surviving core nodes, at an exponential rate. If one core node has a better result than another core node to which it is connected, this other process will take the better result instead of its own.

In turn, the core nodes to which it is connected will adopt this better result, etc. As long as one computer in the entire network has found the correct global result, all others will receive it after a few cross-check cycles. As long as a core node is alive and connected to at least one other node in the network, it is guaranteed to end up with the best possible accumulation result in the entire network.
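Putting the two possibilities above together, the cross-check loop at one core node could look like the following sketch (again our own illustration, reusing the hypothetical PartialResult type from the previous snippet):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the post-exchange cross-check loop: keep adopting any better result
// seen on the surviving core connections until all of them report the same result.
public final class CrossCheck {

    /**
     * @param myResult     this node's own accumulation result
     * @param askNeighbour fetches the current result of one connected core node
     * @param neighbours   identifiers of the surviving core connections
     */
    static PartialResult settle(PartialResult myResult,
                                Function<Integer, PartialResult> askNeighbour,
                                List<Integer> neighbours) throws InterruptedException {
        while (true) {
            boolean allAgree = true;
            for (int node : neighbours) {
                PartialResult theirs = askNeighbour.apply(node);    // neighbour's current result
                if (theirs.contributions > myResult.contributions) {
                    myResult = theirs;          // adopt the better result...
                    allAgree = false;           // ...and compare again with everyone
                } else if (theirs.contributions < myResult.contributions) {
                    allAgree = false;           // neighbour has not caught up yet
                }
            }
            if (allAgree) return myResult;      // all connected cores report the same result
            Thread.sleep(1_000);                // give neighbours time to run their own checks
        }
    }
}
```

Each pass adopts the best result seen on the surviving core connections and only terminates once all of them report that same result, which produces exactly the exponential spreading described above.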

To illustrate the pirate code in practice, some examples can be found by following the links below, with step-by-step discussions of what exactly happens. All examples are based on a core network of 32 nodes (=\(2^5\)) so that there are five successive square dance exchange cycles. The examples become increasingly complex, so please follow them in the given order for a proper understanding:

Example 1: a single network node goes off-line during the third exchange cycle

Example 2: a single network node is already off-line before the start of the square dance exchange process

Example 3: both cases above happen within a single exchange process

Example 4: two dependent nodes are off-line before the start of the exchange process

Example 5: two independent nodes are off-line before the start, showing the need for checksums

 
