Re-Reading Tanenbaum’s Critique of RPC 30 Years Later

John Day | 2018

There is an old theory from 19th-century biology: “Ontogeny Recapitulates Phylogeny.”
What a phrase! It just rolls off the tongue and makes one sound so intellectual! ;-) It even shows up in song. It is the theory that embryos go through all of the stages of evolution that the organism went through. It turns out it isn’t true in biology, but it does seem to be true in Computer Science. About every decade or so, we have to recycle all of the bad ideas of the previous generation. We have been through the assembly language to high-level language transition at least 3 times!

In the late 1980s, Remote Procedure Call was a big fad in CS that lasted far longer than it should have and in some quarters never went away. The advent of multi-core processors and much more parallelism in computing has given it new life. 30 years ago, Andy Tanenbaum wrote “A Critique of the Remote Procedure Call Paradigm,” which generated a lot of discussion. Recently, the subject came up again and I re-read the critique, which rekindled many old, as well as new, thoughts on the subject.

I have never been a big fan of RPC. I just couldn’t see what the big deal was. Once in a while back then, I would meet a professor who would wax eloquent about RPC; my reply would be an excited, “O, yea! You mean like co-routines in COBOL?” Of course, this comparison filled them with horror and strong objections, but they were never able to explain the difference . . .

The problems with RPC are a good indicator of the problems that plague computer science itself. So let’s consider the issues Tanenbaum raises as well as a couple of my own:

A Taxonomy
A long time ago, I was taught that there is a sort of natural progression in capability: from a function, which is a module with no local variables, whose parameters are passed by value, which cannot reference global variables, and which returns a value; to a subroutine, which is a module with local variables, whose parameters are passed by reference, which does not return a value and can only modify its parameters (originally, subroutines could call functions, but functions could not call subroutines or other functions; these rules have been loosened a bit over the years); to a procedure, which has local variables (which may hold their value between invocations), whose parameters are passed by value or by name, so that ‘thunks’* are possible (or, with inferior hardware or a lazy compiler writer, call by reference replaces call by name and there are no thunks). A procedure can access variables outside itself if they are in scope, may or may not return a value, and can invoke other procedures, including itself.
* A thunk is passing an expression as a parameter by name. Each time the formal parameter is referenced in the procedure, the expression is re-evaluated.
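For the curious, here is a little sketch of the difference (in Python, which is anachronistic; the names are mine and purely illustrative). The thunk is simulated with a zero-argument closure:

# A sketch of Algol-style call by name simulated with a "thunk": the argument
# expression is wrapped in a zero-argument closure and re-evaluated on every
# reference to the formal parameter, so its side effects are seen each time.

i = 0

def bump():
    # an argument expression with a side effect, to expose the difference
    global i
    i += 1
    return i

def twice_by_value(x):
    # x was evaluated exactly once, at the call site
    return x + x

def twice_by_name(x_thunk):
    # every reference to the formal parameter re-evaluates the expression
    return x_thunk() + x_thunk()

print(twice_by_value(bump()))         # bump() runs once: 1 + 1 = 2
i = 0
print(twice_by_name(lambda: bump()))  # the thunk runs twice: 1 + 2 = 3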

1) Term Inflation. There is a widespread tendency in computing to adopt names for things that are more important-sounding than is warranted, or are just plain wrong: calling the graph of a network a topology; calling the chief engineer on a project the architect; calling a protocol an API; calling almost anything a new paradigm; etc. There seems to be some deep-seated insecurity in the field that feels a need to inflate the importance of the concepts we work with, as if we are trying to make them sound more complex instead of emphasizing their simplicity. All it really does is show the level of illiteracy in Computer Science and distract us from having a clear picture of what we are doing.

RPC is yet another example. Given the constraints on RPC, it certainly does not qualify as a procedure. (Tanenbaum points this out but goes no further.) RPC couldn’t even be called Remote Subroutine Call. It is closer to the semantics of a Fortran Function call! . . . or for that matter a COBOL co-routine. (see sidebar)

Precision of terminology is critically important. It lets us clearly see what is involved in a problem. Being imprecise tends to paper over what we have not thought about and to disguise what we think we know. It lets us delude ourselves. And to those who know what the words mean, it makes us look foolish. Precision is critical if we are to understand what we are doing. Often the deeper understanding that comes from precise terms will reveal solutions that were invisible before.

2) A Procedure is a Language Construct. Procedures exist as part of a programming language. The idea of using a procedure without the context and infrastructure of the programming language in which it exists is ludicrous. The construct of a procedure relies on the rest of the language infrastructure to make sense: What are its scoping rules? What sorts of parameter passing are available, e.g. by value, by reference, by name? Not to mention, there is no concept of the distributed program in which the Remote Procedure exists, nor of the distributed process instantiated when that program is executed, which defines the scope within which the procedure exists. All of this contributes to the chaos that is the eventual state of most RPC applications. There is no structure. To define a ‘Remote Procedure Call’ without support for all of this is absurd. This comes in the category of ‘what were they thinking?!’

3) Distributed Operating Systems. Tanenbaum makes a major point that RPC “is widely used in distributed operating systems.” However, it does not appear that the distributed operating systems community believes in using operating systems as a model for distributed operating systems. As noted above, a ‘procedure’ requires the context of the programming language, but it also requires the support of the operating system. RPC needs and must use the context of the Distributed OS. Where in the so-called “RPC paradigm” (another example of term inflation) noted in the first line of Tanenbaum’s critique is the operating system to support RPC? An OS consists of 4 fundamental functions: processor scheduling, memory management, interprocess communication (IPC) and the ability to manage these 3. The distributed aspects of these four are required to support RPC.

4) By the second paragraph of Tanenbaum’s Critique, we have already gone far beyond the problems he notes. It is no wonder that RPC has so many problems. It starts off failing to provide the basic infrastructure for RPC.

5) Networking is InterProcess Communication (IPC). Tanenbaum writes, “An alternative model that does not attempt any transparency in the first place is the virtual circuit model,” points at the OSI Reference Model, and goes on to talk about the ‘virtual circuit model.’ This corresponds to the early editions of Tanenbaum’s Computer Networks textbook, which had a lot of misconceptions (and has only slightly fewer now), so we can grant him a little leeway, but the description is inaccurate.

Neither networking (nor the OSI model) provides a ‘virtual circuit model.’ A virtual circuit is a specific network technology that consists of pre-defining a path for all packets to follow for the duration of the virtual circuit. The technology does not require packets to carry the complete address of the destination, only much shorter circuit or channel identifiers that identify the specific circuit between two points on the path. The identifiers are changed at every relay. If a link or switch goes down, the circuit has to be re-created from the beginning. This is not the model provided to the Application. Tanenbaum then goes on to say, “a full-duplex virtual circuit [sic] using a sliding window protocol is set up between the client and server.” Depending on the interpretation of ‘client’ and ‘server,’ this is wrong two ways. First, if it is a reference to some sort of error and flow control protocol, e.g. HDLC, TCP, OSI TP4, etc., such a protocol does use a sliding window, but it does not operate between the client and server applications; it is invisible to them, in the layer below, and hence of no relevance here. Or, if it refers to the boxes, the systems, acting as client and server, then this is an ITU beads-on-a-string model, but the communication is not between boxes but between processes in the boxes. Neither of these characterizes what networking does.
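For contrast, here is a toy sketch of what a virtual-circuit relay actually does (the table entries and names are made up for illustration); none of this machinery is what the layer presents to an application:

# A toy sketch of virtual-circuit forwarding (the table entries are made up).
# Packets carry only a short channel identifier; each relay rewrites it from a
# table installed when the circuit was set up, so no destination address is
# carried, and if a link or switch on the path dies, the tables die with it.

# per-relay table: (in_port, in_channel) -> (out_port, out_channel)
relay_table = {
    (1, 7):  (3, 42),
    (2, 42): (1, 7),
}

def forward(in_port, packet):
    in_channel, payload = packet
    out_port, out_channel = relay_table[(in_port, in_channel)]
    return out_port, (out_channel, payload)   # channel id swapped at every hop

print(forward(1, (7, b"hello")))              # -> (3, (42, b'hello'))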

The model that OSI (and earlier network architectures) provided to the application was IPC. Since the early 1970s the model for modern networking has always been operating systems and IPC, until the advent of the Internet. In fact, when RPC is between processes in the same or different systems, it is implemented by IPC. All network architectures model RPC as an application layer protocol.

Even more curious is Tanenbaum’s comment that networking ‘does not attempt any transparency in the first place’! It is completely transparent, delivering nothing more than what is passed to it. RPC, a procedure, is loaded with additional semantics. IPC assumes nothing: it allocates the resources for communication, transparently transfers data (perhaps many RPCs’ worth), and deallocates those resources when finished.

Part of the controversy in the 1980s was that some contended that RPC was more fundamental than IPC: that one should build networking from RPC, rather than the other way around. As evidence, the supporters pointed to the paper by Spector that Tanenbaum cites. Spector [1982] developed a taxonomy for the semantics of remote operations: maybe once, at least once, only once, etc. This is based on the number of messages exchanged and the range of failures that might be encountered, i.e. lost or corrupted messages, server crashes, etc. Spector ignores (or silently assumes) the fundamental conditions for this to work; they are all unstated assumptions. He assumes that it will not be necessary to fragment messages, or to have sequencing, acknowledgements (other than the response of the RPC), or flow control.
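Purely for illustration (the names and structure below are mine, not Spector’s), here is a sketch of ‘at least once’ versus ‘at most once’ semantics; notice how much machinery, request-ids, a cache of replies, retries, is already being smuggled in:

# Illustrative only: 'at least once' (retry until a reply is seen) versus
# 'at most once' (duplicates of a retried request filtered by request-id).
import uuid

class Server:
    def __init__(self):
        self.replies = {}          # request_id -> cached reply ('at most once')
        self.counter = 0           # the side effect we care about

    def execute(self, request_id, op):
        if request_id in self.replies:        # duplicate of a retried request
            return self.replies[request_id]
        self.counter += 1                     # side effect happens exactly once
        reply = f"{op}: count={self.counter}"
        self.replies[request_id] = reply
        return reply

def call(server, op, retries=3):
    # 'at least once': the client retries until it gets a reply back; without
    # the request-id filter above, each retry could repeat the side effect
    request_id = str(uuid.uuid4())
    reply = None
    for attempt in range(retries):
        reply = server.execute(request_id, op)
        if attempt == 0:
            continue                          # pretend the first reply was lost
        break
    return reply

s = Server()
print(call(s, "increment"))                   # increment: count=1, despite the retry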

However, Spector and Tanenbaum were unaware of an earlier, much more fundamental result [Watson, 1980]: the necessary and sufficient condition for correct operation of even a request/response protocol is an upper bound on 3 times: Maximum Packet Lifetime (MPL), the maximum wait time until an Ack (Response) is sent (A), and the time to exhaust retries (R). All RPC models will not only have to obey these bounds; in addition to a request-id that can be matched to the response, they will also need sequence numbers to order requests and responses that are longer than the maximum packet size, the means to detect lost, duplicate, or corrupted messages, to fragment and reassemble requests and responses, retransmission control, and flow control. None of these have anything to do with ‘procedure calls.’ In other words, RPC will have to include an error and flow control protocol in order to implement an error and flow control protocol. RPC has enough problems of its own to handle (which are ignored) without taking on the complicating issues of providing IPC with an error and flow control protocol.
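A back-of-the-envelope sketch makes the point (the figures are assumptions, not measurements, and the exact expression is Watson’s): state about a request identifier has to be retained for a period bounded by these three times before the identifier can safely be reused:

# Back-of-the-envelope only: the exact expression is in Watson [1980]; the
# figures below are assumptions, not measurements.
MPL = 120.0   # maximum packet lifetime, in seconds (assumed)
A   = 0.5     # maximum wait before the responder sends its Ack/Response (assumed)
R   = 30.0    # time for the requester to exhaust its retries (assumed)

# One conservative reading: the last retransmission can be sent as late as R,
# live for up to MPL, the response can be delayed by A and live for another MPL.
retention = 2 * MPL + A + R
print(f"hold request-id state for roughly {retention:.0f} seconds before reuse")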

Furthermore, it has been shown that the class of error and flow control protocols has a very different structure than RPC [Day, 2008]. Error and flow control protocols are symmetrical and naturally cleave into a ‘data transfer’ task, which does delimiting, ordering, fragmentation/reassembly, and data-corruption detection and updates a state vector, and a ‘data transfer control’ task, which reads the state vector and manages the slower, somewhat more complex feedback mechanisms of flow and retransmission control. Their only interaction is for the control side to impose flow control by turning off a queue once in a while. RPC requires IPC, but IPC does not require RPC, nor do any error and flow control protocols. Attempts have been made to use these asymmetric, RPC-related schemes to build IPC. They always result in overly complex solutions that are cumbersome to use. The asymmetry is a prime contributor to these problems.
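A rough sketch of that cleavage (the names and fields are mine, not from [Day, 2008]) might look like this: a fast data-transfer task updating a shared state vector, and a separate control task that reads the vector and exerts the feedback:

# Names and fields are illustrative, not from the reference.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class StateVector:
    next_expected: int = 0
    credit: int = 8                      # flow-control window granted to the peer
    received: deque = field(default_factory=deque)

def data_transfer(sv, pdu):
    # fast path: delimiting, ordering, corruption detection; updates the vector
    seq, payload, checksum = pdu
    if checksum != sum(payload) % 256:
        return                           # drop corrupted PDU; control recovers it
    if seq == sv.next_expected:
        sv.received.append(payload)
        sv.next_expected += 1

def data_transfer_control(sv):
    # slow path: reads the vector and exerts flow/retransmission control
    if len(sv.received) >= sv.credit:
        sv.credit = 0                    # 'turn off the queue': withhold credit
    return ("ACK", sv.next_expected, sv.credit)

sv = StateVector()
data_transfer(sv, (0, b"hi", sum(b"hi") % 256))
print(data_transfer_control(sv))         # ('ACK', 1, 8)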

It is not an uncommon occurrence that looking at a problem in a distributed environment uncovers assumptions that were not apparent in the single system environment. To truly understand something, it is often very illuminating to consider how it would be done in a distributed environment.

6) Section 2.1. “Who is the Server and Who is the Client?” The short answer might be: why does it matter? If we go back to our previous question, do these terms refer to processes or to boxes? The answer is still that it doesn’t matter. The ARPANET developers were careful to ensure that hosts could be both clients and servers, and there are many examples of processes that receive requests from another process, perform some function on the data, and pass it on to another process, etc. In fact, Tanenbaum provides an example using Unix pipes:

sort <infile | uniq | wc -l > outfile

 

And he asks: who is the client and who is the server? The key to the answer goes back to our 2nd comment: the lack of the infrastructure of the language. This construct uses a push-down stack, so that the output of what is inside the angle brackets is the input to sort. More generally, RPC does not support recursion. No surprise . . . neither does a Fortran Function call! ;-)

The other problem Tanenbaum is worried about is ‘what does wc do with its output?’ Here again, not defining the model of RPC and not stating its assumptions is at the root of the problem. Tanenbaum ignores the questions of where the command line came from in the first place, i.e. the shell process representing the user, and of how the file got accessed, i.e. a system process reading and writing the file. Once these are identified, it is clear who is doing what. And some processes do act as both client and server. Why not? Why should this be a big deal? Why can’t a “server” do an RPC? What is good for the goose is good for the gander! This is like preventing a procedure from calling another procedure. That’s absurd! O, that’s right! A Fortran Function call can’t call another Fortran Function. ;-)
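To make the point concrete, here is my toy rendering (not Tanenbaum’s code) of the pipeline as a chain of stages, each of which is a ‘server’ to the stage upstream and a ‘client’ of the stage downstream:

# Illustrative only: every intermediate stage both consumes requests from one
# side and issues them to the other, so the client/server labels name roles in
# an exchange, not processes.
def sort_stage(lines):
    yield from sorted(lines)

def uniq_stage(lines):
    previous = object()                # sentinel: matches no real line
    for line in lines:
        if line != previous:
            yield line
        previous = line

def wc_l_stage(lines):
    yield str(sum(1 for _ in lines))   # count the lines, like wc -l

data = ["b\n", "a\n", "a\n", "c\n"]    # stand-in for infile
# sort <infile | uniq | wc -l > outfile, as a chain of generators
print(list(wc_l_stage(uniq_stage(sort_stage(data)))))   # ['3']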

Ignoring an abuse of the phrase, this (and much of RPC) runs up against ‘the trick with reductio ad absurdum is knowing when to stop.’ They just keep going to the absurd. This is trying to generalize from concepts that can’t support the generalization. Overall, in the world of system design, symmetry is highly desirable and asymmetry is tolerated when necessary. RPC is at best tolerated, and best used only in the corner of the world where it is applicable. Its advocates have been trying to generalize from a special case that is too restrictive.

Slightly off-topic comment: recently, when the exciting new ‘paradigm’ of peer-to-peer [sic] appeared (here we go again!), I was excited to learn what this bold new networking concept was all about. When I asked one of its champions, the answer was “A host can be both a client and a server at the same time!” Errr, that was true the day the ‘Net was turned on. That has always been the case. To my shock, there was nothing beyond that. And to add insult to injury, all of the peer-to-peer protocols were client-server, not peer protocols at all.

7) Section 2.2 Unexpected Messages. There is only an ‘unexpected message’ if the process has no idea what to do with it. If there is a message the application understands, then it was expected. Of course, what this section is really about is that RPC doesn’t handle interrupts very well. Or should we say it doesn’t handle responses that weren’t requested very well? Some models have kludged this by having a request with multiple replies, which might never have a final response. Interrupts are part of any operating system. If RPC is more fundamental, then it must handle interrupts. [Note that if RPC had been invented before about 1960, when all operating system I/O was polled, there would have been no problem. Polling fits the RPC model.] The reason this is a problem is, again, the one listed above: there is no infrastructure to support RPC.

For this problem, Tanenbaum describes two situations: a terminal concentrator and sharing a file server. He indicates that RPC can work nicely for the terminal server, with the concentrator requesting input from the terminal, in other words, polling. Not really. I have written this code. It really needs to be interrupt-driven; otherwise too much time is lost polling terminals that have nothing to send, and that severely limits how many terminals can be supported. Even with a fast typist and echoing every character remotely (some systems do that), terminals are very slow devices.
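Some illustrative arithmetic (the figures are assumptions, not measurements) shows why polling does not scale:

# Illustrative figures only: with polling, the work grows with the number of
# terminals, not with the amount of typing.
terminals        = 200      # mostly idle
polls_per_second = 50       # per terminal, to keep echo latency tolerable
active           = 10       # terminals with someone actually typing
chars_per_second = 5        # a fast typist

polls  = terminals * polls_per_second    # 10,000 polls per second, mostly empty
useful = active * chars_per_second       # 50 characters per second of real work
print(f"{polls} polls/s to move {useful} chars/s; "
      f"only {useful / polls:.1%} of polls find anything")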

Tanenbaum should look at Telnet more closely. Contrary to what most textbooks say, Telnet is not a remote log-in protocol but a terminal device-driver protocol. And while most terminal-to-host protocols were asymmetric and would seem to lean toward RPC, Telnet is symmetric, making it a character-oriented IPC facility. This brilliant insight greatly increased its flexibility and created a model that made other issues simple. Again, not all problems are symmetric, but the solutions are better when they are. Telnet did have to solve the half-duplex terminal problem, and here again they found a brilliant way to see both cases as extremes of the same problem. The solution rested on having a clear picture of the model and on the observation that the protocol didn’t have to be half-duplex; only the API to the layer above had to handle half-duplex.

For another real example: when UIUC-CAC put the first Unix system on the ‘Net in 1975, the only “IPC” Unix had was pipes (a blocking, RPC-like facility). Because of limited kernel memory, Telnet had to be a user process. Telnet can expect input at any time from either the network or the user. To implement it, they had to have two processes, one for incoming traffic and one for outgoing, and hack stty and gtty for the necessary coordination between them. It worked, but it wasn’t pretty. The next thing that was done was to design and implement real IPC for Unix. ;-) RPC would have been just as bad. A similar experience occurred years later trying to do IPC on Apollo workstations, which only had an asymmetric, RPC-like mailbox facility. The result was cumbersome, complex, and a pain to build code on.

Tanenbaum’s file server example is another example of taking an idea beyond its domain of applicability. The idea that a process can only be a client or only a server is religion, not engineering.

8) More of the Same. For the rest of Tanenbaum’s critique, most of the problems can be traced back to issues already encountered. The single-threaded case, where the complaint is that an RPC can’t return a null or no-op, seems to be one of pedantry: why can’t an RPC return a null result? Procedures do! (Yea, okay, I remember: Fortran Function calls always have to return a value.) Again, the lack of the language infrastructure. The Byzantine Generals problem, that one can’t ensure a graceful close, is a problem of not distinguishing the application connection from the IPC connection; one more indication that RPC is not more fundamental than IPC. But most of the others are really more of the same: not being precise and not defining a model, not making an RPC a procedure call, not providing the distributed OS support for the RPC, etc.

Concluding Remarks. It has always seemed that what underlies the whole RPC phenomenon is a deep-seated fear among computer scientists of asynchrony. They go to great lengths to keep everything very deterministic, very linear, when the problems are precisely the opposite. Many of the problems encountered are from trying to make an inherently asynchronous problem synchronous.

We have found [Day, 2008] that there is only one application protocol and only 3 operators (and their inverses) that can be performed remotely: Create/Delete, Read/Write, and Stop/Start on an object model. At the application layer, the API ceases being IPC and becomes a programming-language interface. Those 6 operations and the object model on which they operate are the intermediate language to be compiled to. This makes the object model (perhaps restricted to a subset by an application connection) available to an RPC program: the beginnings of the language model for a distributed program.
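As a minimal sketch, and purely my illustration rather than the protocol in [Day, 2008], those six operations and an object model rendered as a programming-language interface might look like:

# Illustrative only: the six remote operations presented as a language
# interface over an object model, rather than as IPC.
from abc import ABC, abstractmethod

class RemoteObjectModel(ABC):
    """The only operations an application protocol ever performs remotely."""

    @abstractmethod
    def create(self, name, attributes): ...
    @abstractmethod
    def delete(self, name): ...
    @abstractmethod
    def read(self, name): ...
    @abstractmethod
    def write(self, name, value): ...
    @abstractmethod
    def start(self, name): ...
    @abstractmethod
    def stop(self, name): ...

class InMemoryModel(RemoteObjectModel):
    # a trivial local stand-in; a real one would sit on IPC underneath
    def __init__(self):
        self.objects = {}
        self.running = set()

    def create(self, name, attributes):
        self.objects[name] = attributes

    def delete(self, name):
        self.objects.pop(name, None)
        self.running.discard(name)

    def read(self, name):
        return self.objects[name]

    def write(self, name, value):
        self.objects[name] = value

    def start(self, name):
        self.running.add(name)

    def stop(self, name):
        self.running.discard(name)

m = InMemoryModel()
m.create("sensor1", {"unit": "C"})
m.start("sensor1")
print(m.read("sensor1"), "sensor1" in m.running)   # {'unit': 'C'} True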

If one uses OSs as the model for Distributed OSs (gosh, what a radical idea!) ;-) one finds that the purpose of ‘memory management’ in the Distributed OS is to make data available to the programs in a timely manner, i.e. to avoid the distributed equivalent of ‘page faults.’ The other mistake that most Distributed OS designs make is to treat remote storage (distributed paging) the same as local disk storage. How absurd! It is clearly part of a storage hierarchy, and where it belongs in that hierarchy is determined by the access time. This does imply that the storage hierarchy is relative to a given member of the distributed application: the same data stored in the same place will be at different levels of the hierarchy relative to different loci of processing. An interesting problem! The purpose, then, of distributed storage management is to make data available in a timely manner for each locus of processing. With this, one finds that distributed programming is local programming. A rather nice result. (A sort of decentralized solipsism.) But then, what does scope of reference mean in this environment? All of this could certainly put RPC in a different light.

All in all, RPC is not a paradigm. RPC is an abstraction. All abstraction is invariance. In a system design there are levels of abstraction. Each lower level of abstraction (invariance) is created within the bounds of the invariances of the higher levels. We have learned that it is important that the lower levels do not break the invariances. Breaking invariance gives rise to the infamous “surprise”: the devil is in the details. Our experience has shown, on the other hand, that working within the invariances often leads to discovering ‘angels in the details’: simple solutions appear that we didn’t expect.

There is probably a proper place for RPC in the levels of system invariances, but it certainly isn’t at the higher levels. Most of Tanenbaum’s examples show that.

It would be interesting to actually investigate the properties of a true Remote Procedure Call and see what its advantages and disadvantages are. What would a true remote procedure call in a distributed programming language look like?

___________________________

[1] In one of the longest-running musicals ever, The Fantasticks, in the song “Plant a Radish.”

[2] Full disclosure: I am a former Algol programmer on Burroughs machines, so I know what a procedure is supposed to be. ;-)

[3] More full disclosure: I was the Rapporteur for the OSI Reference Model. I have been through it word-for-word (arguing over most of them) more times than I care to remember!

[4] Contrary to the early direction of network research, the Internet was developed according to the ITU beads-on-a-string model. This can be seen today in its heavy use of ISDN terminology.

[5] The general statement is that ‘the necessary and sufficient condition for synchronization for reliable data transfer requires the bounding of these times.’  However, even unreliable transfers must obey them or never repeat a request identifier. “Never” is a very long time. Contrary to what is taught in all current networking textbooks, the so-called 3-way handshake has as much to do with establishing synchronization (connection establishment) as chanting ‘abracadabra!’

[6] The UNIX documentation is pretty clear about what is assumed by the syntax for command lines.

[7] This is also bad grammar, analogous to saying irregardless. Who else does a peer talk to than another peer? Peer-to-slave? Peer-to-master? Another example of the field adopting terminology that doesn’t reflect well on our intelligence.

[8] Also known as how to make a silk purse into a sow’s ear.

[9] Even though terminal handling is no longer a big issue, this is a wonderful example of how the right model of a problem makes the problem simpler. A good lesson in design.

[10] University of Illinois at Urbana-Champaign, Center for Advanced Computation.

[11] Yet more full disclosure: I wasn’t involved in doing that code, but my office-mate was. He and I are still working together. ;-)

[12] Today’s OSs should be distributed OSs, but few have taken notice of it.

[13] Think of a paradigm as the axiom set for a domain of discourse, e.g. the distinction between Euclidean and non-Euclidean geometry; terra-centric vs. heliocentric astronomy; classical vs. quantum physics; etc.

References

Tanenbaum, A. “A Critique of the Remote Procedure Call Paradigm.” https://www.cs.vu.nl/~ast/Publications/Papers/euteco-1988.pdf

Spector, A. “Performing Remote Operations Efficiently on a Local Computer Network,” CACM 25:4, April 1982: 246–260.

Watson, R. and Fletcher, J. “An Architecture for Support of Network Operating System Services,” Computer Networks 4, 1980: 33–49.

Day, J. Patterns in Network Architecture: A Return to Fundamentals. Prentice-Hall, 2008.

 
