Protocol Buffers, Etch, Hadoop and Thrift Comparison

By: Thomas Bayer
Date: 03/30/2010
Last Update: 10/06/2011

Discontent with SOAP-based Web Services, REST and CORBA has given rise to several new projects for distributed computing. Big Web 2.0 companies such as Facebook and Google use these technologies to exchange huge amounts of data in real time. This article describes this development and compares the individual frameworks.

1. Why Not Use Web Services?

Applications built on different technologies can be connected via SOAP and the WS-* family of standards. But the Web Services stack is very complex, so many users consider Web Services overengineered and the learning effort too high. The overhead that XML adds is a further point of criticism.
Current Web Services stacks that use a modern StAX parser perform very well. For example, the round-trip time for simple calls has dropped below 1 ms. This is good enough for most business applications, but Web 2.0 and cloud computing require better performance. In contrast to order processing or electronic funds transfer, transactional safety is largely irrelevant for a web search or a social network. What is needed instead is a very fast exchange of huge amounts of data. So the demand for a fast, simple communication framework that is easy to deploy is not surprising.

2. Properties of Serialization Frameworks

Serialization frameworks have some properties in common, for example the use of a binary format to encode messages. The following paragraphs explain these common properties; the individual frameworks are introduced afterwards.

2.1 Interface Description

Web service interfaces are described using the Web Services Description Language (WSDL). Like SOAP, WSDL is an XML-based language. The new frameworks use their own languages, which are not based on XML. These languages closely resemble the Interface Definition Language (IDL) known from CORBA.

message Book {
 required int32 id = 1;
 required string title = 2;
 repeated string author = 3;

 message Author {
   required string name = 1;
 }
}
Listing 1: Description of an Interface for Google's Protocol Buffers

Code generators produce code for various programming languages from these descriptions. With the generated code it is easy to serialize an object or to call a service. Besides the enterprise languages C# and Java, popular scripting languages such as Perl, PHP, Python or Ruby are often supported as well.

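Library.Book book = Library.Book.newBuilder()
        .setId(1)                  // assumed value; id is a required field
        .setTitle("My Stories").build();
try (FileOutputStream fos = new FileOutputStream(new File("book.ser"))) {
    book.writeTo(fos);             // write the binary encoding to the file
}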
Listing 2: Object Generation and Serialization using Java and Protocol Buffers

The Web Services community has neglected these scripting languages. SOAP libraries are available for almost every programming language, but WSDL is unsupported or only incompletely implemented in many scripting languages.

2.2 Versioning

In most middleware technologies, adding a parameter makes a remote interface incompatible. This happens even with XML-based Web Services if the service designer has not planned ahead, for example by using optional elements with minOccurs=0 or anyType. By contrast, the new serialization libraries and frameworks were designed from the start to support evolutionary extensions. With most of the tools you can append fields to a data structure without causing failures on clients.
Protocol Buffers, Thrift and most of the other technologies described here use dynamically typed meta protocols, which send metadata along with the data. The receiver uses this metadata to map the fields, even when fields have been added or removed. Recompiling is not necessary. Of course this kind of versioning has its limits, but it avoids, for example, deploying new clients after the server starts accepting a new field. Statically typed systems such as CORBA or RMI would require an update of all clients in this case.
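
The effect of sending metadata along with the data can be illustrated with a small sketch. It uses a toy format invented for this example (one byte for the field number, one byte for the length, then the payload); this is an assumption for illustration and not the actual wire format of Protocol Buffers or Thrift. The class and method names are hypothetical as well.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy tagged format: [field number][length][payload bytes], repeated. Not a real wire format. */
class TaggedFieldsDemo {

    /** Appends one field as number, length and UTF-8 payload (payload must fit in 255 bytes). */
    static void writeField(ByteArrayOutputStream out, int number, String value) {
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        if (payload.length > 255) {
            throw new IllegalArgumentException("payload too long for this sketch");
        }
        out.write(number);
        out.write(payload.length);
        out.write(payload, 0, payload.length);
    }

    /** Reads only the field numbers the receiver knows; everything else is skipped by length. */
    static Map<Integer, String> readKnown(byte[] message, int... knownNumbers) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int pos = 0;
        while (pos < message.length) {
            int number = message[pos++] & 0xFF;
            int length = message[pos++] & 0xFF;
            for (int known : knownNumbers) {
                if (known == number) {
                    result.put(number, new String(message, pos, length, StandardCharsets.UTF_8));
                }
            }
            pos += length; // advance past the payload whether or not it was understood
        }
        return result;
    }

    public static void main(String[] args) {
        // A newer sender adds field 3, which an older receiver does not know.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeField(out, 1, "42");
        writeField(out, 2, "My Stories");
        writeField(out, 3, "added in a later version");

        // The older receiver still decodes fields 1 and 2 without being recompiled.
        System.out.println(readKnown(out.toByteArray(), 1, 2)); // prints {1=42, 2=My Stories}
    }
}
```

Because every field carries its own number and length, a receiver can step over fields it does not understand instead of failing on them.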

2.3 Binary Format

Binary formats are used to achieve higher throughput. In a binary format, the data type information and the field names are left out when data structures are transferred. Some tools therefore use field numbers to assign values to fields. To save even the last byte, Google's Protocol Buffers uses a variable-length encoding for integers that takes up only as much space as needed.
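
The variable-length integer idea can be sketched in a few lines. The code below implements a base-128 encoding with seven payload bits per byte, the scheme Protocol Buffers calls a varint. It is a minimal sketch that handles non-negative values only; the class name VarintDemo is hypothetical and the code is not taken from the Protocol Buffers library.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of base-128 variable-length integer encoding (non-negative values only). */
class VarintDemo {

    /** Encodes a value using 7 payload bits per byte, low-order group first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value > 0x7F) {
            out.write((int) (value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // last byte: high bit clear
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear marks the final byte
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        for (long v : new long[] {1, 127, 128, 300, 1_000_000}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values such as 1 or 127 fit into a single byte, while 300 takes two bytes (0xAC 0x02) instead of the four bytes of a fixed-width 32-bit integer.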

2.4 Footprint

The runtime libraries are small, often below 1 megabyte. For comparison, Apache CXF needs about X megabytes.

2.5 Performance

To keep things simple, the new frameworks leave out a lot, for example the extensibility of XML or the separation of metadata (header) and payload (body). Most of the frameworks also avoid HTTP. Therefore, performance is up to ten times better than with Axis2, .NET or the JAX-WS reference implementation. Of course, performance depends on the operating system, the programming language and the network.

3. The Frameworks

The following sections describe the frameworks and their peculiarities.

3.1 BSON

BSON stands for "Binary JSON". It is a binary data format modeled on JavaScript Object Notation. The document-oriented database MongoDB uses BSON as its storage format for documents and objects. The BSON implementation released by MongoDB can also be used on its own for serialization.
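
To give an impression of the format, the following sketch encodes the JSON document {"a": 1} by hand, following the layout in the BSON specification: a little-endian int32 with the total size, the elements, and a terminating zero byte. The class and method names are hypothetical; this is not the API of the MongoDB drivers, and only a single int32 field is supported.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hand-rolled sketch of the BSON layout for a document with a single int32 field. */
class BsonSketch {

    /** Returns the four bytes of value in little-endian order. */
    static byte[] int32LittleEndian(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    /** Encodes {name: value} as: int32 total size, one element, terminating 0x00. */
    static byte[] encodeSingleInt32(String name, int value) {
        byte[] key = name.getBytes(StandardCharsets.UTF_8);
        int size = 4 + 1 + key.length + 1 + 4 + 1; // size prefix, type, name, NUL, value, end marker
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(int32LittleEndian(size), 0, 4);
        out.write(0x10);                           // element type: 32-bit integer
        out.write(key, 0, key.length);             // field name as a C string...
        out.write(0x00);                           // ...terminated by a zero byte
        out.write(int32LittleEndian(value), 0, 4);
        out.write(0x00);                           // end of document
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeSingleInt32("a", 1)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints 0c 00 00 00 10 61 00 01 00 00 00 00
    }
}
```

Unlike the field-number approach described in section 2.3, BSON keeps the field names in the message, and the size prefix allows a reader to skip a whole document without parsing it.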

3.2 Etch

Etch is a framework for building and consuming network services. It was developed by Cisco, which contributed the project to the Apache Software Foundation. Etch uses the Network Service Description Language (NSDL) to describe interfaces. NSDL is also used to describe services, timeouts and authentication. Remote procedures can be called via TCP, UDP or SOAP.

3.3 Hadoop Avro

The Apache Hadoop project differs from the other projects: it is not just useful for serializing objects and for RPC. The project contains the following components:

Component   Description
Avro        Serialization of data
Chukwa      Data collection system
HBase       Distributed database for large tables
HDFS        Distributed file system
Hive        Infrastructure for data warehousing
MapReduce   Framework for processing large data sets
Pig         Language for modeling data flow and framework for distributed computing
ZooKeeper   Coordinator for distributed applications

Of these components, Avro is the one that can be compared to the other frameworks; Hadoop uses it for serialization. Hadoop's victory in the 2008 Terabyte Sort Benchmark demonstrates the performance of the Java-based framework: a Hadoop cluster at Yahoo needed only 209 seconds to finish the sorting task.

3.4 Hessian

Hessian is a binary protocol for web services developed by Caucho Technology. Caucho is well known for its Resin application server. Hessian is dynamically typed and therefore supports dynamically typed programming languages without the need for an interface description. Implementations of the Hessian protocol exist for many programming languages and platforms. With hessdroid, a port is also available for Google's mobile platform Android.

3.5 Protocol Buffers

Protocol Buffers is a format for data serialization. Google developed Protocol Buffers to solve problems with the versioning of interfaces. Today Google uses Protocol Buffers to store and exchange data in many of its internal applications. Google has published C++, Java and Python implementations of the Protocol Buffers format as open source software.

3.6 Thrift

Thrift originates from Facebook and is now hosted in the Apache Incubator. Besides language-independent serialization of objects, Thrift offers a stack for RPC. Like Google's Protocol Buffers, Thrift allows the evolutionary modification of data types.

4. Comparison

Table 2 lists the features and properties of the frameworks.

           BSON        Etch      Hadoop Avro   Hessian   ICE                    Protocol Buffers   Thrift
Initiator  MongoDB     Cisco     Apache        Caucho    ZeroC                  Google             Facebook
License    ASF 2.0 2   ASF 2.0   ASF           ASF       GPL and commercial 7   BSD                ASF
Versioning soft 6 soft 3 soft 3

Language Code
optional 5
Output Format Binary 4 Binary Binary Binary Binary,
Exceptions No
Cyclic Structs No
Type Inheritance No
Authentication No
Transport TCP, UDP,

Implementation Language Java
Minimal Footprint 1 288 k <400k

Table 2: Comparison of the Frameworks

Supported Languages

Table 3 lists the supported languages.

BSON Etch Hessian Avro ICE Protocol Buffers Thrift
Cocoa 8
Ruby 8

Table 3: Supported Languages


1 The Java versions were used for the comparison
2 The BSON libraries are contained in the MongoDB client drivers. The drivers are licensed under ASF 2.0
3 Allows the evolution of the interface
4 More output formats are available through extensions
5 Code generation is not necessary, not even for RPC. It can be used as an optimization for statically typed languages
6 Avro uses a schema that describes the data structure. Field names are specified in this schema so that differences can be detected
7 License: dual, GPL + commercial (free as long as you don't sell your app)
8 Language is only available on the client side

5. Contributions

Thanks to Karl Waclawek for additional details about the ICE RPC framework.