What is Apache Avro? Big Data Serialization
Apache Avro is a data serialization system designed for data-intensive applications. It is a favorite in the Big Data world, particularly with Apache Hadoop and Apache Kafka.
Introduction
Apache Avro provides rich data structures, a compact and fast binary data format, a container file to store persistent data, and remote procedure call (RPC).
Like Protobuf, Avro is schema-based, but with a key difference: when Avro data is read, the schema used to write it is always present. Because the schema fixes the structure, each datum can be written with no per-value overhead (no field names or tags in the encoded bytes), making serialization both fast and compact.
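To make the "no per-value overhead" point concrete, here is a hand-encoding of one record from the User schema shown later in this article, following Avro's binary rules (zig-zag varints for ints, length-prefixed UTF-8 strings, a branch index for unions, fields concatenated in schema order). This is a stdlib-only sketch for illustration, not the real avro library:

```python
# Hand-encode one User record with Avro's binary rules. The schema fixes
# the field order, so no field names or tags appear in the output.

def zigzag_varint(n):
    """Avro encodes int/long as zig-zag, then variable-length base-128."""
    z = (n << 1) ^ (n >> 63)      # zig-zag: small magnitudes -> small codes
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_user(name, favorite_number, favorite_color):
    """Fields in schema order; unions ["int","null"] -> branch 0=int, 1=null."""
    buf = bytearray()
    raw = name.encode("utf-8")
    buf += zigzag_varint(len(raw)) + raw                  # string: length + bytes
    if favorite_number is None:
        buf += zigzag_varint(1)                           # union branch 1 = null
    else:
        buf += zigzag_varint(0) + zigzag_varint(favorite_number)
    if favorite_color is None:
        buf += zigzag_varint(1)
    else:
        raw = favorite_color.encode("utf-8")
        buf += zigzag_varint(0) + zigzag_varint(len(raw)) + raw
    return bytes(buf)

payload = encode_user("Alyssa", 256, None)
print(len(payload), payload.hex())
```

The whole record fits in 11 bytes: one length byte, six bytes of name, and a handful of varint bytes — no field identifiers anywhere.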
Key Features
- Dynamic Typing: Avro does not require code generation, which makes it easy to use from dynamic languages.
- Schema Evolution: This is Avro's superpower. It handles schema changes (adding or removing fields) gracefully: as long as new fields declare defaults, old readers can read new data and new readers can read old data.
- JSON Schemas: Avro schemas are defined in JSON, making them easy to read and write.
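The core of the evolution rule for added fields can be shown in a few lines. This is a toy, stdlib-only illustration of schema resolution — the real Avro rules also cover aliases, type promotions, and unions — in which a reader using a newer schema fills any field missing from the writer's data from that field's "default":

```python
import json

# Old (writer) schema: only a name.
writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"}]}
""")

# New (reader) schema: adds a nullable field WITH a default, so old
# records remain readable. (Note "null" listed first so the default
# null matches the first union branch, per Avro convention.)
reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "favorite_number", "type": ["null", "int"],
             "default": null}]}
""")

def resolve(record, writer_schema, reader_schema):
    """Project a decoded writer record onto the reader's schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in writer_fields:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]   # fill from default
        else:
            raise ValueError(f"no value for {field['name']} and no default")
    return out

old_record = {"name": "Alyssa"}   # written with the old schema
print(resolve(old_record, writer_schema, reader_schema))
```

Remove the "default" from the new field and the same old record becomes unreadable — which is exactly the error the Avro libraries raise in that situation.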
Example Avro Schema (JSON)
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

Avro vs Protobuf
Avro stores the schema with the data (in container files) or exchanges it during the RPC handshake. Protobuf instead relies on generated code that knows the schema ahead of time, and tags each field in the encoded bytes.
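The "schema travels with the data" idea can be shown in miniature. The sketch below is NOT the real Avro object container format (which adds magic bytes, file metadata, block framing, codecs, and a 16-byte sync marker) — it only illustrates the concept that a reader with no prior knowledge can recover both the schema and the records:

```python
import json

# A self-describing blob: the writer's schema is stored alongside the data.
schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"}]}
records = [{"name": "Alyssa"}, {"name": "Ben"}]

# "Write": schema first, then the records.
blob = json.dumps({"schema": schema, "data": records}).encode("utf-8")

# "Read": a consumer that has never seen the schema recovers both.
doc = json.loads(blob.decode("utf-8"))
print(doc["schema"]["name"], len(doc["data"]))
```

With Protobuf, by contrast, this reader would need the generated classes for User compiled in before it could interpret the bytes at all.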
Avro is generally preferred in the Hadoop ecosystem (Hive, Pig) and Kafka because of its superior schema evolution capabilities, which are critical for long-term data storage where schemas change over time.