Serialization

JSON, Protobuf, MessagePack, CBOR, Avro — wire size vs CPU vs schema evolution.

Every byte that crosses the network or the disk has been serialised. The format you pick determines how big the bytes are, how fast they pack and unpack, whether old code can still read new data, and how much friction you'll have debugging in production. There is no single right answer — there's a right answer for this contract, and the contracts in your system probably want different formats.

Analogy

Picture three ways to send a recipe across the country. Option one: write a long letter explaining every step in plain English (JSON). The recipient can read it without any tools, and a typo doesn't kill them — they can still figure it out. Option two: send a printed form where every cell is labelled by number (Protobuf). Smaller envelope, faster to fill, but the recipient needs the same blank form to read it. Option three: send a heavily abbreviated shorthand only you and the recipient understand (MessagePack / CBOR). Smaller than the letter, more readable than the form, but won't make sense to anyone outside your shop. None is wrong; the best choice depends on whether you're posting it to one cousin or printing it on the box of every cake mix in the supermarket.

The format spectrum

Format      | Schema? | Wire size | CPU            | Self-describing | Schema-evolution support
JSON        | no      | large     | slow           | yes             | weak (field names)
YAML        | no      | larger    | slowest        | yes             | weak
MessagePack | no      | small     | fast           | yes             | weak
CBOR        | no      | small     | fast           | yes             | weak (RFC standard)
Protobuf    | yes     | smallest  | fastest        | no              | strong (field numbers)
Avro        | yes     | small     | fast           | partial         | excellent (schema registry)
Cap'n Proto | yes     | smallest  | fastest        | no              | strong (field numbers)
Flatbuffers | yes     | smallest  | zero-copy read | no              | strong
Thrift      | yes     | small     | fast           | no              | strong

The headline trade-offs:

  • No schema (JSON, MessagePack): easy to start, easy to debug, but every message spells out each field name as a string.
  • With schema (Protobuf, Avro): fields are identified on the wire by number (Protobuf) or by position (Avro), never by name; you need the schema to read the bytes; the result is smaller, faster, and evolves cleanly.
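
To put a number on the field-name overhead, the sketch below counts how many bytes of a compact JSON message are spent on keys. The record is the same small user object used later in this lesson; the variable names are illustrative only.

```typescript
// Measure how much of a compact JSON message is field names.
const record = { id: 42, name: "Ada", tags: ["new", "vip"] };

const utf8 = new TextEncoder();
const totalBytes = utf8.encode(JSON.stringify(record)).length;

// Each key costs its UTF-8 length plus two quotes and one colon.
const keyBytes = Object.keys(record)
  .map((key) => utf8.encode(key).length + 3)
  .reduce((sum, n) => sum + n, 0);

console.log({ totalBytes, keyBytes }); // { totalBytes: 43, keyBytes: 19 }
```

Nearly half of this message is key text, and that cost repeats in every record. A schema-based format pays it once, in the schema.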

JSON

The lingua franca of HTTP APIs. Strengths: universal tooling, debuggable with curl, every language ships a parser. Weaknesses: no native binary type (bytes must be base64-encoded), no integer/float distinction (most parsers treat every number as a double), no comments, no trailing commas, and verbose on the wire and slow to parse compared with binary formats.

{ "id": 42, "name": "Ada", "tags": ["new", "vip"] }

Use JSON when:

  • You expose an HTTP API to a third party.
  • The payload is small (< 10 KB).
  • Debuggability beats performance.
  • You don't control both ends.

MessagePack and CBOR

Same shape as JSON, smaller bytes. MessagePack came first; CBOR is the IETF-standardised format in the same family (RFC 8949), used in WebAuthn, COSE (signed messages), and CoAP. Both pack small integers into a single byte, store strings as length-prefixed UTF-8 bytes, and need no quoting or escaping.

import { encode, decode } from "@msgpack/msgpack";

const bin = encode({ id: 42, name: "Ada", tags: ["new", "vip"] });
// 28 bytes vs 43 for the same object as compact JSON
const back = decode(bin) as { id: number; name: string; tags: string[] };
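
To see where those bytes go, here is a minimal hand-rolled encoder for the tiny subset of MessagePack this object needs (positive fixint, fixstr, fixarray, fixmap). It is a sketch for counting bytes, not the @msgpack/msgpack implementation; encodeSmall and the Small type are hypothetical names.

```typescript
// Sketch: encodes only small values (ints 0..127, strings up to 31 bytes,
// arrays and maps up to 15 entries). Real libraries cover the full spec.
type Small = number | string | Small[] | { [key: string]: Small };

function encodeSmall(value: Small): number[] {
  if (typeof value === "number") {
    if (!Number.isInteger(value) || value < 0 || value > 127) {
      throw new RangeError("sketch handles integers 0..127 only");
    }
    return [value]; // positive fixint: the byte is the value itself
  }
  if (typeof value === "string") {
    const bytes = Array.from(new TextEncoder().encode(value));
    if (bytes.length > 31) throw new RangeError("sketch handles short strings only");
    return [0xa0 | bytes.length, ...bytes]; // fixstr: one length byte, then UTF-8
  }
  if (Array.isArray(value)) {
    if (value.length > 15) throw new RangeError("sketch handles short arrays only");
    const out = [0x90 | value.length]; // fixarray header
    for (const item of value) out.push(...encodeSmall(item));
    return out;
  }
  const entries = Object.entries(value);
  if (entries.length > 15) throw new RangeError("sketch handles small maps only");
  const out = [0x80 | entries.length]; // fixmap header
  for (const [key, item] of entries) {
    out.push(...encodeSmall(key), ...encodeSmall(item)); // keys are still strings
  }
  return out;
}

const packed = encodeSmall({ id: 42, name: "Ada", tags: ["new", "vip"] });
console.log(packed.length); // 28
```

The saving over JSON comes from dropping quotes, colons, commas and brackets. The field names themselves are still on the wire, which is why the schema-based formats below go smaller still.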

Use MessagePack / CBOR when:

  • You want JSON's shape but smaller / faster.
  • You don't have a stable schema (or don't want to manage one).
  • You're targeting an environment that values bytes (mobile, IoT, signed payloads).

Protobuf

Schema-based binary serialisation by Google, the workhorse of gRPC. You write a .proto file:

syntax = "proto3";

message User {
  int32 id = 1;
  string name = 2;
  repeated string tags = 3;
}

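The compiler generates a class in your language with serializeBinary / deserializeBinary (in Python, SerializeToString / ParseFromString). On the wire, fields are tagged by integer (the = 1, = 2); names never appear.

The user object from the JSON section is 43 bytes as compact JSON and 17 bytes as Protobuf. Parsing is also typically several times faster, though the exact factor depends on the language, the library, and the payload shape.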

// Class generated by protoc from user.proto
import { User } from "./generated/user_pb";

const u = new User();
u.setId(42);
u.setName("Ada");
u.setTagsList(["new", "vip"]);
const bytes = u.serializeBinary();   // 17 bytes
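
The byte count can be checked by hand. The sketch below writes the Protobuf wire format for this one message directly: each field is a tag (field number × 8 + wire type) followed by a varint or a length-prefixed string. The helpers varint, lengthDelimited and encodeUser are hypothetical names, and the sketch assumes non-negative integers and always writes every field, which generated code does not do for proto3 default values.

```typescript
// Base-128 varint: 7 bits per byte, low group first, high bit means "more follows".
function varint(n: number): number[] {
  const out: number[] = [];
  while (n > 0x7f) {
    out.push((n & 0x7f) | 0x80);
    n = Math.floor(n / 128);
  }
  out.push(n);
  return out;
}

const WIRE_VARINT = 0;
const WIRE_LEN = 2;

// A field tag is the field number and the wire type packed into one varint.
function fieldTag(fieldNumber: number, wireType: number): number[] {
  return varint(fieldNumber * 8 + wireType);
}

function lengthDelimited(fieldNumber: number, text: string): number[] {
  const bytes = Array.from(new TextEncoder().encode(text));
  return [...fieldTag(fieldNumber, WIRE_LEN), ...varint(bytes.length), ...bytes];
}

// Mirrors message User { int32 id = 1; string name = 2; repeated string tags = 3; }
function encodeUser(id: number, userName: string, tags: string[]): number[] {
  const out = [...fieldTag(1, WIRE_VARINT), ...varint(id), ...lengthDelimited(2, userName)];
  for (const t of tags) out.push(...lengthDelimited(3, t)); // one entry per repeated element
  return out;
}

const userBytes = encodeUser(42, "Ada", ["new", "vip"]);
const userHex = userBytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(userBytes.length, userHex);
// 17 "08 2a 12 03 41 64 61 1a 03 6e 65 77 1a 03 76 69 70"
```

Two bytes for id, five for name, five for each tag: no field name appears anywhere in the output.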

Schema evolution rules (this is the killer feature):

  • Never reuse a field number. If you remove a field, mark its number reserved.
  • Adding new fields is safe in either direction. Old code skips unknown fields; new code reading old data sees the new field's default value.
  • Don't change a field's type (mostly). int32 → int64 is safe; int32 → string is not.
  • Don't make a field required. Proto3 dropped required because a required field can never be safely removed or relaxed once it is deployed.
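
The "old code skips unknown fields" rule falls out of the wire format: every field carries its wire type, so a reader can step over a field it has never heard of. Below is a minimal sketch of an old reader that only knows id and name. UserV1, readVarint and decodeUserV1 are hypothetical names; only the varint and length-delimited wire types are handled, and unknown fields are dropped, whereas real Protobuf runtimes usually preserve them.

```typescript
interface UserV1 { id: number; name: string }

function readVarint(buf: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  let scale = 1;
  while (pos < buf.length) {
    const byte = buf[pos];
    pos += 1;
    value += (byte & 0x7f) * scale;
    if ((byte & 0x80) === 0) return { value, next: pos };
    scale *= 128;
  }
  throw new RangeError("truncated varint");
}

function decodeUserV1(buf: Uint8Array): UserV1 {
  const user: UserV1 = { id: 0, name: "" }; // defaults stand in for missing fields
  let pos = 0;
  while (pos < buf.length) {
    const key = readVarint(buf, pos);
    pos = key.next;
    const fieldNumber = Math.floor(key.value / 8);
    const wireType = key.value % 8;
    if (wireType === 0) {
      const v = readVarint(buf, pos);
      pos = v.next;
      if (fieldNumber === 1) user.id = v.value; // any other varint field is skipped
    } else if (wireType === 2) {
      const len = readVarint(buf, pos);
      pos = len.next;
      if (pos + len.value > buf.length) throw new RangeError("truncated field");
      if (fieldNumber === 2) {
        user.name = new TextDecoder().decode(buf.subarray(pos, pos + len.value));
      }
      pos += len.value; // advance past the payload whether or not we used it
    } else {
      throw new Error("wire type " + wireType + " is not handled in this sketch");
    }
  }
  return user;
}

// Written by newer code: id = 1, name = 2, tags = 3, plus a later varint field 4.
const fromNewWriter = decodeUserV1(Uint8Array.from([
  0x08, 0x2a, 0x12, 0x03, 0x41, 0x64, 0x61, 0x1a, 0x03, 0x6e, 0x65, 0x77, 0x20, 0x01,
]));
console.log(fromNewWriter); // { id: 42, name: "Ada" }
```

The same reader given an empty buffer returns the defaults, which is the other direction of the rule: a message that lacks a field still decodes.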

Use Protobuf when:

  • You control both client and server.
  • You're doing high-throughput RPC (gRPC).
  • You need predictable schema evolution.
  • Wire size or parse speed matters.

Avro

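Apache Avro pairs a compact binary encoding with a schema that is stored once rather than repeated in every record: in the header of an Avro data file, or in a schema registry when records travel individually (for example over Kafka). That suits huge data files, where the schema cost amortises across millions of rows, and it is why Avro is widespread in the Hadoop ecosystem.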

{ "type": "record", "name": "User", "fields": [
  { "name": "id", "type": "int" },
  { "name": "name", "type": "string" },
  { "name": "tags", "type": { "type": "array", "items": "string" } }
] }
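
Because the schema fixes the field order, an Avro record body carries no tags or names at all, only values. The sketch below encodes one record for the User schema above by hand: int and length prefixes are zigzag varints, and an array is a counted block followed by a zero terminator. zigzagVarint, avroString and avroUser are hypothetical helper names, and the sketch writes the whole array as a single block.

```typescript
// Zigzag maps signed integers to unsigned (0→0, -1→1, 1→2, ...) before varint encoding.
function zigzagVarint(n: number): number[] {
  let z = n >= 0 ? n * 2 : -n * 2 - 1;
  const out: number[] = [];
  while (z > 0x7f) {
    out.push((z & 0x7f) | 0x80);
    z = Math.floor(z / 128);
  }
  out.push(z);
  return out;
}

// A string is its byte length (as a zigzag varint) followed by UTF-8 bytes.
function avroString(text: string): number[] {
  const bytes = Array.from(new TextEncoder().encode(text));
  return [...zigzagVarint(bytes.length), ...bytes];
}

// Fields are written in schema order: id, name, tags.
function avroUser(id: number, userName: string, tags: string[]): number[] {
  const out = [...zigzagVarint(id), ...avroString(userName)];
  if (tags.length > 0) {
    out.push(...zigzagVarint(tags.length)); // block header: item count
    for (const t of tags) out.push(...avroString(t));
  }
  out.push(0); // a zero-count block ends the array
  return out;
}

const avroBytes = avroUser(42, "Ada", ["new", "vip"]);
const avroHex = avroBytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(avroBytes.length, avroHex);
// 15 "54 06 41 64 61 04 06 6e 65 77 06 76 69 70 00"
```

Without the writer's schema these 15 bytes cannot be interpreted, which is the trade the format makes for its density.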

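A schema registry (a separate service commonly deployed alongside Kafka, not part of Avro itself) tracks which reader and writer schemas are compatible. A consumer reading 2024 data with a 2026 schema gets automatic field migration as long as the changes followed the rules: Avro matches fields by name, fills fields missing from the old data with the reader schema's defaults, and ignores fields the reader no longer declares.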
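
The core of that migration can be sketched in a few lines. The version below works on already-decoded records and simplified field lists rather than real Avro schemas, and it leaves out type promotion and aliases; Field, Rec and resolve are hypothetical names, as are the legacyCode field and the two example schemas.

```typescript
interface Field { name: string; default?: unknown }
type Rec = Record<string, unknown>;

// Project a record written with writerFields onto the reader's view of the data.
function resolve(written: Rec, writerFields: Field[], readerFields: Field[]): Rec {
  const writerNames = new Set(writerFields.map((f) => f.name));
  const out: Rec = {};
  for (const field of readerFields) {
    if (writerNames.has(field.name)) {
      out[field.name] = written[field.name]; // matched by name
    } else if ("default" in field) {
      out[field.name] = field.default; // missing in old data: use the reader's default
    } else {
      throw new Error('reader field "' + field.name + '" has no default');
    }
  }
  return out; // writer-only fields are simply not copied
}

const writer2024: Field[] = [{ name: "id" }, { name: "name" }, { name: "legacyCode" }];
const reader2026: Field[] = [{ name: "id" }, { name: "name" }, { name: "tags", default: [] }];

const migrated = resolve({ id: 42, name: "Ada", legacyCode: "x" }, writer2024, reader2026);
console.log(JSON.stringify(migrated)); // {"id":42,"name":"Ada","tags":[]}
```

The error branch is the interesting one: a new reader field without a default is exactly the kind of change a compatibility check is meant to reject before deploy.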

Use Avro when:

  • You're in the Kafka / Hadoop / Spark ecosystem.
  • You write huge files (millions of rows of the same shape).
  • You need schema-evolution rules enforced centrally, not in client code.

Self-describing vs schema-required

The table above splits the formats into two camps:

  • Self-describing (JSON, MessagePack, CBOR) — the bytes contain enough metadata to be parsed without an external schema. You can print(decode(bytes)) and see something useful.
  • Schema-required (Protobuf, Avro, Cap'n Proto) — the bytes are dense; you need the schema to find the fields. Without it, you see hex.

Self-describing is great for ad-hoc API design, debugging, and one-off scripts. Schema-required wins when correctness, evolution, and bytes-on-the-wire matter — which is most production internal systems above a certain scale.

Schema evolution, in detail

The thing that separates "we'll figure it out" formats from "we want to deploy across 200 services without breaking" formats is schema evolution. Two directions:

  • Backward compatibility. Old clients must keep working when the server adds a new field. (New field is optional / has a default.)
  • Forward compatibility. New clients must keep working when the server hasn't been updated yet. (New client tolerates missing fields.)

In a large microservice deploy, you're never updating everyone simultaneously. You deploy A, then B, then C, so old and new versions of each service talk to each other mid-rollout, and every such pairing has to work. Both directions matter.

Schema-based formats encode this in the wire format itself. JSON/MessagePack expect you to enforce it in code (and you will mostly forget).
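
Enforcing it in code for JSON usually means a tolerant reader: ignore fields you don't know and default the ones that are missing. A minimal sketch, assuming a User shape with an optional tags field added in a later version; parseUser and the email field are hypothetical.

```typescript
interface User { id: number; name: string; tags: string[] }

function parseUser(json: string): User {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new TypeError("expected a JSON object");
  }
  const obj = raw as Record<string, unknown>;
  if (typeof obj.id !== "number" || typeof obj.name !== "string") {
    throw new TypeError("id and name are required");
  }
  // tags was added later: default to [] so messages from older writers still parse.
  const tags = Array.isArray(obj.tags)
    ? obj.tags.filter((t): t is string => typeof t === "string")
    : [];
  // Fields this reader doesn't know (for example "email") are never copied.
  return { id: obj.id, name: obj.name, tags };
}

const fromOldWriter = parseUser('{"id":42,"name":"Ada"}');
const fromNewerWriter = parseUser('{"id":42,"name":"Ada","tags":["vip"],"email":"ada@example.com"}');
console.log(fromOldWriter.tags, Object.keys(fromNewerWriter));
// [] [ 'id', 'name', 'tags' ]
```

Every reader of every message type needs this discipline by hand, which is the part that tends to be forgotten.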

Picking the right format

A practical decision tree:

  1. Public HTTP API for third parties? JSON. Don't think.
  2. Internal RPC, you control both ends, throughput matters? Protobuf via gRPC.
  3. Internal RPC, you control both ends, throughput doesn't matter? JSON is fine.
  4. Huge data files, batch jobs, schema discipline? Avro (in Hadoop / Kafka) or Parquet (columnar; not on this list because it's analytics-shaped).
  5. JSON shape but smaller bytes, no schema infra? MessagePack or CBOR.
  6. Mobile / embedded with extreme size constraints? CBOR or Protobuf.
  7. You need zero-copy reads? Cap'n Proto or Flatbuffers.

Common bugs

JSON Number precision. Large IDs (9007199254740993) are silently rounded by JS's Number, which is exact only up to 2^53 - 1. Send them as strings, or use a parser that supports BigInt.
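
A short demonstration of the failure and the string workaround, using only built-ins; the payloads and variable names are made up for illustration.

```typescript
// The ID on the wire is 2^53 + 1, which a double cannot represent.
const numericWire = '{"id": 9007199254740993}';
const roundedId = (JSON.parse(numericWire) as { id: number }).id;
console.log(String(roundedId));               // "9007199254740992" (off by one, no error)
console.log(Number.isSafeInteger(roundedId)); // false

// Sending the ID as a string keeps every digit; convert to BigInt if you need arithmetic.
const stringWire = '{"id": "9007199254740993"}';
const exactId = BigInt((JSON.parse(stringWire) as { id: string }).id);
console.log(exactId.toString());              // "9007199254740993"
```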

Field-number collisions in Protobuf. Two engineers add fields concurrently with the same number, and bytes silently misparse. Allocate numbers in one place (the .proto file, under code review), and mark removed numbers reserved so they are never reused.

Backward compatibility forgotten. A server adds a required field; old clients don't send it; the server starts returning 400s. Fix: never add required fields (the required label exists only in proto2 syntax); give every new field a default.

Schema registry skew. Two services pick up different versions of the registered schema; one writes data the other can't read. Pin schema versions per deploy.

JSON null vs absent. Some APIs treat { "name": null } and {} the same; others don't. Document which and stick to it.
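
If your API does distinguish the two, the check has to look at key presence, not just the value. A minimal sketch; Patch and describeName are hypothetical names, and the input is assumed to be a JSON object.

```typescript
type Patch = { name?: string | null };

function describeName(json: string): "absent" | "null" | "value" {
  const patch = JSON.parse(json) as Patch;
  if (!("name" in patch)) return "absent"; // key not sent at all
  return patch.name === null ? "null" : "value";
}

console.log(describeName("{}"));               // "absent"
console.log(describeName('{"name": null}'));   // "null"
console.log(describeName('{"name": "Ada"}'));  // "value"

// Note: JSON.stringify drops undefined properties, so undefined becomes "absent" on the wire.
console.log(JSON.stringify({ name: undefined })); // "{}"
```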

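Round-tripping floats. JSON.parse(JSON.stringify(0.1 + 0.2)) is 0.30000000000000004: the round trip preserves the double exactly, binary rounding error included. Don't use JSON numbers for exact decimals such as money; send strings or integer minor units instead.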
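
The point in code, with integer minor units as one possible workaround; the cents field is a hypothetical convention, not something the format prescribes.

```typescript
const floatSum = 0.1 + 0.2;
const roundTripped = JSON.parse(JSON.stringify(floatSum)) as number;

console.log(JSON.stringify(floatSum));  // "0.30000000000000004"
console.log(roundTripped === floatSum); // true: the round trip itself loses nothing
console.log(floatSum === 0.3);          // false: the error was there before serialising

// Integer cents survive arithmetic and the wire exactly.
const totalCents = 10 + 20;
const moneyWire = JSON.stringify({ cents: totalCents });
console.log(moneyWire);                 // {"cents":30}
```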

Practical decisions

  • For a fresh public API: JSON over HTTPS, schema documented in OpenAPI.
  • For a fresh internal RPC: gRPC + Protobuf, schema in version control.
  • For migrating from JSON to binary: MessagePack first (zero schema work), then Protobuf if bytes still hurt.
  • For data lakes: Parquet (columnar) for analytics, Avro for streaming.
  • For everything: pin schema versions, never reuse field numbers, forbid required.