Binary Data

Bytes, buffers, byte order — and the line between text and bytes.

Most of the time you're working with strings, numbers, lists, structs. Sometimes the data is raw — packets off a socket, pixels from an image decoder, bytes off a memory-mapped file. Different languages give it different names — Buffer in Node, bytes in Python, []byte in Go, Uint8Array in browsers — but the underlying object is the same: a flat sequence of bytes you can index, slice, and read multi-byte values from.

Analogy

Picture a printer's tray of metal type. Each tiny block is a single character (a byte). Letters look like letters; punctuation looks like punctuation. A typesetter who knows English picks them up and arranges them. Now hand the tray to someone who only reads Cyrillic, and ask them to lay out the same words. They'll grab the same blocks — same shapes, same metal — and arrange them following completely different rules. The blocks (bytes) are unchanged; the interpretation (encoding) is everything. Switching encoders mid-page produces gibberish that looks like text and isn't.

Bytes are not characters

A byte is an 8-bit number, 0–255. A character is a concept — "the letter A", "the smiley emoji". You convert between them with an encoding.

const bytes = new TextEncoder().encode("café");   // Uint8Array(5) [99, 97, 102, 195, 169]
const text = new TextDecoder().decode(bytes);     // "café"

café is 4 characters. In UTF-8, it's 5 bytes — the é is the two-byte sequence [0xC3, 0xA9]. In Latin-1, it's 4 bytes (é = 0xE9). Same characters, different bytes. The encoding is the contract.
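
To make "same characters, different bytes" concrete, here is a small sketch that decodes the UTF-8 bytes of café two ways. The decodeLatin1 helper is a hypothetical name introduced for illustration; it relies on Latin-1 mapping each byte value directly to the code point with the same number.

```typescript
// UTF-8 bytes for "café" (the é is written as an escape to keep the source unambiguous).
const utf8Bytes = new TextEncoder().encode("caf\u00e9");

// Hypothetical helper: Latin-1 maps byte value N straight to code point N.
function decodeLatin1(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

const asUtf8 = new TextDecoder("utf-8").decode(utf8Bytes); // "café", 4 characters
const asLatin1 = decodeLatin1(utf8Bytes);                  // "cafÃ©", 5 characters

console.log(utf8Bytes.length, asUtf8, asLatin1);
```

The bytes never changed between the two calls; only the contract used to read them did.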

Languages all model "raw bytes" with a dedicated type, distinct from string:

Language   | Bytes type                             | Notes
JavaScript | Uint8Array, Buffer (Node)              | Buffer is a Uint8Array subclass
Python     | bytes (immutable), bytearray (mutable) | Strict separation since 3.0
Go         | []byte                                 | Slice of byte (alias for uint8)
Rust       | &[u8], Vec<u8>                         | String and &str are guaranteed UTF-8
Java       | byte[]                                 | Plus ByteBuffer for richer ops
C#         | byte[], Span<byte>                     | Memory<T> when the data must live on the heap

The rule that saves you: never let bytes and strings drift across an API boundary without an explicit encoding. Functions that take String should take String, not "bytes that might be UTF-8."
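
One way to apply this rule is to decode in exactly one place and fail loudly on bad input. The sketch below assumes the bytes are supposed to be UTF-8; decodeUtf8Strict and greet are hypothetical names, not part of any library.

```typescript
// Hypothetical boundary helper: the only place where bytes become a string.
// fatal: true makes TextDecoder throw a TypeError on malformed UTF-8
// instead of silently substituting U+FFFD replacement characters.
function decodeUtf8Strict(bytes: Uint8Array): string {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Hypothetical API that works on text only; callers must decode first.
function greet(userName: string): string {
  return `hello, ${userName}`;
}

const goodBytes = new Uint8Array([0x41, 0x64, 0x61]); // "Ada" in UTF-8
console.log(greet(decodeUtf8Strict(goodBytes)));

const badBytes = new Uint8Array([0x41, 0xff]); // 0xff never appears in valid UTF-8
try {
  decodeUtf8Strict(badBytes);
} catch (err) {
  console.log("rejected at the boundary:", (err as Error).name);
}
```

Everything past the boundary can then assume it holds a real string, and the encoding is named in one spot.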

Endianness

Multi-byte values (a 32-bit integer, a 64-bit float) have to be laid out in some byte order. There are two conventions:

  • Big-endian (network byte order): most-significant byte at the lowest address. 0xDEADBEEF → [0xDE, 0xAD, 0xBE, 0xEF].
  • Little-endian (host byte order on x86 and, in practice, almost all ARM systems): least-significant byte at the lowest address. 0xDEADBEEF → [0xEF, 0xBE, 0xAD, 0xDE].

Within a single CPU it doesn't matter — register loads and stores use the same convention. It matters when bytes leave the machine: writing to disk, sending over a network, sharing across architectures.

const buf = new ArrayBuffer(4);
const view = new DataView(buf);

view.setUint32(0, 0xdeadbeef, false);   // false = big-endian (default)
// bytes: de ad be ef

view.setUint32(0, 0xdeadbeef, true);    // true = little-endian
// bytes: ef be ad de

Classic Internet protocols (TCP/IP headers, DNS) use big-endian, which is why it is called network byte order. Not every format follows suit: BSON and Protobuf's fixed-width types are little-endian, so check the spec. Standard helpers, such as htonl / ntohl (C), socket.htons (Python), binary.BigEndian.Uint32 (Go) and DataView (JS), make sure your code does the right thing regardless of which architecture you're on.
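
As a rough illustration of what such helpers do, the sketch below reads a big-endian u32 with shifts alone, so the result cannot depend on the host. readUint32BE and hostIsLittleEndian are hypothetical names chosen for this example.

```typescript
// Read a big-endian (network order) u32 using shifts only.
// The result does not depend on the host CPU's byte order.
function readUint32BE(bytes: Uint8Array, offset = 0): number {
  return (
    ((bytes[offset] << 24) |
      (bytes[offset + 1] << 16) |
      (bytes[offset + 2] << 8) |
      bytes[offset + 3]) >>> 0 // >>> 0 reinterprets the 32-bit result as unsigned
  );
}

// Typed arrays wider than one byte use host order, so this reveals it.
function hostIsLittleEndian(): boolean {
  return new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;
}

const wire = new Uint8Array([0xde, 0xad, 0xbe, 0xef]);
console.log(readUint32BE(wire).toString(16)); // "deadbeef"
console.log(new DataView(wire.buffer).getUint32(0, false) === readUint32BE(wire)); // true
console.log("host is little-endian:", hostIsLittleEndian());
```

In real code DataView is usually the better choice; the manual version is only here to show that byte order is a property of how you read, not of the bytes.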

Buffers vs typed arrays vs views

In JS specifically, the model is layered:

ArrayBuffer        // an opaque chunk of N raw bytes
  ├─ Uint8Array    // interprets the buffer as N unsigned 8-bit ints
  ├─ Int32Array    // interprets the buffer as N/4 signed 32-bit ints (host order)
  ├─ Float64Array  // interprets the buffer as N/8 doubles
  └─ DataView      // explicit get/set with byte order + offset control

Multiple views can share the same ArrayBuffer. Modify through one, see the change through any other. This is how zero-copy parsing of binary protocols works — point a DataView at the bytes you received from a socket and read fields by offset.
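
A minimal sketch of that sharing, using only the standard ArrayBuffer, Uint8Array and DataView types (the variable names are illustrative):

```typescript
const sharedBuf = new ArrayBuffer(4);
const asBytes = new Uint8Array(sharedBuf); // byte-level view
const asFields = new DataView(sharedBuf);  // field-level view over the same memory

// Write through the DataView, observe through the Uint8Array.
asFields.setUint32(0, 0x01020304, false);
console.log(Array.from(asBytes)); // [1, 2, 3, 4]

// Write through the Uint8Array, observe through the DataView.
asBytes[3] = 0xff;
console.log(asFields.getUint32(0, false).toString(16)); // "10203ff"
```

No bytes were copied at any point; both objects are windows onto the same four bytes.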

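function parsePacket(bytes: Uint8Array) {
  // Respect byteOffset: `bytes` may itself be a view into a larger buffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const payloadLen = view.getUint32(6, false);   // read once, reuse below
  if (10 + payloadLen > bytes.byteLength) {
    throw new RangeError("payload length exceeds packet size");
  }
  return {
    magic: view.getUint32(0, false),       // big-endian
    flags: view.getUint16(4, false),
    payloadLen,
    payload: bytes.subarray(10, 10 + payloadLen),   // zero-copy view, not a copy
  };
}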
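
The three-argument DataView constructor in parsePacket matters whenever the Uint8Array is itself a window into a larger buffer. The following sketch, with made-up byte values, shows what goes wrong when byteOffset is ignored:

```typescript
const backing = new Uint8Array([0xaa, 0xbb, 0x00, 0x2a]);
const tail = backing.subarray(2); // view over the last two bytes; tail.byteOffset is 2

// Ignoring byteOffset reads from the start of the backing buffer.
const wrongValue = new DataView(tail.buffer).getUint16(0, false); // 0xaabb

// Passing byteOffset and byteLength reads from the start of the view.
const rightValue = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  .getUint16(0, false); // 0x002a

console.log(wrongValue.toString(16), rightValue.toString(16)); // "aabb" "2a"
```

This bug tends to hide in tests, where inputs are often freshly allocated arrays with an offset of zero.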

Slicing and zero-copy

Most byte-array APIs offer two operations that look similar:

  • Slice / view (zero-copy): returns a new object that points at the same underlying memory. Cheap; modifying the original mutates the slice.
  • Copy: allocates fresh memory and copies the contents. Safe; expensive.

const orig = new Uint8Array([1, 2, 3, 4, 5]);

const sub = orig.subarray(1, 4);   // zero-copy view: [2, 3, 4]
sub[0] = 99;                        // orig is now [1, 99, 3, 4, 5]

const copy = orig.slice(1, 4);     // independent copy: [99, 3, 4]
copy[0] = 0;                        // orig unchanged

In Go, s[1:4] is a slice header pointing into the original backing array, so the same trap applies. In Python, slicing a bytes object gives you independent data (bytes is immutable, so aliasing cannot bite), and slicing a bytearray returns a copy too; wrap the object in memoryview when you want a zero-copy slice. Different languages, different defaults: read the docs of the type you're using.

Reading and writing structured binary

Three idiomatic patterns.

Field-at-a-time with a typed view:

view.setUint32(0, magic, false);
view.setUint16(4, flags, false);
view.setUint32(6, payloadLen, false);
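
Put together, a field-at-a-time encoder for the same 10-byte header layout (u32 magic, u16 flags, u32 payload length) might look like the sketch below. encodePacket, HEADER_BYTES and the magic value are assumptions made for illustration.

```typescript
const HEADER_BYTES = 10; // u32 magic + u16 flags + u32 payloadLen

// Hypothetical encoder: writes the header big-endian, then appends the payload.
function encodePacket(magic: number, flags: number, payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(HEADER_BYTES + payload.length);
  const view = new DataView(out.buffer);
  view.setUint32(0, magic, false);
  view.setUint16(4, flags, false);
  view.setUint32(6, payload.length, false);
  out.set(payload, HEADER_BYTES); // copy payload bytes in after the header
  return out;
}

const packet = encodePacket(0xcafebabe, 0x0001, new Uint8Array([1, 2, 3]));
console.log(Array.from(packet, (b) => b.toString(16).padStart(2, "0")).join(" "));
// ca fe ba be 00 01 00 00 00 03 01 02 03
```

Deriving the length field from payload.length rather than taking it as a parameter removes one way for header and body to disagree.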

Pack / unpack format strings (Python's struct, Lua's string.pack):

import struct
header = struct.pack("!IHI", magic, flags, payload_len)
m, f, p = struct.unpack("!IHI", data)   # ! = network order, I = u32, H = u16

Code generation from a schema (Cap'n Proto, FlatBuffers, Protobuf):

protoc --go_out=. message.proto
# generated code reads / writes the binary format with no offsets to remember

For wire protocols you control, schema-based codegen is the safest. For everything else (existing protocols, on-disk formats, low-level performance), a typed view + manual offsets gets you the most control.

Common bugs

Treating bytes as a string. The 4-byte big-endian integer 5, [0x00, 0x00, 0x00, 0x05], becomes "" in any NUL-terminated string API (C's strlen and strcpy stop at the first zero byte). Truncation, silent corruption, hours of debugging.

Wrong byte order. A 32-bit integer 0x00000001 written little-endian shows up as 0x01000000 (16,777,216) when read big-endian. Common bug when porting between architectures or talking to a wire protocol the other side controls.
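
The mismatch can be reproduced in a few lines with DataView, writing with one convention and reading with the other (variable names are illustrative):

```typescript
const scratch = new DataView(new ArrayBuffer(4));

scratch.setUint32(0, 1, true);               // writer: little-endian, bytes 01 00 00 00
const misread = scratch.getUint32(0, false); // reader assumes big-endian
const correct = scratch.getUint32(0, true);  // reader matches the writer

console.log(misread, correct); // 16777216 1
```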

Sign bugs at the byte boundary. A C char may be signed or unsigned (implementation-defined!). Reading byte 0xff and printing it as a number gives 255 or -1. Java has only signed bytes; you have to mask with & 0xff to read it as 0–255.
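
JavaScript exposes both readings of the same byte through Uint8Array and Int8Array, which makes the effect easy to see. This sketch also shows the & 0xff mask and, as an extra, the shift pair that goes the other way:

```typescript
const rawByte = new Uint8Array([0xff]);

const unsignedValue = rawByte[0];                    // 255
const signedValue = new Int8Array(rawByte.buffer)[0]; // -1, same byte read as signed

// Masking recovers the 0..255 reading from a signed byte (the Java idiom).
const masked = signedValue & 0xff;                   // 255

// Shifting up and back down sign-extends an unsigned byte to -128..127.
const signExtended = (unsignedValue << 24) >> 24;    // -1

console.log(unsignedValue, signedValue, masked, signExtended); // 255 -1 255 -1
```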

Aliasing with typed arrays. Two Uint8Arrays that share a buffer will see each other's writes. Useful when you mean it; a footgun when you don't.

Slice vs copy. Returning a slice into a buffer the caller might recycle. The slice quietly contains different bytes a millisecond later. Document whether your API hands back a copy or a view.
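
A small reproduction of the recycled-buffer problem, assuming a hypothetical receive function that reuses one buffer for every message:

```typescript
const recvBuffer = new Uint8Array(4); // hypothetical receive buffer, reused for every message

// Copies the message into the shared buffer and returns a view of it.
function receive(message: number[]): Uint8Array {
  recvBuffer.set(message);
  return recvBuffer.subarray(0, message.length); // a view, not a copy
}

const firstView = receive([1, 2, 3, 4]);
const firstCopy = firstView.slice(); // independent snapshot taken in time
receive([9, 9, 9, 9]);               // the buffer is recycled for the next message

console.log(Array.from(firstView)); // [9, 9, 9, 9]
console.log(Array.from(firstCopy)); // [1, 2, 3, 4]
```

Either convention is workable; what hurts is a caller assuming one while the API provides the other.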

Hex / base64 confusion. Buffer.from('deadbeef') treats the argument as UTF-8 text (the default encoding) and yields 8 bytes, not the 4 bytes the hex digits describe. Buffer.from('deadbeef', 'hex') is what you want.
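
The same distinction can be shown without Node's Buffer. In this sketch bytesFromHex is a hypothetical helper written for the example, not a built-in:

```typescript
// Hypothetical helper: parse a hex string into the bytes it describes.
function bytesFromHex(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error("expected an even number of hex digits");
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

const asTextBytes = new TextEncoder().encode("deadbeef"); // 8 bytes: the characters 'd', 'e', 'a', ...
const asHexBytes = bytesFromHex("deadbeef");              // 4 bytes: de ad be ef

console.log(asTextBytes.length, asHexBytes.length); // 8 4
```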

Practical decisions

  • For raw bytes in TypeScript / JS, prefer Uint8Array + DataView over Node's Buffer for portable code.
  • For wire protocols you design, use an established serialization format (Protobuf, MessagePack, CBOR); it handles byte order for you, and schema-based ones such as Protobuf also give you a versioning story.
  • For wire protocols someone else designed, use the language's usual manual encoder (struct in Python, encoding/binary in Go, the byteorder crate or the integer from_be_bytes / to_be_bytes methods in Rust).
  • Always document the encoding at every API boundary that takes or returns bytes.
  • When in doubt, write the byte sequence out as hex and verify it matches the spec by eye.
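
For the last point, a hex dump helper is only a few lines. The sketch below uses a hypothetical toHex function and a made-up 6-byte header to show the idea:

```typescript
// Hypothetical helper: render bytes as space-separated two-digit hex.
function toHex(bytes: Uint8Array): string {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// Made-up layout for illustration: u32 magic followed by a u16 count, both big-endian.
const headerView = new DataView(new ArrayBuffer(6));
headerView.setUint32(0, 0xdeadbeef, false);
headerView.setUint16(4, 5, false);

console.log(toHex(new Uint8Array(headerView.buffer))); // "de ad be ef 00 05"
```

Comparing that string against the byte layout in a spec catches most byte-order and offset mistakes before they reach the wire.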