Binary Data
Bytes, buffers, byte order — and the line between text and bytes.
Most of the time you're working with strings, numbers, lists, structs. Sometimes the data is raw — packets off a socket, pixels from an image decoder, bytes off a memory-mapped file. Different languages give it different names — Buffer in Node, bytes in Python, []byte in Go, Uint8Array in browsers — but the underlying object is the same: a flat sequence of bytes you can index, slice, and read multi-byte values from.
Analogy
Picture a printer's tray of metal type. Each tiny block is a single character (a byte). Letters look like letters; punctuation looks like punctuation. A typesetter who knows English picks them up and arranges them. Now hand the tray to someone who only reads Cyrillic, and ask them to lay out the same words. They'll grab the same blocks (same shapes, same metal) and arrange them following completely different rules. The blocks (bytes) are unchanged; the interpretation (encoding) is everything. Switching encodings mid-page produces gibberish that looks like text and isn't.
Bytes are not characters
A byte is an 8-bit number, 0–255. A character is a concept — "the letter A", "the smiley emoji". You convert between them with an encoding.
const bytes = new TextEncoder().encode("café"); // Uint8Array(5) [99, 97, 102, 195, 169]
const text = new TextDecoder().decode(bytes); // "café"
café is 4 characters. In UTF-8, it's 5 bytes — the é is the two-byte sequence [0xC3, 0xA9]. In Latin-1, it's 4 bytes (é = 0xE9). Same characters, different bytes. The encoding is the contract.
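To make the contract concrete, here is a minimal sketch that encodes café as UTF-8 and then decodes the same bytes as if they were Latin-1. The helper `decodeLatin1` is a made-up name for illustration; it relies on the fact that each Latin-1 byte value equals the Unicode code point of its character.

```typescript
// Hypothetical helper: decode bytes as Latin-1 (ISO-8859-1).
// Each Latin-1 byte value is also the Unicode code point of its character.
function decodeLatin1(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

const utf8Bytes = new TextEncoder().encode("café"); // 5 bytes, é = C3 A9
const roundTrip = new TextDecoder("utf-8").decode(utf8Bytes); // "café"
const mojibake = decodeLatin1(utf8Bytes); // "cafÃ©": C3 -> Ã, A9 -> ©
```

The bytes never changed between the two decodes; only the assumed encoding did.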
Most languages model "raw bytes" with a dedicated type, distinct from the string type:
| Language | Bytes type | Notes |
|---|---|---|
| JavaScript | Uint8Array, Buffer (Node) | Buffer is a Uint8Array subclass |
| Python | bytes (immutable), bytearray (mutable) | Strict separation since 3.0 |
| Go | []byte | Slice of byte (alias for uint8) |
| Rust | &[u8], Vec<u8> | String and &str are guaranteed UTF-8 |
| Java | byte[] | Plus ByteBuffer for richer ops |
| C# | byte[], Span<byte> | Memory<T> for owned spans |
The rule that saves you: never let bytes and strings drift across an API boundary without an explicit encoding. Functions that take String should take String, not "bytes that might be UTF-8."
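One way to enforce that rule at a boundary is to decode strictly and fail on anything that is not valid UTF-8. The sketch below assumes the input is supposed to be UTF-8; `decodeUtf8Strict` is a hypothetical helper name, while the `fatal` option is part of the standard TextDecoder API.

```typescript
// Hypothetical boundary helper: turn bytes into a string, or fail loudly.
// { fatal: true } makes TextDecoder throw a TypeError on malformed UTF-8
// instead of silently substituting U+FFFD replacement characters.
function decodeUtf8Strict(bytes: Uint8Array): string {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Valid UTF-8 for "café".
const valid = decodeUtf8Strict(new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]));

// Latin-1 bytes for "café": 0xE9 alone is not a valid UTF-8 sequence.
let rejected = false;
try {
  decodeUtf8Strict(new Uint8Array([0x63, 0x61, 0x66, 0xe9]));
} catch (err) {
  rejected = true;
}
```

Past this point the rest of the code can work with a string and never has to ask which encoding the bytes were in.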
Endianness
Multi-byte values (a 32-bit integer, a 64-bit float) have to be laid out in some byte order. There are two conventions:
- Big-endian (network byte order): most-significant byte at the lowest address. 0xDEADBEEF → [0xDE, 0xAD, 0xBE, 0xEF].
- Little-endian (host byte order on x86 and on most ARM systems): least-significant byte at the lowest address. 0xDEADBEEF → [0xEF, 0xBE, 0xAD, 0xDE].
Within a single CPU it doesn't matter — register loads and stores use the same convention. It matters when bytes leave the machine: writing to disk, sending over a network, sharing across architectures.
const buf = new ArrayBuffer(4);
const view = new DataView(buf);
view.setUint32(0, 0xdeadbeef, false); // false = big-endian (default)
// bytes: de ad be ef
view.setUint32(0, 0xdeadbeef, true); // true = little-endian
// bytes: ef be ad de
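The convention for Internet protocols (IP, TCP, and UDP headers, DNS) is big-endian, called network byte order. Not every format follows it: BSON and Protobuf's fixed-width types (fixed32, fixed64) are little-endian, so check the spec before assuming. Standard helpers such as htonl / ntohl (C), socket.htons (Python), binary.BigEndian.Uint32 (Go), and DataView (JS) make sure your code does the right thing regardless of which architecture you're on. Note that plain typed arrays like Uint32Array always use host order, so they are not a substitute for DataView here.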
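What those helpers do under the hood can be sketched by hand. The two functions below are illustrative names, not library APIs: one assembles a big-endian 32-bit value from individual bytes, the other probes the host's byte order through a typed array.

```typescript
// Read a big-endian (network order) u32 without DataView.
// ">>> 0" converts the signed 32-bit result of the bitwise ops to unsigned.
function readUint32BE(bytes: Uint8Array, offset: number): number {
  if (offset < 0 || offset + 4 > bytes.length) {
    throw new RangeError("need 4 bytes at offset " + offset);
  }
  return (
    ((bytes[offset] << 24) |
      (bytes[offset + 1] << 16) |
      (bytes[offset + 2] << 8) |
      bytes[offset + 3]) >>> 0
  );
}

// Detect host byte order: store a u16 in host order, then look at byte 0.
function hostIsLittleEndian(): boolean {
  const probe = new Uint16Array([0x0102]);
  return new Uint8Array(probe.buffer)[0] === 0x02;
}

const wire = new Uint8Array([0xde, 0xad, 0xbe, 0xef]);
const value = readUint32BE(wire, 0); // 0xdeadbeef on any host
```

Because `readUint32BE` works on byte values rather than memory layout, its result does not depend on what `hostIsLittleEndian()` returns.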
Buffers vs typed arrays vs views
In JS specifically, the model is layered:
ArrayBuffer // an opaque chunk of N raw bytes
├─ Uint8Array // interprets the buffer as N unsigned 8-bit ints
├─ Int32Array // interprets the buffer as N/4 signed 32-bit ints (host order)
├─ Float64Array // interprets the buffer as N/8 doubles
└─ DataView // explicit get/set with byte order + offset control
Multiple views can share the same ArrayBuffer. Modify through one, see the change through any other. This is how zero-copy parsing of binary protocols works — point a DataView at the bytes you received from a socket and read fields by offset.
function parsePacket(bytes: Uint8Array) {
  // Scope the view to exactly the bytes this Uint8Array covers.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const payloadLen = view.getUint32(6, false);
  return {
    magic: view.getUint32(0, false), // big-endian
    flags: view.getUint16(4, false),
    payloadLen,
    // subarray is a zero-copy view; use slice if the caller needs its own copy
    payload: bytes.subarray(10, 10 + payloadLen),
  };
}
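One detail in parsePacket is easy to miss: the DataView is built with bytes.byteOffset and bytes.byteLength, not just bytes.buffer. A Uint8Array may be a window into a larger ArrayBuffer (Node's Buffer instances often are), and a DataView over the bare buffer starts at the wrong place. A small sketch of the difference, with made-up variable names:

```typescript
// A 6-byte backing store; the "packet" we care about starts at byte 2.
const backing = new Uint8Array([0xaa, 0xbb, 0x01, 0x02, 0x03, 0x04]);
const packetBytes = backing.subarray(2); // zero-copy window: [01, 02, 03, 04]

// Wrong: ignores the window and reads from the start of the whole buffer.
const naive = new DataView(packetBytes.buffer).getUint16(0, false); // 0xaabb

// Right: the view covers exactly the bytes the Uint8Array covers.
const scoped = new DataView(
  packetBytes.buffer,
  packetBytes.byteOffset,
  packetBytes.byteLength
).getUint16(0, false); // 0x0102
```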
Slicing and zero-copy
Most byte-array APIs offer two operations that look similar:
- Slice / view (zero-copy): returns a new object that points at the same underlying memory. Cheap, but a write through either object is visible through the other.
- Copy — allocates fresh memory and copies the contents. Safe; expensive.
const orig = new Uint8Array([1, 2, 3, 4, 5]);
const sub = orig.subarray(1, 4); // zero-copy view: [2, 3, 4]
sub[0] = 99; // orig is now [1, 99, 3, 4, 5]
const copy = orig.slice(1, 4); // independent copy: [99, 3, 4]
copy[0] = 0; // orig unchanged
In Go, s[1:4] is a slice header pointing into the original backing array, so it has the same trap. In Python, slicing a bytes object always copies (because bytes is immutable), and slicing a bytearray returns a copy too; wrap either one in a memoryview when you want a zero-copy slice. Different languages, different defaults: read the docs of the type you're using.
Reading and writing structured binary
Three idiomatic patterns.
Field-at-a-time with a typed view:
view.setUint32(0, magic, false);
view.setUint16(4, flags, false);
view.setUint32(6, payloadLen, false);
Pack / unpack format strings (Python's struct, Lua's string.pack):
import struct
header = struct.pack("!IHI", magic, flags, payload_len)
m, f, p = struct.unpack("!IHI", data) # ! = network order, I = u32, H = u16
Code generation from a schema (Cap'n Proto, FlatBuffers, Protobuf):
protoc --go_out=. message.proto
# generated code reads / writes the binary format with no offsets to remember
For wire protocols you control, schema-based codegen is the safest. For everything else (existing protocols, on-disk formats, low-level performance), a typed view + manual offsets gets you the most control.
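As a sketch of the typed-view approach end to end, the function below writes the same 10-byte header layout used earlier (magic, flags, payload length, all big-endian) followed by the payload. `encodePacket` and `HEADER_SIZE` are illustrative names, and the layout is an assumption carried over from the parsing example rather than any real protocol.

```typescript
// Hypothetical header layout, big-endian throughout:
// offset 0: magic (u32) | offset 4: flags (u16) | offset 6: payloadLen (u32)
const HEADER_SIZE = 10;

function encodePacket(magic: number, flags: number, payload: Uint8Array): Uint8Array {
  const out = new Uint8Array(HEADER_SIZE + payload.length);
  const header = new DataView(out.buffer); // out is freshly allocated, so offset is 0
  header.setUint32(0, magic, false);
  header.setUint16(4, flags, false);
  header.setUint32(6, payload.length, false);
  out.set(payload, HEADER_SIZE); // copy the payload in after the header
  return out;
}

const encoded = encodePacket(0xcafebabe, 0x0001, new Uint8Array([7, 8, 9]));
// bytes: ca fe ba be | 00 01 | 00 00 00 03 | 07 08 09
```

The header portion is the same ten bytes that struct.pack("!IHI", ...) would produce for these values, which is a handy cross-check when two implementations need to agree.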
Common bugs
Treating bytes as a string. A 4-byte big-endian integer such as [0x00, 0x00, 0x00, 0x05] becomes "" in NUL-terminated string APIs, because its first byte is the terminator. The result is truncation, silent corruption, and hours of debugging.
Wrong byte order. A 32-bit integer 0x00000001 written little-endian shows up as 0x01000000 (16,777,216) when read big-endian. Common bug when porting between architectures or talking to a wire protocol the other side controls.
Sign bugs at the byte boundary. A C char may be signed or unsigned (implementation-defined!). Reading byte 0xff and printing it as a number gives 255 or -1. Java has only signed bytes; you have to mask with & 0xff to read it as 0–255.
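JavaScript has its own version of this bug. The same byte reads differently through signed and unsigned typed arrays, and the bitwise operators work on signed 32-bit integers, so a high byte shifted into the top position comes out negative. A short sketch, with variable names chosen only for illustration:

```typescript
const rawBytes = new Uint8Array([0xff, 0x80, 0x7f]);
const signedBytes = new Int8Array(rawBytes.buffer); // same memory, read as signed

const asUnsigned = rawBytes[0]; // 255
const asSigned = signedBytes[0]; // -1
const masked = signedBytes[0] & 0xff; // 255 again, the same mask Java needs

// Bitwise operators produce signed 32-bit results.
const shifted = 0xff << 24; // -16777216
const fixedUp = (0xff << 24) >>> 0; // 4278190080, i.e. 0xff000000
```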
Aliasing with typed arrays. Two Uint8Arrays that share a buffer will see each other's writes. Useful when you mean it; a footgun when you don't.
Slice vs copy. Returning a slice into a buffer the caller might recycle. The slice quietly contains different bytes a millisecond later. Document whether your API hands back a copy or a view.
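The failure mode is easy to reproduce. In this sketch a hypothetical `receive` function reuses one scratch buffer for every message and returns a view into it; the caller's view changes as soon as the next message arrives, while a copy taken in time does not.

```typescript
// Hypothetical reader that reuses one scratch buffer for every message.
const scratch = new Uint8Array(4);

function receive(message: number[]): Uint8Array {
  scratch.set(message);
  return scratch.subarray(0, message.length); // a view into shared memory
}

const firstView = receive([1, 2, 3, 4]);
const firstCopy = firstView.slice(); // snapshot taken before the next read

receive([9, 9, 9, 9]); // the buffer is recycled for the next message

// firstView now reads [9, 9, 9, 9]; firstCopy still reads [1, 2, 3, 4].
```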
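Hex / base64 confusion. Buffer.from('deadbeef') reads the argument as UTF-8 text and gives you 8 bytes, not hex. Buffer.from('deadbeef', 'hex') gives the 4 bytes you want. The same applies to base64: pass 'base64' as the second argument, or the characters are stored as text.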
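The same distinction can be shown without Buffer. The helper below is a hypothetical, portable hex parser written for this example; it is compared against encoding the same characters as text.

```typescript
// Hypothetical portable helper: parse a hex string into bytes without Buffer.
function hexToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || !/^[0-9a-fA-F]*$/.test(hex)) {
    throw new Error("expected an even-length string of hex digits");
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

const asText = new TextEncoder().encode("deadbeef"); // 8 bytes: 64 65 61 64 62 65 65 66
const asHex = hexToBytes("deadbeef"); // 4 bytes: de ad be ef
```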
Practical decisions
- For raw bytes in TypeScript / JS, prefer Uint8Array + DataView over Node's Buffer for portable code.
- For wire protocols you design, use an established format, either schema-based (Protobuf) or self-describing (MessagePack, CBOR). They settle byte order for you and make versioning easier.
- For wire protocols someone else designed, use the language's usual manual encoder (struct in Python, encoding/binary in Go, the byteorder crate or the built-in from_be_bytes / to_be_bytes methods in Rust).
- Always document the encoding at every API boundary that takes or returns bytes.
- When in doubt, write the byte sequence out as hex and verify it matches the spec by eye.
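For the last point, a hex dump helper is only a few lines. `toHex` below is a hypothetical name for a debugging aid, not a standard API:

```typescript
// Hypothetical debugging helper: format bytes as space-separated hex pairs.
function toHex(bytes: Uint8Array): string {
  const parts: string[] = [];
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    parts.push((b < 16 ? "0" : "") + b.toString(16)); // zero-pad to two digits
  }
  return parts.join(" ");
}

const sample = new Uint8Array(4);
new DataView(sample.buffer).setUint32(0, 0xdeadbeef, false); // big-endian
const dump = toHex(sample); // "de ad be ef"
```

Printing the dump next to the byte layout in the spec makes byte-order and offset mistakes visible at a glance.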