Foreign Function Interfaces

Calling C from Python, Node, Java, Go — and the cost at the boundary.

Every popular language eventually grows a way to call out to C. Numerical kernels (BLAS, LAPACK), image and video codecs (libjpeg, libvpx), encryption (OpenSSL, libsodium), database drivers (libpq), and most of POSIX live in C libraries that pre-date your language by decades. The bridge between your runtime and those libraries is the Foreign Function Interface (FFI), and crossing it carelessly is one of the most expensive things you can do in a tight loop.

Analogy

Think of customs at an international border. The fastest way to move a single grape across is to walk through customs holding the grape. The fastest way to move ten thousand grapes is not to walk back and forth ten thousand times — it's to load them all into one truck, declare them all in one form, and cross once. FFI calls work the same way: per-call overhead dominates if you cross often, vanishes if you batch. The grape is your data; the truck is a vector / array / batch API; customs is the runtime boundary.

What "the boundary" actually costs

A managed runtime (Python, Node, JVM) maintains conventions that C doesn't share — a garbage collector, an object header, exception state, a green-thread scheduler, possibly a JIT mid-compilation. When you call a C function:

  1. Lock the runtime state. CPython holds the GIL, Node opens a V8 handle scope on the isolate's thread, and the JVM transitions the thread out of managed code.
  2. Marshal arguments. Convert managed objects to C-friendly types: Python str → char* (UTF-8), Java String → jstring, JS Number → double.
  3. Save GC roots. So a GC during the C call doesn't move objects under our feet.
  4. Cross. Actual native-to-native call.
  5. Marshal return values. Convert back.
  6. Unlock. Re-enter the managed world.

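Steps 2 and 5, the marshalling, are usually the bulk of the cost. A simple int add(int, int) round-trip from Python via ctypes is roughly 300 ns. From Node via N-API, roughly 100–200 ns. From Go via cgo, roughly 100–500 ns. These are order-of-magnitude figures that shift with hardware and runtime version (recent Go releases have reduced cgo overhead considerably), so measure on your own stack. The more complex the argument types, the higher the cost.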
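To see the per-call cost on your own machine, a rough micro-benchmark is enough. The sketch below assumes a POSIX system where ctypes.CDLL(None) exposes the C library; per_call_ns is a hypothetical helper name, and the numbers it prints will differ from the figures quoted above.

```python
import ctypes
import timeit

# Assumption: POSIX, where CDLL(None) gives access to the C library's symbols.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def per_call_ns(fn, calls=200_000):
    """Average wall-clock time of fn(-7), in nanoseconds per call."""
    seconds = timeit.timeit(lambda: fn(-7), number=calls)
    return seconds / calls * 1e9


if __name__ == "__main__":
    # Same trivial work on both sides; the difference is mostly boundary cost.
    print(f"builtin abs:         {per_call_ns(abs):.0f} ns/call")
    print(f"libc abs via ctypes: {per_call_ns(libc.abs):.0f} ns/call")
```

Both timings include the cost of the lambda wrapper, so compare the two lines with each other rather than reading either one as an absolute figure.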

The standard bridges

Host language   Standard FFI               Notes
Python          ctypes (built-in), cffi    cffi is more modern, easier on memory layout
Node.js         N-API (a.k.a. Node-API)    Stable ABI; NAN and direct V8 APIs require rebuilding per Node major version
Java            JNI                        Verbose; the Foreign Function & Memory API (Project Panama, java.lang.foreign) is the modern alternative
Go              cgo                        Easy syntactically, but each call switches from the goroutine stack to a system stack, so measure
Rust            extern "C" + bindgen       bindgen generates raw unsafe bindings from C headers; you write the safe wrappers
C# / .NET       P/Invoke                   [DllImport] attribute; very low overhead, often under 50 ns

Each one looks superficially the same (a little glue code that names a foreign function and its types), but the constraints differ wildly.

Examples

# Python via cffi (preferred over ctypes for new code).
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    int sodium_init(void);
    int crypto_aead_xchacha20poly1305_ietf_encrypt(
        unsigned char *c, unsigned long long *clen,
        const unsigned char *m, unsigned long long mlen,
        const unsigned char *ad, unsigned long long adlen,
        const unsigned char *nsec,
        const unsigned char *npub,
        const unsigned char *k);
""")

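lib = ffi.dlopen("/usr/local/lib/libsodium.so")  # adjust the path for your system
# sodium_init returns 0 on success, 1 if already initialised, -1 on failure.
if lib.sodium_init() < 0:
    raise RuntimeError("libsodium failed to initialise")

// Node.js via N-API: usually you don't write the C++ by hand,
// you use a higher-level library like `node-bindgen` or `napi-rs`.
// Native .node addons are normally loaded with require();
// from an ES module, use module.createRequire().
const { add } = require("./build/Release/addon.node");
console.log(add(40, 2)); // 42

// Go via cgo: beware of goroutine stack switches.
package fasthash // placeholder package name

// #include <stdlib.h>
// #include "fast.h"
import "C"

import "unsafe"

func ComputeHash(data []byte) uint32 {
    // data must be non-empty: &data[0] panics on an empty slice.
    // unsafe, but typical FFI shape:
    return uint32(C.compute_hash((*C.char)(unsafe.Pointer(&data[0])), C.int(len(data))))
}

// Rust via extern "C": bindgen typically generates the prototype.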
#[link(name = "fast")]
extern "C" {
    fn compute_hash(data: *const u8, len: usize) -> u32;
}

pub fn hash(data: &[u8]) -> u32 {
    unsafe { compute_hash(data.as_ptr(), data.len()) }
}

Marshalling, in detail

When the host language and C disagree on representation, somebody has to translate.

Type pair                    What happens
Python str ↔ C char*         Re-encoded as UTF-8 in each direction (allocation!)
Node Buffer ↔ C void*        Pointer pass-through, no copy
Java String ↔ C jstring      Opaque handle; GetStringUTFChars copies into modified UTF-8, GetStringCritical may expose the UTF-16 data without a copy
Go string ↔ C *C.char        C.CString allocates and null-terminates; the caller must C.free the result
NumPy ndarray ↔ C double*    Pointer pass-through if contiguous, else copy

Numerical libraries lean heavily on NumPy / Arrow precisely because the array is already in a C-compatible layout: no marshalling, just a pointer and a length.
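The difference between "re-encode and allocate" and "pointer pass-through" can be shown with the standard library alone. This is a minimal sketch, assuming a POSIX system; it uses array.array as a stand-in for a NumPy array, and c_strlen and as_c_doubles are hypothetical helper names.

```python
import array
import ctypes

# Assumption: POSIX, where CDLL(None) gives access to the C library's symbols.
libc = ctypes.CDLL(None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Marshalled path: str must be re-encoded into a new bytes object first."""
    encoded = text.encode("utf-8")  # allocation + copy on every call
    return libc.strlen(encoded)


def as_c_doubles(values: array.array):
    """Pass-through path: a ctypes array that shares memory with `values`."""
    return (ctypes.c_double * len(values)).from_buffer(values)


samples = array.array("d", [1.0, 2.0, 3.0])
view = as_c_doubles(samples)
view[0] = 42.0  # writes straight into the array's own buffer, no copy
```

After the last line, samples[0] is 42.0 because view and samples are the same memory. The string path, by contrast, hands C a fresh copy each time, and C counts UTF-8 bytes rather than characters.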

The batching rule

If you find yourself FFI-calling process_one(item) in a loop of n items, your throughput is bounded by the boundary cost. Replace it with process_many(items, n) and the cost amortises across n:

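from ctypes import POINTER, c_double

# Bad: one boundary crossing per item, ~100 ns × 10⁶ = 100 ms wasted on customs
for row in rows:
    lib.transform(row)

# Good: one crossing for the whole array, almost free.
# Assumes rows is a contiguous float64 NumPy array and lib was loaded with ctypes.
lib.transform_batch(rows.ctypes.data_as(POINTER(c_double)), len(rows))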

This is a large part of why NumPy and pandas are fast: the per-element work happens in C over whole arrays. Many fast Python data tools are just "NumPy at the bottom plus a thin Python API."
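The same amortisation can be observed without writing any C. The sketch below uses the standard zlib module, which is a C extension rather than a ctypes binding, but each zlib.crc32 call still crosses from Python into native code. The function names crc_per_item and crc_batched are hypothetical.

```python
import timeit
import zlib


def crc_per_item(data: bytes) -> int:
    """One native call per byte: the boundary is crossed len(data) times."""
    crc = 0
    for i in range(len(data)):
        crc = zlib.crc32(data[i:i + 1], crc)  # running checksum, one byte at a time
    return crc


def crc_batched(data: bytes) -> int:
    """One native call for the whole buffer."""
    return zlib.crc32(data)


if __name__ == "__main__":
    payload = bytes(range(256)) * 400  # about 100 KB of sample data
    slow = timeit.timeit(lambda: crc_per_item(payload), number=5)
    fast = timeit.timeit(lambda: crc_batched(payload), number=5)
    print(f"per-item: {slow:.4f} s, batched: {fast:.4f} s")
```

Both functions return the same checksum; only the number of crossings differs, which is what the timing gap reflects.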

When FFI is the right answer

  • Numerics: matrix math, FFTs, convolutions. Use Numpy / SciPy / PyTorch (which themselves use BLAS / LAPACK / cuBLAS via FFI).
  • Crypto: never roll your own. libsodium, OpenSSL, BoringSSL — call the library.
  • Multimedia: image / video / audio decoding. libjpeg-turbo, libvpx, ffmpeg.
  • System APIs: POSIX, Win32. Languages with no native binding for an obscure syscall reach for FFI.
  • Vendor SDKs: hardware drivers, TPMs, smartcards.

When FFI is the wrong answer

  • The library is a few hundred lines of pure logic. Just rewrite it.
  • You're only making about 100 short calls per second. The overhead isn't the problem; the complexity of the binding is, and it buys you little.
  • The library has a dynamic memory model that fights yours (callbacks into the host language, finalizers in C). The bookkeeping is a bug factory.

What goes wrong

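Lifetime confusion. The C function holds a pointer to a buffer that the host GC then frees or moves. Now C reads garbage. Solutions: keep a host-side reference alive (or pin the object) for as long as C holds the pointer, or copy the data into memory that C owns. Note that numpy.ascontiguousarray only guarantees layout, possibly by copying; it does not extend a lifetime.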

Threading mismatch. A C library expects to be called from one thread; you call it from a thread pool, from several goroutines, or from both the JS event loop and a worker. Solutions: lock around the FFI call, or confine the library to one dedicated thread.

Exception leak. A C function returns -1 on error; the host language doesn't know to raise. Solutions: every binding wraps the FFI call and raises on error.
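In ctypes, the wrapping can be attached to the foreign function itself through its errcheck hook, so no call site can forget it. This is a minimal sketch, assuming a POSIX system; it uses the C library's access() as the example function, and _raise_on_minus_one and check_exists are hypothetical names.

```python
import ctypes
import os

# use_errno=True makes ctypes preserve errno across the call for get_errno().
# Assumption: POSIX, where CDLL(None) gives access to the C library's symbols.
libc = ctypes.CDLL(None, use_errno=True)


def _raise_on_minus_one(result, func, args):
    """errcheck hook: turn the C convention 'return -1, set errno' into OSError."""
    if result == -1:
        code = ctypes.get_errno()
        raise OSError(code, os.strerror(code))
    return result


libc.access.argtypes = [ctypes.c_char_p, ctypes.c_int]
libc.access.restype = ctypes.c_int
libc.access.errcheck = _raise_on_minus_one


def check_exists(path: str) -> None:
    """Raise OSError (for example FileNotFoundError) if path is not accessible."""
    libc.access(os.fsencode(path), os.F_OK)
```

Calling check_exists on a missing path raises FileNotFoundError, because Python maps OSError to the matching subclass from the errno value.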

ABI drift. A C library updates and changes a struct layout. Your bindings, generated against the old header, segfault. Solutions: regenerate bindings on each release, run nightly fuzz tests.

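Silent UTF-8 corruption. JNI's GetStringUTFChars returns Java's modified UTF-8, which isn't standard UTF-8. Pass a 4-byte emoji through it and C sees 6 bytes (two 3-byte encoded surrogates), and an embedded NUL becomes 0xC0 0x80. Solutions: read the UTF-16 data with GetStringChars / GetStringRegion and convert it yourself, or call String.getBytes(StandardCharsets.UTF_8) on the Java side and pass the byte[]. GetStringUTFRegion does not help, because it also produces modified UTF-8.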
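The size mismatch is easy to reproduce outside the JVM. The sketch below imitates modified UTF-8 in Python by encoding each UTF-16 code unit separately. to_modified_utf8 is a hypothetical helper written for illustration, not part of any library.

```python
import struct


def to_modified_utf8(text: str) -> bytes:
    """Encode text the way JNI's modified UTF-8 does (illustrative sketch)."""
    out = bytearray()
    # Walk UTF-16 code units, so supplementary characters become surrogate pairs.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if unit == 0:
            out += b"\xc0\x80"  # NUL is written as an overlong 2-byte sequence
        else:
            # Each code unit, lone surrogates included, is encoded on its own.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
standard = emoji.encode("utf-8")    # 4 bytes
modified = to_modified_utf8(emoji)  # 6 bytes: two 3-byte encoded surrogates
```

A strict UTF-8 decoder rejects the 6-byte form, which is why the corruption tends to surface far from the FFI call that caused it.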

Practical decisions

  • For Python, prefer cffi over ctypes for new bindings. Use pybind11 if you control the C++ side.
  • For Node, use napi-rs (Rust → N-API) or node-bindgen. Avoid raw N-API unless you must.
  • For Go, profile every cgo call. Replace tight loops with batched calls.
  • For Rust, generate bindings with bindgen; wrap them in safe abstractions.
  • Always benchmark before and after introducing FFI. The fast C library may not actually be the bottleneck.