Foreign Function Interfaces
Calling C from Python, Node, Java, Go — and the cost at the boundary.
Every popular language eventually grows a way to call out to C. Numerical kernels (BLAS, LAPACK), image and video codecs (libjpeg, libvpx), cryptography (OpenSSL, libsodium), database drivers (libpq), and most of POSIX live in C libraries, many of which predate your language by decades. The bridge between your runtime and those libraries is the Foreign Function Interface (FFI), and crossing it carelessly is one of the most expensive things you can do in a tight loop.
Analogy
Think of customs at an international border. The fastest way to move a single grape across is to walk through customs holding the grape. The fastest way to move ten thousand grapes is not to walk back and forth ten thousand times — it's to load them all into one truck, declare them all in one form, and cross once. FFI calls work the same way: per-call overhead dominates if you cross often, vanishes if you batch. The grape is your data; the truck is a vector / array / batch API; customs is the runtime boundary.
What "the boundary" actually costs
A managed runtime (Python, Node, JVM) maintains conventions that C doesn't share — a garbage collector, an object header, exception state, a green-thread scheduler, possibly a JIT mid-compilation. When you call a C function:
1. Lock the runtime state. The GIL in CPython, the single JS thread and its handle scope in Node, GC root tracking in the JVM.
2. Marshal arguments. Convert managed objects to C-friendly types: Python str → UTF-8 char*, Java String → jstring, JS Number → double.
3. Save GC roots, so that a GC during the C call doesn't move objects under our feet.
4. Cross. The actual native-to-native call.
5. Marshal return values. Convert back.
6. Unlock. Re-enter the managed world.
Steps 2 and 5 — marshalling — are usually the bulk of the cost. A simple int add(int, int) round-trip from Python via ctypes is roughly 300 ns. From Node via N-API, roughly 100–200 ns. From Go via cgo, roughly 100–500 ns. The more complex the type, the more it costs.
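A rough way to see this overhead on your own machine is to time a trivial C function against its pure-Python equivalent. The sketch below assumes a POSIX system where `ctypes.CDLL(None)` exposes libc symbols; the helper name `per_call_ns` is illustrative, and the printed numbers will vary with hardware and Python version.

```python
import ctypes
import timeit

# Assumption: POSIX. CDLL(None) opens the current process, which includes libc.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def per_call_ns(fn, arg, number=200_000):
    """Average wall-clock time of fn(arg) in nanoseconds (includes lambda overhead)."""
    total_seconds = timeit.timeit(lambda: fn(arg), number=number)
    return total_seconds / number * 1e9


ffi_ns = per_call_ns(libc.abs, -42)   # crosses the boundary on every call
builtin_ns = per_call_ns(abs, -42)    # stays inside the interpreter

print(f"ctypes abs : {ffi_ns:7.1f} ns/call")
print(f"builtin abs: {builtin_ns:7.1f} ns/call")
```

Both measurements include the cost of the Python-level lambda, so the difference between them is a better estimate of the boundary cost than either number alone.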
The standard bridges
| Host language | Standard FFI | Notes |
|---|---|---|
| Python | ctypes (built-in), cffi | cffi is more modern, easier on memory layout |
| Node.js | N-API (a.k.a. Node-API) | Stable ABI; older NAN / direct V8 bindings are not ABI-stable across Node versions |
| Java | JNI | Verbose; Project Panama's Foreign Function & Memory API (java.lang.foreign) is the modern alternative |
| Go | cgo | Easy syntactically, but each call switches off the goroutine stack, so measure |
| Rust | extern "C" + bindgen | bindgen auto-generates raw (unsafe) declarations from C headers; safe wrappers are written on top |
| C# / .NET | P/Invoke | [DllImport] attribute; very low overhead, often under 50 ns |
Each one looks superficially the same (a little glue code that names a foreign function and its argument and return types), but the constraints differ wildly.
Examples
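```python
# Python via cffi (preferred over ctypes for new code).
from cffi import FFI

ffi = FFI()
# Declare only the functions we call; the signatures mirror libsodium's headers.
ffi.cdef("""
    int sodium_init(void);
    int crypto_aead_xchacha20poly1305_ietf_encrypt(
        unsigned char *c, unsigned long long *clen,
        const unsigned char *m, unsigned long long mlen,
        const unsigned char *ad, unsigned long long adlen,
        const unsigned char *nsec,
        const unsigned char *npub,
        const unsigned char *k);
""")
# The path is installation-specific; adjust it for your system.
lib = ffi.dlopen("/usr/local/lib/libsodium.so")
# sodium_init() returns -1 on failure, so check it instead of ignoring it.
if lib.sodium_init() < 0:
    raise RuntimeError("libsodium failed to initialise")
```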
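```javascript
// Node.js via N-API: usually you don't write the C++ by hand;
// you use a higher-level library like `node-bindgen` or `napi-rs`.
// Compiled addons (.node files) are loaded with require(); by default
// an ES `import` of a .node file is rejected.
const { add } = require("./build/Release/addon.node");
console.log(add(40, 2)); // 42
```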
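```go
// Go via cgo: beware of goroutine stack switches.
package fast // package name is illustrative

// #include <stdlib.h>
// #include "fast.h"
import "C"
import "unsafe"

func ComputeHash(data []byte) uint32 {
	if len(data) == 0 { // &data[0] would panic on an empty slice
		return uint32(C.compute_hash(nil, 0))
	}
	// unsafe, but typical FFI shape:
	return uint32(C.compute_hash((*C.char)(unsafe.Pointer(&data[0])), C.int(len(data))))
}
```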
```rust
// Rust via extern "C": bindgen typically generates the prototype.
#[link(name = "fast")]
extern "C" {
    fn compute_hash(data: *const u8, len: usize) -> u32;
}

pub fn hash(data: &[u8]) -> u32 {
    // SAFETY: pointer and length come from a slice that outlives the call.
    unsafe { compute_hash(data.as_ptr(), data.len()) }
}
```
Marshalling, in detail
When the host language and C disagree on representation, somebody has to translate.
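| Type pair | What happens |
|---|---|
| Python str ↔ C char* | Encoded to UTF-8 bytes on the way in, decoded on the way out (allocation!) |
| Node Buffer ↔ C void* | Pointer pass-through, no copy |
| Java String ↔ C char* (via jstring) | GetStringUTFChars converts UTF-16 to modified UTF-8, usually with a copy; GetStringCritical may expose the UTF-16 data without one |
| Go string ↔ C *C.char | C.CString allocates a NUL-terminated copy that must be released with C.free |
| NumPy ndarray ↔ C double* | Pointer pass-through if contiguous with a matching dtype, else copy |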
Numerical libraries lean heavily on Numpy / Arrow precisely because the array is already in a C-compatible layout. No marshalling, full speed.
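The difference between a marshalled type and a pass-through type can be seen with the standard library alone. The sketch below is illustrative and assumes a POSIX system (`ctypes.CDLL(None)`); it uses `array.array` as a stand-in for a NumPy array because both store their elements in a contiguous C buffer.

```python
import array
import ctypes

# Assumption: POSIX, where CDLL(None) exposes libc's strlen.
libc = ctypes.CDLL(None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Case 1: a str has to be marshalled. encode() allocates a new bytes object,
# and C sees bytes, not characters.
text = "héllo"
encoded = text.encode("utf-8")
byte_len = libc.strlen(encoded)

# Case 2: array.array already holds C doubles contiguously, so ctypes can
# wrap the same memory without copying.
samples = array.array("d", [1.0, 2.0, 3.0])
view = (ctypes.c_double * len(samples)).from_buffer(samples)
same_memory = ctypes.addressof(view) == samples.buffer_info()[0]

view[0] = 9.0  # a write through the C view is visible from Python

print(len(text), byte_len)      # characters vs. bytes seen by C
print(same_memory, samples[0])
```

The string crosses as a fresh copy with a different length than Python reports, while the array crosses as the same address.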
The batching rule
If you find yourself FFI-calling process_one(item) in a loop of n items, your throughput is bounded by the boundary cost. Replace it with process_many(items, n) and the cost amortises across n:
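```python
from ctypes import POINTER, c_double

# Assumes `rows` is a contiguous float64 NumPy array and `lib` a loaded C library.
# Bad: one boundary crossing per item, ~100 ns × 10⁶ = 100 ms wasted on customs
for row in rows:
    lib.transform(row)

# Good: one crossing for the whole array, almost free
lib.transform_batch(rows.ctypes.data_as(POINTER(c_double)), len(rows))
```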
This is a large part of why NumPy and pandas are fast. Many fast Python data tools are essentially NumPy (or a similar native array library) at the bottom plus a thin Python API.
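The amortisation argument can be written down as a small cost model. This is a back-of-envelope sketch, not a benchmark: the 100 ns boundary cost comes from the example above, while the 5 ns of real work per item and the function names are assumptions chosen for illustration.

```python
def per_item_cost_ns(n, boundary_ns, work_ns):
    """Total cost when every item pays the boundary crossing."""
    return n * (boundary_ns + work_ns)


def batched_cost_ns(n, boundary_ns, work_ns):
    """Total cost when the boundary is crossed once for all n items."""
    return boundary_ns + n * work_ns


n = 1_000_000
boundary_ns = 100  # per-call overhead used in the text
work_ns = 5        # assumed cost of the actual C work per item

per_item_total = per_item_cost_ns(n, boundary_ns, work_ns)
batched_total = batched_cost_ns(n, boundary_ns, work_ns)

print(f"per-item: {per_item_total / 1e6:.1f} ms")
print(f"batched : {batched_total / 1e6:.1f} ms")
print(f"speed-up: {per_item_total / batched_total:.1f}x")
```

With these assumed numbers the per-item loop spends about 95% of its time at the boundary; the cheaper the real work per item, the larger the gain from batching.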
When FFI is the right answer
- Numerics: matrix math, FFTs, convolutions. Use NumPy / SciPy / PyTorch (which themselves use BLAS / LAPACK / cuBLAS via FFI).
- Crypto: never roll your own. libsodium, OpenSSL, BoringSSL — call the library.
- Multimedia: image / video / audio decoding. libjpeg-turbo, libvpx, ffmpeg.
- System APIs: POSIX, Win32. Languages with no native binding for an obscure syscall reach for FFI.
- Vendor SDKs: hardware drivers, TPMs, smartcards.
When FFI is the wrong answer
- The library is a few hundred lines of pure logic. Just rewrite it.
- You're calling 100 short functions per second. At that rate the overhead isn't a problem; the complexity of the binding is.
- The library has a dynamic memory model that fights yours (callbacks into the host language, finalizers in C). The bookkeeping is a bug factory.
What goes wrong
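Lifetime confusion. The C function holds a pointer to a buffer that the host GC then frees or moves. Now C reads garbage. Solutions: keep the buffer alive and pinned for as long as C holds the pointer (hold a reference in Python, use a pinned handle or critical section in a runtime with a moving GC), or copy. Note that numpy.ascontiguousarray only guarantees a contiguous layout; the resulting array still has to stay referenced.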
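In CPython, one way to see pinning in action is the buffer protocol: while a ctypes view is exported from a `bytearray`, the interpreter refuses to resize (and therefore reallocate) it. The sketch below assumes a POSIX libc reachable through `ctypes.CDLL(None)` and uses `memset` as a stand-in for any C function that writes through a pointer.

```python
import ctypes

# Assumption: POSIX, where CDLL(None) exposes libc's memset.
libc = ctypes.CDLL(None)
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p

buf = bytearray(8)
# from_buffer shares buf's memory and holds a buffer export on it.
view = (ctypes.c_char * len(buf)).from_buffer(buf)

libc.memset(view, ord("A"), len(buf))  # C writes straight into buf's storage

# While the view exists, resizing is refused, so the address C saw stays valid.
try:
    buf.extend(b"more")
    resized_while_pinned = True
except BufferError:
    resized_while_pinned = False

del view             # drop the export once C no longer needs the pointer
buf.extend(b"more")  # now resizing is allowed again

print(resized_while_pinned, bytes(buf))
```

The same discipline applies when the pointer is stored by the C library for later use: the Python-side owner has to outlive every C-side use of the address.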
Threading mismatch. A C library expects to be called from one thread at a time (or always from the same thread); you call it from a thread pool or from several workers. Solutions: lock around the FFI call, or confine all calls to one dedicated worker thread.
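A minimal sketch of the "lock around the FFI call" option, assuming a POSIX libc via `ctypes.CDLL(None)`. Here `abs` merely stands in for a library function that is not thread-safe, and `locked_abs` is an illustrative name. The lock matters because ctypes releases the GIL for the duration of a foreign call, so two Python threads really can be inside the C library at once.

```python
import ctypes
import threading

# Assumption: POSIX; abs() is a placeholder for a non-thread-safe C function.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

_ffi_lock = threading.Lock()


def locked_abs(value):
    """Serialise every entry into the C library behind one lock."""
    with _ffi_lock:
        return libc.abs(value)


results = []


def worker(start):
    local = [locked_abs(-i) for i in range(start, start + 100)]
    results.extend(local)  # one bulk append per worker


threads = [threading.Thread(target=worker, args=(k * 100,)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))
```

A single global lock is the simplest choice; a library that requires calls from one specific thread needs the dedicated-worker approach instead.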
Exception leak. A C function returns -1 on error; the host language doesn't know to raise. Solutions: every binding wraps the FFI call and raises on error.
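With ctypes, the wrapping can be attached to the function itself through its `errcheck` hook, so no call site can forget it. The sketch below assumes a POSIX libc and uses `chdir` only because it follows the "return -1 and set errno" convention; the helper name `raise_on_minus_one` is illustrative.

```python
import ctypes
import errno
import os

# use_errno=True makes ctypes capture errno right after each foreign call.
libc = ctypes.CDLL(None, use_errno=True)  # assumption: POSIX


def raise_on_minus_one(result, func, args):
    """errcheck hook: turn the C error convention into a Python exception."""
    if result == -1:
        code = ctypes.get_errno()
        raise OSError(code, os.strerror(code))
    return result


libc.chdir.argtypes = [ctypes.c_char_p]
libc.chdir.restype = ctypes.c_int
libc.chdir.errcheck = raise_on_minus_one

caught = None
try:
    libc.chdir(b"/no/such/directory/for/this/demo")
except OSError as exc:
    caught = exc

ok = libc.chdir(b".")  # success path: returns 0 and raises nothing

print(caught.errno == errno.ENOENT, ok)
```

Libraries with their own error reporting (status enums, a `last_error()` call) need a different check, but the shape stays the same: translate at the boundary, once.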
ABI drift. A C library updates and changes a struct layout. Your bindings, generated against the old header, segfault. Solutions: regenerate bindings on each release, run nightly fuzz tests.
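To make the failure mode concrete, the sketch below declares two versions of a hypothetical struct (`FrameV1` and `FrameV2` are invented for illustration) and compares their layouts. Bindings built against the first layout would read the `pixels` pointer from the wrong offset once the library ships the second.

```python
import ctypes


class FrameV1(ctypes.Structure):
    """Layout the bindings were generated against (hypothetical)."""
    _fields_ = [
        ("width", ctypes.c_int32),
        ("height", ctypes.c_int32),
        ("pixels", ctypes.c_void_p),
    ]


class FrameV2(ctypes.Structure):
    """Layout after the library inserted a field (hypothetical)."""
    _fields_ = [
        ("width", ctypes.c_int32),
        ("height", ctypes.c_int32),
        ("stride", ctypes.c_int32),
        ("pixels", ctypes.c_void_p),
    ]


def layout(struct):
    """Map each field name to its byte offset inside the struct."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}


print(layout(FrameV1), ctypes.sizeof(FrameV1))
print(layout(FrameV2), ctypes.sizeof(FrameV2))
```

A cheap guard is to compare `ctypes.sizeof` of the binding's struct against a size or version value reported by the library at load time, if the library exposes one.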
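Silent UTF-8 corruption. The modified UTF-8 that JNI's GetStringUTFChars returns isn't standard UTF-8. A character that takes 4 bytes in UTF-8, such as an emoji, comes out as 6 bytes (two 3-byte surrogates), and an embedded NUL comes out as 2 bytes. GetStringUTFRegion uses the same encoding, so it does not help. Solutions: read the UTF-16 code units with GetStringChars or GetStringRegion and convert them yourself, or call String.getBytes(StandardCharsets.UTF_8) on the Java side and pass a byte[].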
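The byte counts can be checked without a JVM. The helper below, `modified_utf8`, is an illustrative re-implementation of the encoding rules (each UTF-16 code unit is encoded on its own, and NUL becomes two bytes); it is not a JNI API.

```python
def modified_utf8(text):
    """Encode text the way the JVM's modified UTF-8 does (illustrative)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # NUL is stored as an overlong 2-byte form
        else:
            # Surrogate halves are encoded separately, 3 bytes each.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
standard = emoji.encode("utf-8")
modified = modified_utf8(emoji)

# A strict UTF-8 decoder rejects the surrogate-based form.
try:
    modified.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(len(standard), len(modified), decodes_as_utf8)
```

ASCII text is byte-identical in both encodings, which is why this bug tends to stay hidden until the first emoji or embedded NUL arrives.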
Practical decisions
- For Python, prefer cffi over ctypes for new bindings. Use pybind11 if you control the C++ side.
- For Node, use napi-rs (Rust → N-API) or node-bindgen. Avoid raw N-API unless you must.
- For Go, profile every cgo call. Replace tight loops with batched calls.
- For Rust, generate bindings with bindgen; wrap them in safe abstractions.
- Always benchmark before and after introducing FFI. The fast C library may not actually be the bottleneck.