Building RAG Systems That Don't Hallucinate
Retrieval-augmented generation is easy to demo and hard to trust. Here is what separates a toy from a system you can put in front of customers.
Practical writing on applied AI, GPU compute, platforms and the craft of shipping software that survives contact with production.
Agent demos run a perfect path once. Production agents face the other 200 paths. Here is how to design for the ones the demo never showed.
Do you need a dedicated vector database, or is your existing one enough? A practical look at what these systems actually do and when they earn their keep.
Fine-tuning feels like the serious option. Most of the time it is the expensive answer to a question prompting already solved. Here is how to tell them apart.
You cannot ship what you cannot measure. Evaluating generative systems is harder than traditional software testing — and skipping it is how good demos become bad products.
GPU sticker prices get the headlines, but the bill that matters is utilisation, power, and idle time. A field guide to what AI compute really costs.
The race for ever-larger models grabbed the headlines. The more consequential trend may be the opposite: small models good enough to run on a phone.
A model that works in a notebook is a science project. A model that serves real traffic reliably is an engineering system. Bridging the two is what MLOps is for.
When your application takes instructions in plain language, attackers can write instructions too. Prompt injection is the vulnerability class that traditional security never prepared us for.
Quantization shrinks a model by storing its numbers with less precision. Done well, it cuts memory and cost dramatically while barely touching quality. Here is the intuition and the tradeoffs.
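To make "storing its numbers with less precision" concrete, here is a minimal sketch of one common scheme, symmetric int8 quantization with a single scale per tensor. The function names and the sample weights are illustrative assumptions, not taken from any particular library; real toolchains usually add per-channel scales, calibration data, and packed storage.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale.

    Assumption: symmetric, per-tensor quantization (one scale for all values).
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.98, 0.45, 0.003, -0.27]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case rounding error is about half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight now fits in one byte instead of four (for float32), and the reconstruction error is bounded by roughly half the scale. The tradeoff is that a single large outlier stretches the scale and costs precision for every other value.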