Which AI models do you use?

That depends on the problem. For complex reasoning chains I use Claude's latest models. For high-volume classification tasks OpenAI's latest model is more cost-efficient. For situations where data must not leave the network I use local models via Ollama or vLLM. Model choice is not a preference but a trade-off between quality, cost, and data constraints.

Can the tool run fully on-premise?

Yes. If data sovereignty is a requirement, I build the tool so that no data goes to external servers. That means a local model, a local vector store, and an internal API. This is more complex but feasible. Model performance is usually somewhat lower than with cloud models but acceptable for most use cases.

How do you measure the accuracy of the tool?

During validation I compare the prototype output with the expected output on a set of real examples. I calculate an error rate per decision type. That number determines whether the tool is production-ready. After going live the system logs every output so deviations remain visible.
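A minimal sketch of how an error rate per decision type could be calculated. The dictionary keys (`decision_type`, `expected`, `actual`) and the example records are assumptions for illustration, not the format of any real validation set.

```python
from collections import defaultdict


def error_rate_per_type(examples):
    """Return {decision_type: error_rate} for a list of validation examples.

    Each example is a dict with the hypothetical keys
    'decision_type', 'expected', and 'actual'.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ex in examples:
        totals[ex["decision_type"]] += 1
        if ex["actual"] != ex["expected"]:
            errors[ex["decision_type"]] += 1
    return {t: errors[t] / totals[t] for t in totals}


# Made-up validation records, for illustration only.
validation_set = [
    {"decision_type": "field_extraction", "expected": "A12", "actual": "A12"},
    {"decision_type": "field_extraction", "expected": "B07", "actual": "B70"},
    {"decision_type": "format_choice", "expected": "csv", "actual": "csv"},
    {"decision_type": "format_choice", "expected": "xml", "actual": "xml"},
]
rates = error_rate_per_type(validation_set)
```

Splitting the rate by decision type shows which kind of decision is holding the tool back, which a single overall accuracy figure would hide.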

How do you handle hallucinations?

By limiting the scope of the task and structuring the output. A model that is free to answer any way it likes hallucinates more than one that may only fill in a single field of a JSON object based on explicit context. I use structured outputs where possible and have the model indicate when it is uncertain, so those cases are handled separately.
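A sketch of what handling uncertain cases separately could look like on the receiving side. The field names `value` and `uncertain` are hypothetical. The idea is that anything that is not valid JSON in exactly the agreed shape goes to manual review instead of being trusted.

```python
import json

# Hypothetical schema: the model may only return these two fields.
ALLOWED_FIELDS = {"value", "uncertain"}


def parse_model_reply(raw_reply):
    """Return (value, needs_review) for a raw model reply.

    A reply is accepted only if it is a JSON object with exactly the
    fields 'value' and 'uncertain' (a boolean). Everything else is
    flagged for manual review.
    """
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None, True  # free text instead of JSON
    if not isinstance(data, dict) or set(data) != ALLOWED_FIELDS:
        return None, True  # missing or extra fields
    if not isinstance(data["uncertain"], bool):
        return None, True  # flag has the wrong type
    return data["value"], data["uncertain"]
```

For example, `parse_model_reply('{"value": "NL", "uncertain": false}')` is accepted as-is, while a free-text reply such as `"I think it is NL"` is routed to review.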

What does the tool cost per month in production?

That depends on the volume and the model. With 80 files per day via a cloud model you are looking at tens of euros per month in API costs, depending on file size and token count. With a local model there are no per-token costs, only server capacity. I make this clear in the quote.
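As a rough worked example of that estimate: the token counts and per-token prices below are placeholders chosen for illustration, not the actual prices of any provider. Only the 80 files per day comes from the text above.

```python
def monthly_api_cost(files_per_day, days, input_tokens, output_tokens,
                     price_in_per_million, price_out_per_million):
    """Estimate monthly API cost from volume, token counts, and prices.

    Prices are expressed per one million tokens, in any one currency.
    """
    cost_per_file = (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    ) / 1_000_000
    return files_per_day * days * cost_per_file


# Placeholder numbers: 5,000 input and 500 output tokens per file,
# at an assumed 3.00 / 15.00 per million tokens.
estimate = monthly_api_cost(
    files_per_day=80,
    days=30,
    input_tokens=5_000,
    output_tokens=500,
    price_in_per_million=3.0,
    price_out_per_million=15.0,
)
print(round(estimate, 2))  # roughly 54 with these placeholder numbers
```

Because the cost scales linearly with file size and volume, doubling either input roughly doubles the monthly bill.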

How long does the process take from intake to production?

A proof of concept (POC) to validate the approach can be completed quickly. A full production system with logging, monitoring, and integrations takes longer. I only give a timeline after the intake, when the scope is clear. I do not make promises about delivery times before I know what needs to be built.

Custom AI tools built for your specific business problem

When a custom AI tool makes sense

Not every workflow calls for AI. An n8n flow or a simple script solves most repetitive tasks without the overhead of a language model. But there are situations where that does not work. Four signals that a custom AI tool is the right approach:

▸The task requires context: the correct output depends on information spread across the document or dataset, not on a fixed rule you can put in an if-statement.
▸The input varies too much for a regex or template: free text, PDF forms, emails in varying formats, images of receipts. Manual structuring is not feasible at scale.
▸The decisions are too granular for manual rules: routing support tickets to the right department based on content, tone, and urgency simultaneously, without a decision matrix being viable.
▸The volume is high but the work is too repetitive for a specialist: reviewing 200 quotes per week for missing clauses is work for an AI, not for a lawyer.

If any of these criteria apply, exploring a custom tool is worthwhile. Not for the sake of using AI, but because it is the only approach that actually solves the problem.

When a custom AI tool does not make sense

AI is not an answer to every problem. I do not build tools when:

▸The problem can be solved with a regular query, a spreadsheet formula, or an API integration without a reasoning layer.
▸There is no business case: 'we want to use AI' without a concrete, measurable problem behind it.
▸The goal is generating marketing content without specific domain knowledge or internal data as input. Anyone can do that themselves with ChatGPT.
▸The data quality is insufficient to run a model on: garbage in, garbage out applies to even the most advanced models.

My discovery approach: audit before code

Before I write a single line of code, I map out the manual task. That does not take much time but prevents me from building the wrong tool. The steps:

▸Intake: describe the process step by step. What is the input, what is the desired output, what decisions are made in between?
▸Error analysis: where does it go wrong manually? Where do errors occur, steps get skipped, or things take longer than they should?
▸POC in 1 to 3 iterations: I build a working prototype on real data from your organisation. No demo data, no ideal cases.
▸Validation: you or a team member evaluates the output of the prototype. Only then do I decide whether the model is suitable or needs adjustment.
▸Iteration: after validation I adjust the prompts, the pipeline, or the architecture based on what was wrong. No big release after months, but small steps that demonstrably work.

This trajectory is deliberately short and concrete. If the prototype turns out to be useless in the first iteration, I stop there. That is more honest than continuing to develop on a flawed foundation.

Which models I use and why

Model choice depends on the problem, not on what is most discussed at any given moment. My standard considerations:

▸Claude's latest models (Anthropic): for tasks where reasoning over longer context matters, such as cross-referencing a document for inconsistencies or generating structured output with multiple dependent fields.
▸OpenAI's latest model (OpenAI): for applications where cost per token weighs more heavily and the task does not require complex reasoning chains. Widely used for high-volume classification tasks.
▸Local Llama variants (Ollama or vLLM): for situations where data must not leave the system. On-premise, no API calls to external servers, fully within your own network or on your own server.
▸Embeddings and vector store: for RAG applications where the model draws context from your own document base rather than its training data.

I do not align myself with a single provider. The model that best solves the problem at the lowest ongoing cost wins.
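The considerations above can be read as a simple decision rule. The sketch below is a deliberate simplification with hypothetical names. A real choice also weighs measured quality and ongoing cost on the actual task.

```python
def choose_backend(data_must_stay_local, needs_long_context_reasoning):
    """Pick a model family from two yes/no constraints.

    The return values are illustrative labels, not model identifiers.
    """
    if data_must_stay_local:
        # The data constraint overrides everything else.
        return "local-llama"
    if needs_long_context_reasoning:
        return "claude"
    # High-volume, simpler tasks where cost per token dominates.
    return "openai"
```

The data constraint is checked first because it is a hard requirement, while the other two are trade-offs.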

Architecture I typically use

Depending on the tool the stack can vary, but most custom AI tools I build share a similar foundation:

▸Frontend: Next.js for web interfaces, or no UI if the tool runs headless via API or cron.
▸Backend: FastAPI (Python) for pipelines with many AI calls, Node.js for integrations with existing systems.
▸State and storage: PostgreSQL for structured data and decision history, S3-compatible storage for files.
▸Vector store: pgvector in Postgres for straightforward RAG setups, or Qdrant for larger collections with more search options.
▸Orchestration: LangChain or direct API calls, depending on whether the complexity justifies a framework.
▸Logging: every AI decision is logged with input, output, and reason. That is not optional, it is standard.
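A minimal sketch of a decision log entry containing input, output, and reason. The record fields and the JSON Lines format are assumptions that keep the example self-contained. In a stack like the one above, the same record could just as well be a row in PostgreSQL.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    input_ref: str   # hypothetical: reference to the processed input
    output: str      # what the model decided
    reason: str      # the model's stated reason
    timestamp: str   # UTC time of the decision, ISO 8601


def log_decision(stream, input_ref, output, reason):
    """Append one decision as a JSON line to a writable text stream."""
    record = DecisionRecord(
        input_ref=input_ref,
        output=output,
        reason=reason,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    stream.write(json.dumps(asdict(record)) + "\n")
    return record


# Usage with an in-memory stream; a file opened in append mode works the same.
buffer = io.StringIO()
log_decision(buffer, "file_0042.csv", "weight=120", "value taken from header row")
logged = json.loads(buffer.getvalue())
```

One line per decision keeps the log append-only and easy to review or filter afterwards.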

-- Anonymous case A

Context-aware file editing at an SMB in the logistics sector

An SMB in the logistics sector processed daily batches of files where the same types of modifications were needed each time: read two or three specific fields, calculate or adjust a context-dependent value, save the file in a different format, and upload it to an external system. The complication was in the word 'context-dependent': the correct value depended on information elsewhere in the file, not on a fixed rule.

A regular script could not handle that. I built an AI bot that reads each file, extracts the relevant context, makes decisions about the modifications based on that context, applies the changes, converts the file to the target format, and uploads it. The bot logs every decision with its reason, so a team member can review what was done and where the bot was uncertain.

80+ files per day

Manual actions

Complete decision log

The bot runs fully on-premise. On unexpected input it sends a notification instead of silently failing. The team member who previously did this manually now uses that time for work that requires human judgement.
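The pattern of sending a notification instead of failing silently could be sketched as follows. The field names, the `process_file` step, and the `notify` callback are hypothetical stand-ins, not the actual bot.

```python
class UnexpectedInput(Exception):
    """Raised when a file does not match the expected shape."""


REQUIRED_FIELDS = ("order_id", "weight")  # hypothetical field names


def process_file(record):
    """Validate one parsed file and return the converted result."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise UnexpectedInput("missing fields: " + ", ".join(missing))
    return {"order_id": record["order_id"], "weight_kg": float(record["weight"])}


def run_batch(files, notify):
    """Process every file; report problems via notify and keep going."""
    processed = {}
    for name, record in files.items():
        try:
            processed[name] = process_file(record)
        except (UnexpectedInput, ValueError) as exc:
            notify(f"{name}: {exc}")  # surfaced to a person, never swallowed
    return processed


# Usage: collect notifications in a list instead of sending them anywhere.
notifications = []
result = run_batch(
    {"a.json": {"order_id": "1", "weight": "12.5"}, "b.json": {"order_id": "2"}},
    notifications.append,
)
```

One bad file does not stop the batch, and every skipped file leaves a visible trace for the team member who reviews the run.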

-- Anonymous case B

Local RAG over internal wiki at a knowledge organisation

A knowledge organisation had an extensive internal wiki with procedures, policy documents, and technical manuals. The problem: employees asked colleagues the same questions that were already answered in the wiki, because searching it was too slow and too inaccurate. The information was available but not findable.

I built a local RAG pipeline on a server within the organisation's network. All wiki documents are indexed as embeddings in a vector store. Through a simple chat interface an employee asks a question and receives an answer with direct references to the relevant pages. No data leaves the network: the model runs locally on a Llama variant via Ollama.
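The retrieval step of such a pipeline can be illustrated with a toy example. The page names and three-dimensional vectors below are made up. In a real setup the vectors would come from an embedding model and be stored in a vector store, and the retrieved pages would then be passed to the model as context.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_pages(query_vec, index, k=2):
    """Return the k page references most similar to the query vector."""
    ranked = sorted(
        index,
        key=lambda item: cosine(query_vec, item["vector"]),
        reverse=True,
    )
    return [item["page"] for item in ranked[:k]]


# Made-up index: one toy vector per wiki page.
index = [
    {"page": "wiki/leave-policy", "vector": [0.9, 0.1, 0.0]},
    {"page": "wiki/vpn-setup", "vector": [0.0, 0.2, 0.9]},
    {"page": "wiki/expense-claims", "vector": [0.7, 0.6, 0.1]},
]
query = [1.0, 0.2, 0.0]  # stands in for the embedded question
results = top_pages(query, index)
```

Returning page references rather than raw text is what allows the answer to link back to the wiki as the source of truth.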

1,200+ documents indexed

External API calls

< 3s search time per query

The system is not a replacement for the wiki but a layer on top of it. The wiki remains the source of truth. The RAG tool makes that source accessible without anyone needing to know exactly where something is stored.

What I do not do

There are things I deliberately keep out of scope:

▸Out-of-the-box ChatGPT API integrations where the only goal is having 'AI' on the website. That does not solve any problem.
▸AI features as a demo or pilot with no plan for production deployment. A prototype that ends up in a drawer is wasted money.
▸Training custom models on small datasets. Fine-tuning only makes sense with thousands of examples and a specific reason why a foundation model falls short.
▸Systems where the output cannot be verified. Every AI decision in production needs an audit trail. Without that, I do not go live.

Cost and scope

Custom AI tools vary considerably in size. A POC to validate whether the approach works is different from a production system with monitoring and logging. I provide a quote after the intake based on the actual scope.

On request

No fixed packages. The price depends on the complexity of the task, the number of iterations needed for validation, and whether you want a retainer for maintenance and monitoring after delivery.
