When a custom AI tool makes sense
Not every workflow calls for AI. An n8n flow or a simple script solves most repetitive tasks without the overhead of a language model. But there are situations where that does not work. Four signals that a custom AI tool is the right approach:
- ▸The task requires context: the correct output depends on information spread across the document or dataset, not on a fixed rule you can put in an if-statement.
- ▸The input varies too much for a regex or template: free text, PDF forms, emails in varying formats, images of receipts. Manual structuring is not feasible at scale.
- ▸The decisions are too granular for manual rules: routing support tickets to the right department based on content, tone, and urgency simultaneously, without a decision matrix being viable.
- ▸The volume is high but the work is too repetitive for a specialist: reviewing 200 quotes per week for missing clauses is work for an AI, not for a lawyer.
If any of these criteria apply, exploring a custom tool is worthwhile. Not for the sake of using AI, but because in these cases it is the only approach that actually solves the problem.
When a custom AI tool does not make sense
AI is not an answer to every problem. I do not build tools when:
- ▸The problem can be solved with a regular query, a spreadsheet formula, or an API integration without a reasoning layer.
- ▸There is no business case: 'we want to use AI' without a concrete, measurable problem behind it.
- ▸The goal is generating marketing content without specific domain knowledge or internal data as input. Anyone can do that themselves with ChatGPT.
- ▸The data quality is insufficient to run a model on: garbage in, garbage out applies to even the most advanced models.
My discovery approach: audit before code
Before I write a single line of code, I map out the manual task. That does not take much time but prevents me from building the wrong tool. The steps:
- ▸Intake: describe the process step by step. What is the input, what is the desired output, what decisions are made in between?
- ▸Error analysis: where does the manual process go wrong? Where do errors occur, which steps get skipped, and what takes longer than it should?
- ▸POC in 1 to 3 iterations: I build a working prototype on real data from your organisation. No demo data, no ideal cases.
- ▸Validation: you or a team member evaluates the output of the prototype. Only then do I decide whether the model is suitable or needs adjustment.
- ▸Iteration: after validation I adjust the prompts, the pipeline, or the architecture based on what was wrong. No big release after months, but small steps that demonstrably work.
This trajectory is deliberately short and concrete. If the prototype turns out to be useless in the first iteration, I stop there. That is more honest than continuing to develop on a flawed foundation.
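A minimal sketch of how the validation step could be recorded, assuming each reviewed case is simply marked accepted or rejected with a short note. The `Review` fields and the `summarise_reviews` helper are hypothetical names for illustration, not part of any described tooling.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One prototype output judged by a team member (hypothetical structure)."""
    case_id: str
    prototype_output: str
    accepted: bool
    note: str = ""


def summarise_reviews(reviews: list[Review]) -> dict:
    """Summarise a validation round and list the cases to fix next iteration."""
    total = len(reviews)
    rejected = [r for r in reviews if not r.accepted]
    accepted = total - len(rejected)
    return {
        "total": total,
        "accepted": accepted,
        "acceptance_rate": accepted / total if total else 0.0,
        "to_fix": [(r.case_id, r.note) for r in rejected],
    }
```

The `to_fix` list is the input for the next iteration: it shows which prompts or pipeline steps produced output the reviewer rejected, and why.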
Which models I use and why
Model choice depends on the problem, not on what is most discussed at any given moment. My standard considerations:
- ▸Claude's latest models (Anthropic): for tasks where reasoning over longer context matters, such as cross-referencing a document for inconsistencies or generating structured output with multiple dependent fields.
- ▸OpenAI's latest models: for applications where cost per token weighs more heavily and the task does not require complex reasoning chains. They are widely used for high-volume classification tasks.
- ▸Local Llama variants (Ollama or vLLM): for situations where data must not leave the system. On-premise, no API calls to external servers, fully within your network or server.
- ▸Embeddings and vector store: for RAG applications where the model draws context from your own document base rather than its training data.
I do not align myself with a single provider. The model that best solves the problem at the lowest ongoing cost wins.
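The considerations above can be sketched as a first shortlist function. This is an illustration under assumptions: the `TaskProfile` fields and the returned labels are hypothetical placeholders, not real model identifiers, and the final choice still depends on how well each candidate solves the problem and at what ongoing cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical description of a task, filled in during intake."""
    data_must_stay_on_premise: bool
    needs_long_context_reasoning: bool
    needs_own_documents: bool = False


def shortlist(task: TaskProfile) -> list[str]:
    """Return the building blocks to evaluate first (placeholder labels)."""
    choice: list[str] = []
    if task.data_must_stay_on_premise:
        # Data may not leave the network, so only local models qualify.
        choice.append("local-llama")
    elif task.needs_long_context_reasoning:
        choice.append("claude")
    else:
        # High volume, simple reasoning: cost per token dominates.
        choice.append("openai")
    if task.needs_own_documents:
        # RAG: the model draws context from the organisation's documents.
        choice.append("embeddings+vector-store")
    return choice
```

The on-premise check comes first on purpose: a data-residency requirement rules out external APIs regardless of how well they would handle the task.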
Architecture I typically use
Depending on the tool the stack can vary, but most custom AI tools I build share a similar foundation:
- ▸Frontend: Next.js for web interfaces, or no UI if the tool runs headless via API or cron.
- ▸Backend: FastAPI (Python) for pipelines with many AI calls, Node.js for integrations with existing systems.
- ▸State and storage: PostgreSQL for structured data and decision history, S3-compatible storage for files.
- ▸Vector store: pgvector in Postgres for straightforward RAG setups, or Qdrant for larger collections with more search options.
- ▸Orchestration: LangChain or direct API calls, depending on whether the complexity justifies a framework.
- ▸Logging: every AI decision is logged with input, output, and reason. That is not optional; it is standard.
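A minimal sketch of what logging a decision with input, output, and reason could look like. It assumes a JSON Lines file as a stand-in for the PostgreSQL decision history; the `DecisionRecord` fields and function names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class DecisionRecord:
    """One AI decision: what went in, what came out, and why (hypothetical schema)."""
    input_summary: str
    output: str
    reason: str
    model: str
    timestamp: str = ""


def log_decision(record: DecisionRecord, log_path: Path) -> None:
    """Append one decision as a single JSON line; stamp it in UTC if unstamped."""
    if not record.timestamp:
        record.timestamp = datetime.now(timezone.utc).isoformat()
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def read_decisions(log_path: Path) -> list[dict]:
    """Load all logged decisions, for example for a manual review."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending one self-contained line per decision keeps the log easy to review and easy to load into a database table later.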
-- Anonymous case A
Context-aware file editing at an SMB in the logistics sector
An SMB in the logistics sector processed daily batches of files where the same types of modifications were needed each time: read two or three specific fields, calculate or adjust a context-dependent value, save the file in a different format, and upload it to an external system. The complication was in the word 'context-dependent': the correct value depended on information elsewhere in the file, not on a fixed rule.
A regular script could not handle that. I built an AI bot that reads each file, extracts the relevant context, makes decisions about the modifications based on that context, applies the changes, converts the file to the target format, and uploads it. The bot logs every decision with its reason, so a team member can review what was done and where the bot was uncertain.
The bot runs fully on-premise. On unexpected input it sends a notification instead of silently failing. The team member who previously did this manually now uses that time for work that requires human judgement.
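The control flow of such a bot can be sketched as follows. This is not the actual implementation: `decide` stands in for the model call, `notify` for whatever notification channel is used, and the assumption that unexpected input surfaces as a `ValueError` is made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    """Outcome for one file: the value to write, why, and how sure the bot is."""
    new_value: str
    reason: str
    confident: bool


def process_batch(
    files: dict[str, str],
    decide: Callable[[str], Decision],
    notify: Callable[[str], None],
) -> dict[str, Decision]:
    """Decide per file; on unexpected input, notify and skip instead of failing silently."""
    results: dict[str, Decision] = {}
    for name, content in files.items():
        try:
            decision = decide(content)
        except ValueError as exc:
            # Assumption: the decision step raises ValueError on input it cannot handle.
            notify(f"{name}: unexpected input ({exc})")
            continue
        # Uncertain decisions are kept, with the flag visible for later review.
        results[name] = decision
    return results
```

Keeping the `confident` flag on every result is what lets a team member see afterwards where the bot was uncertain, rather than only seeing the final value.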
-- Anonymous case B
Local RAG over internal wiki at a knowledge organisation
A knowledge organisation had an extensive internal wiki with procedures, policy documents, and technical manuals. The problem: employees asked colleagues the same questions that were already answered in the wiki, because searching it was too slow and too inaccurate. The information was available but not findable.
I built a local RAG pipeline on a server within the organisation's network. All wiki documents are indexed as embeddings in a vector store. Through a simple chat interface an employee asks a question and receives an answer with direct references to the relevant pages. No data leaves the network: the model runs locally on a Llama variant via Ollama.
The system is not a replacement for the wiki but a layer on top of it. The wiki remains the source of truth. The RAG tool makes that source accessible without anyone needing to know exactly where something is stored.
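The retrieval step of a RAG pipeline like this can be illustrated with a toy example. It assumes a bag-of-words vector as a stand-in for a real embedding model and an in-memory dictionary instead of a vector store; all function names are hypothetical, and the resulting prompt would be sent to the local model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, pages: dict[str, str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the titles of the most similar wiki pages with their scores."""
    query = embed(question)
    scored = [(title, cosine(query, embed(body))) for title, body in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(title, score) for title, score in scored[:top_k] if score > 0]


def build_prompt(question: str, pages: dict[str, str], top_k: int = 2) -> str:
    """Assemble a prompt that contains only the retrieved pages, labelled by title."""
    hits = retrieve(question, pages, top_k)
    context = "\n\n".join(f"[{title}]\n{pages[title]}" for title, _ in hits)
    return (
        "Answer using only the pages below and cite the page titles.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Labelling each retrieved page with its title is what makes it possible to return an answer with references back to the wiki, which stays the source of truth.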
What I do not do
There are things I deliberately keep out of scope:
- ▸Out-of-the-box ChatGPT API integrations where the only goal is having 'AI' on the website. That does not solve any problem.
- ▸AI features as a demo or pilot with no plan for production deployment. A prototype that ends up in a drawer is wasted money.
- ▸Training custom models on small datasets. Fine-tuning only makes sense with thousands of examples and a specific reason why a foundation model falls short.
- ▸Systems where the output cannot be verified. Every AI decision in production needs an audit trail. Without that, I do not go live.
Cost and scope
Custom AI tools vary considerably in size. A POC to validate whether the approach works is different from a production system with monitoring and logging. I provide a quote after the intake based on the actual scope.
Price: on request
No fixed packages. The price depends on the complexity of the task, the number of iterations needed for validation, and whether you want a retainer for maintenance and monitoring after delivery.