TIEMAN.IT

Integrate the OpenAI API into your own application

OpenAI's latest model, OpenAI's lighter model, embeddings and function calling work well. But a production-worthy integration is more than firing an API call. I build OpenAI integrations with token budgets, retry logic, prompt versioning and output evaluation built in from day one.

What the OpenAI API can do in your application

The OpenAI API is a collection of models and capabilities exposed as an HTTP service. Chat completions are the best-known feature, but the API goes beyond text generation. Each component has specific use cases where it performs best.

  • Chat completions (OpenAI's latest models): generate text, summarize, classify, reason over structured input. OpenAI's lighter model costs $0.15 per million input tokens and suits high-volume tasks like classification and extraction. OpenAI's latest model ($2.50 per million input tokens) is better for complex reasoning and accuracy.
  • Function calling: the model invokes structured functions in your own code based on user input. This lets you build AI agents that take action, not just return text.
  • Structured output: the model always returns valid JSON conforming to a schema you define. No parsing errors, no hallucinated fields.
  • Embeddings (text-embedding-3-small, text-embedding-3-large): convert text into vector representations for semantic search, recommendation systems and clustering. text-embedding-3-small costs $0.02 per million tokens and is sufficient for most use cases.
  • Vision: OpenAI's latest model can analyze images, read documents and combine visual information with textual context.
  • Audio (Whisper, TTS): speech-to-text transcription and text-to-speech for voice-driven interfaces or automated processing of audio recordings.
  • Fine-tuning: train a base model on your own data so it consistently applies your style, jargon or classification schema.

Which capabilities are relevant depends on your use case. I start with an analysis of the problem, then choose the model and features that deliver the best price-to-quality ratio.
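
To make the price-to-quality comparison concrete, the per-million-token prices can be turned into a cost estimate for a given workload. The following is a minimal sketch, not a billing tool: the model label `lighter-chat-model`, the helper names and the workload figures are hypothetical, the two prices are the ones quoted in the text, and output tokens (which are billed separately) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Input price of a model in US dollars per one million tokens."""
    name: str
    usd_per_million_input: float


def input_cost_usd(price: ModelPrice, input_tokens: int) -> float:
    """Return the input-token cost in US dollars for a given token count."""
    return price.usd_per_million_input * input_tokens / 1_000_000


def daily_input_cost_usd(price: ModelPrice, requests_per_day: int,
                         avg_input_tokens: int) -> float:
    """Estimate the daily input-token cost for a steady workload."""
    return input_cost_usd(price, requests_per_day * avg_input_tokens)


# Illustrative price points from the text; check current pricing before use.
LIGHT_MODEL = ModelPrice("lighter-chat-model", 0.15)
SMALL_EMBEDDING = ModelPrice("text-embedding-3-small", 0.02)

if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests per day at 800 input tokens each.
    cost = daily_input_cost_usd(LIGHT_MODEL, 10_000, 800)
    print(f"{cost:.2f} USD per day")
```

Running the same calculation for two candidate models on your real request volume is usually enough to see whether the cheaper model is worth testing first.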

GPT versus Claude: strengths per scenario

OpenAI and Anthropic (Claude) are the two dominant API providers. I work with both. The choice depends on the specific requirements of your project, not on loyalty to a provider.

  • OpenAI's lighter model for high-volume, cost-sensitive tasks: classification, extraction, summarization on large numbers of records. The per-token price is low enough for batch processing thousands of documents per day.
  • OpenAI's latest model for multimodal applications: you need images, audio or complex documents in the same request as text.
  • OpenAI's reasoning models for reasoning-intensive tasks: mathematical problem solving, code generation, planning logic where demonstrable reasoning is required.
  • Claude (Sonnet, Opus) for precise instruction following, long context windows and scenarios where avoiding hallucinations outweighs cost.
  • Hybrid architecture: for some systems I use OpenAI's lighter model as a first filter (cheap and fast) and Claude or OpenAI's latest model as a second layer for complex cases the first layer flags.
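
The hybrid architecture from the last bullet can be sketched as a two-layer router. This is an illustration under assumptions: the text does not say how the first layer flags a case, so a confidence threshold is used here as one possible rule, and `Verdict`, `route` and the two classifier callables are hypothetical names standing in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    label: str
    confidence: float  # 0.0 to 1.0, as estimated by the layer that produced it


# A classifier is any callable that maps input text to a Verdict.
Classifier = Callable[[str], Verdict]


def route(text: str, cheap: Classifier, strong: Classifier,
          threshold: float = 0.8) -> tuple[Verdict, str]:
    """Try the cheap layer first and escalate only low-confidence cases.

    Returns the final verdict and the name of the layer that produced it.
    """
    first = cheap(text)
    if first.confidence >= threshold:
        return first, "cheap"
    # The cheap layer flagged this case as uncertain: pay for the strong model.
    return strong(text), "strong"
```

Logging which layer answered each request shows how often the expensive model is actually needed, which is the number that determines whether the two-layer setup pays off.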

I have no financial interest in any particular provider. The recommendation is always based on the combination of accuracy requirements, volume, cost per request and the existing architecture of the project.

Production-grade integration: what it really takes

Writing a simple OpenAI API call takes ten minutes. An integration that runs reliably in production, keeps costs manageable and handles errors without users noticing is a different matter.

  • Token budgets per request: I set max_tokens per request type and adjust prompt structure to limit input tokens. This prevents a large input from unexpectedly generating a bill of tens of dollars.
  • Retry logic with exponential backoff: OpenAI rate limits hit every integration during peaks. I implement automatic retries with backoff and jitter so temporary errors (429 responses) are handled transparently.
  • Prompt caching: for requests that repeatedly include a large system prompt or contextual document, I keep that content as a stable prefix at the start of the prompt. OpenAI automatically caches repeated prompt prefixes, which can reduce costs by up to 50% at high volumes.
  • Streaming for user experience: for interfaces where the user waits for a response, I stream the response via Server-Sent Events. The first tokens appear immediately, greatly improving the perceived speed.
  • Output validation: even with structured output mode I validate the result with a schema validator (Zod in TypeScript, Pydantic in Python) before the system continues. Corrupt output is detected and optionally re-requested automatically.
  • Prompt versioning: prompts are code, not inline strings. I store them as files in Git, version them and test new versions on a subset of data before rolling out to production.
  • Cost monitoring: I log token usage per request, per user or per pipeline step so you can see where the API bill goes and adjust accordingly.

This is the infrastructure that makes the difference between a proof of concept and a system you dare to let your customers use.
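
As an illustration of the retry bullet above, here is a minimal sketch of exponential backoff with full jitter. `RateLimited` is a hypothetical stand-in for whatever exception your API client raises on a 429 response, and the default delays are assumptions, not recommendations from OpenAI.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the HTTP 429 error your API client raises."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimited with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The ceiling doubles per attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling so
            # parallel workers do not all retry at the same moment.
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be tested without real waiting. Only rate-limit errors are retried here; errors such as invalid requests should fail immediately.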

Implementing function calling correctly

Function calling is one of the most-used features of the OpenAI API and also one of the most misused. The model determines when to invoke a function based on user input. If the function definition is wrong, the model calls the wrong function or fills in parameters that do not match.

A correct implementation requires precise JSON Schema definitions per function, clear descriptions of parameters and unambiguous instructions in the system prompt about when each function is appropriate.

  • JSON Schema per function: each parameter has a type, a description and optionally enum values or constraints. Without a description the model does not know how to fill parameters correctly.
  • Parallel function calling: OpenAI's latest model supports calling multiple functions simultaneously in one response. Use this only when the functions are independent. With dependencies you must work sequentially.
  • Structured output mode as alternative: for cases where you always want a specific JSON object returned without function-calling overhead, Structured Output (response_format: json_schema) is the cleaner solution.
  • Steering tool choice: with tool_choice you can force the model to choose a specific function or prevent it from calling functions at all. Useful for controlled flows.
  • Error handling in function calls: if the model calls a function with invalid parameters, I return an error message as the function result so the model can correct the call.
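
The points about JSON Schema definitions and error handling can be sketched together. The tool definition below follows the shape the Chat Completions `tools` parameter accepts; the function `get_order_status`, its parameters and the hand-written validator are hypothetical and kept deliberately small (a real project would more likely use a schema validation library).

```python
import json
from typing import Any, Callable

# Hypothetical tool definition: every parameter has a type and a description.
ORDER_STATUS_TOOL: dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of a customer order. "
                       "Use only when the user provides an order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string",
                             "description": "Order ID, for example 'A-1042'."},
                "detail": {"type": "string", "enum": ["summary", "full"],
                           "description": "How much detail to return."},
            },
            "required": ["order_id", "detail"],
            "additionalProperties": False,
        },
    },
}


def get_order_status(order_id: str, detail: str) -> dict[str, Any]:
    """Hypothetical business function; a real one would query your systems."""
    return {"order_id": order_id, "status": "shipped", "detail": detail}


# Maps a function name to its parameter schema and its Python handler.
HANDLERS: dict[str, tuple[dict[str, Any], Callable[..., Any]]] = {
    "get_order_status": (ORDER_STATUS_TOOL["function"]["parameters"],
                         get_order_status),
}


def find_problems(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Check required keys, unknown keys, string types and enum values."""
    problems = []
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required parameter '{key}'")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unknown parameter '{key}'")
        elif spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"parameter '{key}' must be a string")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"parameter '{key}' must be one of {spec['enum']}")
    return problems


def execute_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call and return a JSON string to use as the tool result.

    Invalid calls return an error payload instead of raising, so the model
    can read the message and correct its next call.
    """
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown function '{name}'"})
    schema, handler = HANDLERS[name]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments are not valid JSON"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    problems = find_problems(args, schema)
    if problems:
        return json.dumps({"error": "; ".join(problems)})
    return json.dumps(handler(**args))
```

The string returned by `execute_tool_call` is what would be sent back to the model as the content of the tool result message, whether it holds a real result or an error description.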

Embeddings and vector search: building semantic search

Embeddings are the foundation of semantic search: the ability to search by meaning rather than exact keywords. An embedding is a numerical vector representing the semantic content of a piece of text. Two sentences with similar meaning lie close together in vector space, even if they share no words.
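
"Close together in vector space" is commonly measured with cosine similarity. The sketch below shows the calculation on made-up three-dimensional vectors; real embeddings have far more dimensions, and the toy values are invented purely for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented toy vectors: two texts about payments and one about the weather.
INVOICE = [0.9, 0.1, 0.0]
BILL = [0.8, 0.2, 0.1]
WEATHER = [0.0, 0.1, 0.9]

if __name__ == "__main__":
    print(cosine_similarity(INVOICE, BILL) > cosine_similarity(INVOICE, WEATHER))
```

A value near 1 means the vectors point in almost the same direction, a value near 0 means they are unrelated.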

OpenAI's text-embedding-3-small generates vectors of 1536 dimensions and costs $0.02 per million tokens. text-embedding-3-large generates 3072-dimension vectors and is more accurate for subtle semantic differences, but costs more and requires greater storage capacity.
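
The storage difference between the two models can be estimated with simple arithmetic. This sketch assumes each dimension is stored as a 4-byte float and ignores index and row overhead, so real database sizes will be larger; the function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one 4-byte float per dimension


def vector_bytes(dimensions: int) -> int:
    """Raw size of a single embedding vector in bytes."""
    return dimensions * BYTES_PER_FLOAT32


def corpus_gib(dimensions: int, vector_count: int) -> float:
    """Raw size of a collection of vectors in GiB, without any overhead."""
    return vector_bytes(dimensions) * vector_count / 2**30


if __name__ == "__main__":
    for dims in (1536, 3072):
        print(dims, vector_bytes(dims), round(corpus_gib(dims, 1_000_000), 2))
```

Under these assumptions one vector takes 6,144 bytes at 1536 dimensions and 12,288 bytes at 3072 dimensions, so the larger model doubles raw vector storage.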

  • Postgres with pgvector: for most SMB use cases, pgvector in an existing Postgres database is sufficient. It avoids an extra external service and fits directly into your existing data model.
  • Specialized vector databases (Qdrant, Pinecone, Weaviate): worthwhile at millions of vectors, or when you need advanced filtering, hybrid search (vector + keyword) or scalable index updates.
  • Chunking strategy: I split long documents into overlapping segments of 200 to 500 tokens per chunk. The overlap ensures context around segment boundaries is not lost.
  • Reindexing on changes: when the embedding model or chunking strategy changes, all existing vectors are invalid. I build reindexing into the deployment pipeline.
  • Hybrid search: for better recall I combine vector search with classic keyword search (BM25 or tsvector in Postgres). Reciprocal Rank Fusion combines the results of both methods.
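
Two of the techniques above, overlapping chunks and Reciprocal Rank Fusion, fit in a short sketch. The assumptions are labeled: the input is an already tokenized list (a real pipeline would use the tokenizer that matches the embedding model), the default chunk size and overlap are illustrative values within the range mentioned above, and `k = 60` is a commonly used constant for Reciprocal Rank Fusion rather than a value taken from this text.

```python
def chunk_tokens(tokens: list[str], size: int = 300,
                 overlap: int = 50) -> list[list[str]]:
    """Split tokens into chunks of at most `size`, sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one list, best first.

    Each document scores 1 / (k + rank) per list it appears in, with rank
    starting at 1. Ties are broken alphabetically for a stable order.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical vector search result
    keyword_hits = ["b", "d", "a"]  # hypothetical keyword search result
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that rank well in both lists rise to the top, while documents found by only one method are still kept further down the merged list.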

Evaluation: knowing your output quality

An AI integration without evaluation is a black box. You do not know whether the model is giving the right answers, whether prompt changes bring improvement, or whether quality degrades over time due to OpenAI model updates.

I build evaluation in as part of the integration, not as an afterthought. The approach depends on the task.

  • Labeled test set: for classification and extraction tasks I assemble a set of one hundred to five hundred labeled examples. Every prompt change is tested on this set before rollout.
  • LLM-as-judge: for open-ended generation tasks (summarization, advice, answering questions) I use a second model as evaluator. It assesses the output on accuracy, completeness and style according to a predefined rubric.
  • A/B testing prompts: new prompt versions are first rolled out to a small percentage of requests. Token usage, latency and the LLM-as-judge score are the metrics.
  • Drift monitoring: OpenAI regularly releases model updates. I monitor evaluation metrics continuously so a silent degradation becomes visible before it becomes a problem.
  • Cost per request logging: for every API call I log tokens used and calculate the cost. This makes it possible to see which use cases are expensive and where optimization makes sense.
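
The labeled test set from the first bullet can be wired into a small harness. This is a sketch under assumptions: `predict` stands for whatever function wraps your prompt and model call, and the rollout rule (allow at most a one-point drop against the baseline) is an illustrative threshold, not one stated in the text.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)


def evaluate(predict: Callable[[str], str],
             examples: Sequence[Example]) -> tuple[float, list[Example]]:
    """Return accuracy on the labeled set plus the examples that were missed."""
    if not examples:
        raise ValueError("the test set must contain at least one example")
    misses = [(text, expected) for text, expected in examples
              if predict(text) != expected]
    return 1 - len(misses) / len(examples), misses


def safe_to_roll_out(candidate_accuracy: float, baseline_accuracy: float,
                     tolerance: float = 0.01) -> bool:
    """Allow rollout only if the new prompt is not clearly worse."""
    return candidate_accuracy >= baseline_accuracy - tolerance
```

Keeping the list of misses, not just the score, is what makes the harness useful: the missed examples show which categories a new prompt version confuses.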

-- Anonymous case study

SMB automates customer ticket classification with OpenAI's lighter model

A company with a customer service department received one hundred to one hundred and fifty support tickets per day by email. The tickets had to be manually categorized (billing question, technical issue, delivery complaint, cancellation) and assigned to the right staff member. This took an average of five minutes per ticket.

I built an integration where each incoming ticket is processed via the OpenAI API using OpenAI's lighter model. The model classifies the ticket into one of the four categories, estimates urgency based on tone and content, and writes a draft reply as a starting point for the staff member. Classification is correct in 94 out of 100 cases on the test set. Cost per ticket: $0.003.

  • Classification accuracy: 94%
  • Cost per ticket: $0.003
  • Tickets processed per day: 150+

When a staff member opens a ticket they already see the category, the urgency assessment and a draft reply. Corrections to the classification are logged and form the basis for the next iteration of the labeled test set.
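
The figures from this case can be put side by side with a short calculation. It uses only the numbers given above (five minutes per ticket, 100 to 150 tickets per day, $0.003 per ticket); the result for manual hours is the time previously spent on triage, not a measured saving, since staff still review each ticket.

```python
MINUTES_PER_TICKET = 5       # manual triage time from the case study
COST_PER_TICKET_USD = 0.003  # API cost per ticket from the case study


def manual_hours_per_day(tickets: int) -> float:
    """Hours per day spent on manual categorization at the given volume."""
    return tickets * MINUTES_PER_TICKET / 60


def api_cost_per_day_usd(tickets: int) -> float:
    """Daily API cost in US dollars at the given volume."""
    return tickets * COST_PER_TICKET_USD


if __name__ == "__main__":
    for tickets in (100, 150):
        print(tickets, round(manual_hours_per_day(tickets), 1),
              round(api_cost_per_day_usd(tickets), 2))
```

At 150 tickets per day this comes to 12.5 hours of manual triage against roughly $0.45 in API costs.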

What I do not do

Clarity about the boundaries of my approach is part of good advice.

  • Using OpenAI's latest model where OpenAI's lighter model is sufficient. If the task is classification or extraction on structured input, the more expensive model is rarely better. I test with the cheaper model first.
  • Skipping the evaluation pipeline. An AI system without evaluation is a risk. I always build in a baseline test set, even if it is small.
  • Hard-coding prompts without versioning. Prompts are code. They belong in Git, with diffs and a deployment process.
  • Leaving out cost monitoring. OpenAI bills can grow quickly at production volumes. I want you to see what it costs per request, per day, per user.
  • Delivering an integration without documentation of the architectural decisions. You need to understand in six months why certain choices were made.

Pricing

The cost of an OpenAI API integration depends on the scope: the number of endpoints, the complexity of the function-calling logic, the evaluation setup and any embedding pipeline. I always provide a fixed price estimate after an intake conversation.

On request

OpenAI API costs (tokens, API calls) are outside my fee and are invoiced directly by OpenAI on your account.

-- Frequently asked questions

Do you have a question?

I work with OpenAI's lighter model for high-volume and cost-sensitive tasks, OpenAI's latest model for multimodal or more complex reasoning, and OpenAI's reasoning models for intensive reasoning tasks. The choice is based on accuracy requirements and budget, not on the prestige of the model name.
I implement retry logic with exponential backoff and jitter as standard. At high volumes I evaluate whether it makes sense to distribute requests across multiple API keys or organizations, or to build a queue that caps throughput below the rate-limit threshold.
Yes, OpenAI supports fine-tuning for OpenAI's lighter model and GPT-3.5-turbo. Fine-tuning makes sense when you need a consistent style, specific jargon or a custom classification schema that you cannot achieve through prompting. The training data must be labeled. I help with both composition and evaluation.
I always assemble a test set with labeled examples. After every prompt change I re-run the evaluation and compare scores. For open-ended generation tasks I use LLM-as-judge: a second model that assesses the output on accuracy and completeness according to a rubric.
OpenAI does not use data sent via the API to train their models, provided you use the commercial API and a data processing agreement is in place. For situations with stricter GDPR requirements or personal data that must not leave the EU, Azure OpenAI Service is an option: the same models, but hosted in an EU datacenter under your own Azure tenant.
Yes. Azure OpenAI offers the same models as the direct OpenAI API, but hosted in an Azure region of your choice. It makes sense when you are already on Azure, require GDPR sovereignty, or need Private Endpoints so API traffic does not go over the public internet.

Ready to integrate the OpenAI API into your application?

After a short conversation I can tell you which model fits your use case best and what a realistic implementation costs.