The OpenAI chat completions endpoint is the simplest LLM API: send a list of messages, get back the next assistant message. That's it. Everything else — streaming, function calling, JSON mode, vision, voice — is built on top of this. If you understand one HTTPS POST request, you understand 80% of what shipping an LLM app means.
Anatomy of a chat completion
A request is a JSON body with: `model` (e.g. 'gpt-4o-mini'), `messages` (list of `{role, content}` pairs with roles 'system', 'user', 'assistant'), and optional knobs like `temperature` (randomness 0-2), `max_tokens` (cap), `stream` (true to receive tokens as they generate).
POST https://api.openai.com/v1/chat/completions
Authorization: Bearer YOUR_API_KEY
{
'model': 'gpt-4o-mini',
'messages': [
{'role': 'system', 'content': 'You are concise.'},
{'role': 'user', 'content': 'What is 2+2?'}
],
'temperature': 0.7
}
→ {'choices': [{'message': {'role': 'assistant', 'content': '4'}}], 'usage': {...}}One HTTPS request, one JSON response
Token-based pricing
OpenAI bills per million tokens, with separate rates for input and output (output is usually 2-4× more expensive). 1000 tokens ≈ 750 English words. A typical chat turn is 50-200 input tokens + 50-500 output tokens. Watch the cost like you'd watch infrastructure spend — long contexts (RAG with many docs) and chatty users can rack up bills fast.
- gpt-4o-mini — fast and cheap, the default for most apps ($0.15/$0.60 per 1M tokens)
- gpt-4o — flagship, best quality, ~30× more expensive
- o1-preview — reasoning model, slow but solves harder problems
- Add `stream=True` to receive tokens incrementally — feels like typing
- Add `response_format={'type': 'json_object'}` for guaranteed JSON
- Use `tools` parameter to enable function calling