Edge deployment runs inference on devices close to the user or sensor: phones, browsers, cameras, factories, vehicles, kiosks, or embedded boards. The model must fit tight limits for memory, compute, power, latency, privacy, and sometimes offline operation.
Optimization tools
- Quantization: store weights and activations in lower precision such as int8 or int4 to reduce size and speed up inference.
- Pruning: remove weights, channels, or heads that contribute little to output quality.
- Distillation: train a small student model to mimic a larger teacher model.
- Compilation: convert the model to a runtime such as ONNX Runtime, TensorRT, Core ML, TFLite, WebGPU, or a vendor SDK.
- Batching and caching: amortize per-call overhead across grouped requests and reuse repeated work when the latency budget allows.
- Telemetry: report health, latency, failure rates, and coarse drift without sending private raw data unnecessarily.
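As an illustration of the quantization item above, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen for this example. Real toolchains typically add calibration data, per-channel scales, and activation quantization.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


# Hypothetical weights, for illustration only.
weights = [0.42, -1.27, 0.03, 0.88, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte instead of four, at the cost of a bounded rounding error. Whether that error is acceptable depends on how it affects end-task quality, which has to be measured on the target model.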
edge score = quality / (latency + memory + power + update complexity)
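The terms in the denominator have different units, so the score only makes sense once they are put on a common scale. One possible reading, sketched below, divides each cost by its budget before summing. This normalization, the function name `edge_score`, and every number in the example are assumptions made for illustration, not part of the formula itself.

```python
def edge_score(quality, costs, budgets):
    """quality / sum of budget-normalized costs (normalization is an assumption)."""
    total = sum(costs[name] / budgets[name] for name in costs)
    return quality / total


# Hypothetical budgets: a cost equal to its budget contributes 1.0.
budgets = {"latency_ms": 50, "memory_mb": 64, "power_w": 2.0, "update": 1.0}

# Hypothetical candidates, for illustration only.
large = edge_score(
    0.94,
    {"latency_ms": 280, "memory_mb": 420, "power_w": 2.0, "update": 1.0},
    budgets,
)
small = edge_score(
    0.928,
    {"latency_ms": 22, "memory_mb": 28, "power_w": 0.5, "update": 1.0},
    budgets,
)

print(f"large: {large:.3f}  small: {small:.3f}")
```

Under these assumed budgets the smaller model scores higher despite its lower quality, because it stays within every budget. A score like this is useful for ranking candidates, not as a replacement for hard limits.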
Typical trade:
420 MB fp32 model -> 28 MB int8 model
94% accuracy -> 92.8% accuracy
280 ms cloud round trip -> 22 ms local inference
The best edge model is the one that meets the product constraint.
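The arithmetic behind the trade above can be checked with a short sketch. It reads the three lines as one before/after pair, and the `Candidate` class, the `meets` helper, and the limits passed to it are hypothetical. Note that fp32 to int8 alone is about a 4x size reduction, so a 15x reduction implies pruning or distillation as well.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    size_mb: float
    accuracy: float
    latency_ms: float


def meets(c, max_size_mb, min_accuracy, max_latency_ms):
    """True only if the candidate satisfies every product limit."""
    return (
        c.size_mb <= max_size_mb
        and c.accuracy >= min_accuracy
        and c.latency_ms <= max_latency_ms
    )


before = Candidate("fp32 via cloud", 420, 0.94, 280)
after = Candidate("int8 on device", 28, 0.928, 22)

size_ratio = before.size_mb / after.size_mb                   # 15x smaller
accuracy_drop_pts = (before.accuracy - after.accuracy) * 100  # about 1.2 points
latency_ratio = before.latency_ms / after.latency_ms          # about 12.7x faster

# Hypothetical product limits: 50 MB, 92% accuracy, 50 ms.
limits = dict(max_size_mb=50, min_accuracy=0.92, max_latency_ms=50)
for c in (before, after):
    print(c.name, "meets limits:", meets(c, **limits))
```

With these assumed limits, the more accurate model fails on size and latency while the smaller one passes, which is the point of the closing sentence: accuracy only matters among candidates that satisfy the constraints.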