AI / ML Developer — LLM Deployment & Optimization
Design, deploy, and maintain robust self-hosted large language model systems optimized for enterprise production tasks, covering the full lifecycle from fine-tuning to low-latency inference.
About Contezy
Contezy engineers scalable web experiences and AI-driven systems that power automation and critical decision-making. Our focus is on reproducibility, cost-effective performance, and the secure, production-grade deployment of machine learning infrastructure.
Role Summary
The AI / ML Developer will architect and maintain our proprietary self-hosted LLM systems, developing solutions for retrieval-augmented generation (RAG), custom task-specific assistants, knowledge indexing, and real-time inference optimization. You will manage the ML lifecycle across model selection, dataset engineering, fine-tuning, deployment, and performance monitoring.
Key Responsibilities
- Select, benchmark, and evaluate LLMs against the performance, latency, and cost trade-offs that matter in production.
- Fine-tune and adapt foundation models with supervised fine-tuning (SFT) and parameter-efficient methods such as LoRA (e.g., via the PEFT library), using curated, clean datasets.
- Design and implement RAG pipelines built on vector stores, embedding models, and efficient indexing workflows (see the retrieval sketch after this list).
- Develop scalable model-serving APIs and inference systems, including multi-GPU configurations, quantized-model serving, and batching strategies (see the serving sketch after this list).
- Containerize models and deploy end-to-end ML workflows using Docker and Kubernetes within CI/CD pipelines.
- Optimize inference performance through quantization, ONNX conversion, and accelerated runtimes (e.g., Triton, TensorRT).
- Instrument observability and performance metrics, including latency, throughput, and operational cost monitoring (see the instrumentation sketch after this list).
- Collaborate with cross-functional engineering teams to integrate LLM services into core production systems.
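For illustration, a minimal sketch of the retrieval step in a RAG pipeline, assuming FAISS and a sentence-transformers embedding model; the model name, corpus, and query are placeholders, not a prescribed stack:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder corpus; in production these passages come from a knowledge index.
documents = [
    "vLLM uses continuous batching to raise GPU throughput.",
    "Quantization shrinks a model's memory footprint at some accuracy cost.",
    "LoRA adapts a frozen base model by training low-rank update matrices.",
]

# all-MiniLM-L6-v2 is a small, common default embedding model (an assumption).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

# Inner-product search over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(np.asarray(doc_vectors, dtype="float32"))

query_vecs = embedder.encode(
    ["How does batching improve inference throughput?"],
    normalize_embeddings=True,
)
scores, ids = index.search(np.asarray(query_vecs, dtype="float32"), 2)

# The top-k passages would then be stitched into the LLM prompt as context.
for score, doc_id in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[doc_id]}")
```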
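In the same spirit, a hedged sketch of batched inference with vLLM, one of the serving frameworks named under the qualifications; the engine applies continuous batching internally, so the prompts below are scheduled onto the GPU together (the checkpoint name is illustrative):

```python
from vllm import LLM, SamplingParams

# Any local or Hub-hosted weights work; this checkpoint is a placeholder.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize the benefits of quantized inference.",
    "Explain retrieval-augmented generation in one sentence.",
]

# generate() runs all prompts through the engine's batching scheduler.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```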
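Finally, a small instrumentation sketch using prometheus_client to capture the latency and throughput signals mentioned above; run_model is a hypothetical stand-in for the real inference call:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("llm_requests_total", "Total inference requests served")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")

def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the actual inference call.
    time.sleep(0.05)
    return "ok"

def serve_request(prompt: str) -> str:
    REQUESTS.inc()
    with LATENCY.time():  # records elapsed seconds into the histogram
        return run_model(prompt)

if __name__ == "__main__":
    start_http_server(9090)  # exposes /metrics for Prometheus to scrape
    serve_request("hello")
```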
Required Qualifications
- 2+ years of professional experience in AI/ML engineering, with substantial hands-on experience in LLM deployment and serving.
- Expertise in Python, PyTorch, and developing robust, production-ready ML pipelines.
- Direct experience self-hosting LLMs, working with model serving frameworks (e.g., vLLM, Text Generation Inference), and optimizing GPU utilization.
- Strong proficiency in containerization (Docker/Kubernetes) and developing high-performance backend APIs (FastAPI/Flask).
- Knowledge of vector databases (e.g., FAISS, Milvus, Pinecone) and effective retrieval strategies for RAG systems.
- Familiarity with quantization methods, LoRA fine-tuning, and deployment optimization strategies aimed at cost efficiency (a minimal LoRA sketch follows this list).
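As a point of reference for the LoRA requirement above, a minimal adapter setup with Hugging Face PEFT; the rank, target modules, and base checkpoint are illustrative choices rather than prescriptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint is a placeholder; any causal LM works.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank updates
    lora_alpha=32,                        # scaling applied to the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the adapter weights train; the base model stays frozen.
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```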
Preferred Skills
- Extensive experience with the Hugging Face ecosystem (Transformers, Accelerate) and orchestration tools like LangChain or LlamaIndex.
- In-depth knowledge of open foundation models (Llama, Mistral, Falcon, Mixtral, etc.) and quantized inference formats and runtimes (GGUF, bitsandbytes, llama.cpp); a quantized-load sketch follows this list.
- Exposure to MLOps practices, model observability platforms, and reproducible training and deployment workflows.
- Documented experience fine-tuning or benchmarking foundation models on proprietary datasets.
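For context on the quantized-inference stack above, a hedged sketch of a 4-bit load through transformers with a bitsandbytes config; the checkpoint is a placeholder, and NF4 with bfloat16 compute is a common but not mandatory choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # shard layers across available GPUs
)

prompt = "Quantization trades a little accuracy for"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```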