AI / ML Developer — LLMs & Self-Hosting
Design, deploy, and maintain self-hosted large language model systems optimized for production tasks — from dataset curation and fine-tuning to API configuration, inference optimization, and monitoring.
About Contezy
Contezy builds scalable web experiences and AI-driven systems that power automation and decision-making at scale. We emphasize reproducibility, cost-effective performance, and secure deployment of machine learning infrastructure.
Position Summary
The AI / ML Developer will architect and maintain self-hosted LLM systems for retrieval-augmented generation, task-specific assistants, knowledge indexing, and real-time inference. You’ll work across model selection, fine-tuning, dataset engineering, deployment, and monitoring pipelines.
Key Responsibilities
- Select and benchmark LLMs based on performance, latency, and cost trade-offs.
- Fine-tune and adapt models on curated datasets, via full supervised fine-tuning or parameter-efficient (PEFT) methods such as LoRA.
- Implement retrieval-augmented generation (RAG) with vector stores and embedding workflows.
- Develop scalable model-serving APIs and inference systems (multi-GPU, quantized models, batching).
- Containerize and deploy models using Docker and Kubernetes with CI/CD workflows.
- Optimize inference performance via quantization, ONNX export, and accelerated runtimes.
- Instrument observability and performance metrics: latency, throughput, and cost monitoring.
- Collaborate with cross-functional teams to integrate models into production systems.
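To give a flavor of the RAG responsibility above: the core loop is embed, retrieve, then prompt with grounded context. The dependency-free sketch below uses plain cosine similarity over hand-written toy vectors as a stand-in for a real embedding model and vector store (FAISS, Milvus); the passages and vectors are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, top_k=2):
    # Rank stored (vector, text) pairs by similarity to the query vector.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_prompt(question, passages):
    # Ground the model's answer in the retrieved context.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy "embeddings": in production these come from an embedding model.
index = [
    ([1.0, 0.0, 0.0], "Invoices are processed nightly."),
    ([0.0, 1.0, 0.0], "Refunds take five business days."),
    ([0.9, 0.1, 0.0], "Invoice disputes are handled by the billing team."),
]
```

In a real deployment the `index` lookup would be an approximate-nearest-neighbor query against a vector database, and `build_prompt`'s output would be sent to the served LLM.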
Required Skills & Experience
- 2+ years professional experience in AI/ML engineering, with hands-on LLM deployment.
- Expertise in Python, PyTorch, and ML pipeline development.
- Experience with self-hosting LLMs, model serving (vLLM, Text Generation Inference, etc.), and GPU optimization.
- Strong understanding of containerization (Docker/Kubernetes) and backend APIs (FastAPI/Flask).
- Knowledge of vector databases (FAISS, Milvus, Pinecone) and retrieval strategies.
- Familiarity with quantization, LoRA fine-tuning, and deployment optimization for cost efficiency.
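As a concrete illustration of why LoRA appears next to cost efficiency in the list above: instead of updating a full d_out × d_in weight matrix, LoRA trains two low-rank factors B (d_out × r) and A (r × d_in), shrinking the trainable parameter count from d_out·d_in to r·(d_out + d_in). The arithmetic below uses hypothetical dimensions, not figures from any particular model.

```python
def lora_trainable_params(d_in, d_out, rank):
    # Full fine-tuning updates every entry of the (d_out x d_in) matrix;
    # LoRA trains only B (d_out x rank) and A (rank x d_in).
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Example: one 4096x4096 projection at rank 8.
full, lora = lora_trainable_params(4096, 4096, rank=8)
savings = 1 - lora / full  # fraction of parameters that need no gradient state
```

For this single layer, LoRA trains 65,536 parameters instead of 16,777,216, which is what makes fine-tuning large models feasible on modest GPU budgets.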
Preferred Skills
- Experience with Hugging Face Transformers, Accelerate, and LangChain.
- Knowledge of open models (Llama, Mistral, Falcon, Mixtral, etc.) and quantized inference formats and frameworks (GGUF, bitsandbytes, llama.cpp).
- Exposure to MLOps, model observability, and reproducible training workflows.
- Experience fine-tuning or benchmarking foundation models on custom datasets.