AI / ML Developer — LLMs & Self-Hosting

Design, deploy, and maintain self-hosted large language model systems optimized for production tasks — from dataset curation and fine-tuning to API configuration, inference optimization, and monitoring.

Full-time • Remote / Hybrid • Min. 2 years' experience • Contezy — AI & Web Infrastructure

About Contezy

Contezy builds scalable web experiences and AI-driven systems that power automation and decision-making at scale. We emphasize reproducibility, cost-effective performance, and secure deployment of machine learning infrastructure.

Position Summary

The AI / ML Developer will architect and maintain self-hosted LLM systems for retrieval-augmented generation, task-specific assistants, knowledge indexing, and real-time inference. You’ll work across model selection, fine-tuning, dataset engineering, deployment, and monitoring pipelines.

Key Responsibilities

  • Select and benchmark LLMs based on performance, latency, and cost trade-offs.
  • Fine-tune and adapt models via supervised fine-tuning or parameter-efficient methods (PEFT) such as LoRA, using curated datasets.
  • Implement retrieval-augmented generation (RAG) with vector stores and embedding workflows.
  • Develop scalable model-serving APIs and inference systems (multi-GPU, quantized models, batching).
  • Containerize and deploy models using Docker and Kubernetes with CI/CD workflows.
  • Optimize inference performance with quantization, ONNX, and accelerated runtimes.
  • Instrument observability and performance metrics: latency, throughput, and cost monitoring.
  • Collaborate with cross-functional teams to integrate models into production systems.

Required Skills & Experience

  • 2+ years professional experience in AI/ML engineering, with hands-on LLM deployment.
  • Expertise in Python, PyTorch, and ML pipeline development.
  • Experience with self-hosting LLMs, model serving (vLLM, Text Generation Inference, etc.), and GPU optimization.
  • Strong understanding of containerization (Docker/Kubernetes) and backend APIs (FastAPI/Flask).
  • Knowledge of vector databases (FAISS, Milvus, Pinecone) and retrieval strategies.
  • Familiarity with quantization, LoRA fine-tuning, and deployment optimization for cost efficiency.

Preferred Skills

  • Experience with Hugging Face Transformers, Accelerate, and LangChain.
  • Knowledge of open models (Llama, Mistral, Falcon, Mixtral, etc.) and quantized inference frameworks (GGUF, bitsandbytes, llama.cpp).
  • Exposure to MLOps, model observability, and reproducible training workflows.
  • Experience fine-tuning or benchmarking foundation models on custom datasets.
Apply Now