The LinkedIn post by Akshay Pachaar provides a hands-on guide for deploying Large Language Models (LLMs) in production, based on two years of experience. It outlines a specific workflow used at his company, emphasizing the transition from local experimentation to a robust, scalable production environment using on-demand GPU infrastructure.
The core problem addressed is the limitation of local experimentation for LLMs, necessitating proper compute for fine-tuning, exporting in the correct format, and deploying behind an endpoint capable of handling real requests. The workflow prioritizes infrastructure staying "out of the way," offering flexibility to move between cheaper GPUs for prototyping and higher-end options for scaling without quotas or approvals, thus accelerating iteration. The author appreciates RunPod for achieving this ideal.
Steps and Procedures:
Names and Entities: The post is by an "Unknown Author" initially, but later mentions "Akshay Pachaar" as the author to follow. Key entities include RunPod (on-demand GPU infrastructure), Unsloth (efficient fine-tuning tool), and SGLang (model serving tool). The model mentioned is gpt-oss-20B, and LoRA adapters are used. The inference endpoint is OpenAI-compatible.
Tools and Technologies: RunPod provides on-demand GPU infrastructure. Unsloth is used for efficient fine-tuning. SGLang is used for model serving. Jupyter is used inside the Pod for code execution. An RTX 4090 GPU is recommended. The standard OpenAI client is used for sending requests.
Facts and Data: The author has been deploying LLMs in production for 2 years. The workflow enables fine-tuning and inference for a 20B model (gpt-oss-20B). A merged 16-bit checkpoint is saved. RunPod charges by the second.
Timeln saves articles, videos, and posts — then summarizes, tags, and connects them so you never lose a good find again.
Save anything
one click
AI summaries
instant
Connected ideas
automatic
Free forever · No credit card · 30 seconds to start