Skip to main content

    Hands-on Guide: Deploying LLMs in Production Workflow

    llm deploymentfine-tuningmodel servinggpu infrastructureproduction workflowunsloth sglang
    February 10, 2026

    The LinkedIn post by Akshay Pachaar provides a hands-on guide for deploying Large Language Models (LLMs) in production, based on two years of experience. It outlines a specific workflow used at his company, emphasizing the transition from local experimentation to a robust, scalable production environment using on-demand GPU infrastructure.

    The core problem addressed is the limitation of local experimentation for LLMs, necessitating proper compute for fine-tuning, exporting in the correct format, and deploying behind an endpoint capable of handling real requests. The workflow prioritizes infrastructure staying "out of the way," offering flexibility to move between cheaper GPUs for prototyping and higher-end options for scaling without quotas or approvals, thus accelerating iteration. The author appreciates RunPod for achieving this ideal.

    Steps and Procedures:

    1. Spin up a RunPod Pod with a GPU: An RTX 4090 is recommended, with the laptop serving as the UI.
    2. Open Jupyter inside the Pod: All training and deployment code runs directly on the GPU.
    3. Load gpt-oss-20B with Unsloth: Optimizations kick in at import time, making a 20B model practical.
    4. Attach LoRA adapters: Train a small set of weights while keeping the base model frozen, instead of updating all 20B parameters.
    5. Run supervised fine-tuning: Unsloth's training loop is optimized for large models, ensuring fast training and low memory usage.
    6. Export the model: Save a merged 16-bit checkpoint combining the base model and LoRA adapters.
    7. Launch SGLang server: Loads the checkpoint and starts an OpenAI-compatible inference endpoint.
    8. Send requests using the standard OpenAI client: No custom tooling is needed.

    Names and Entities: The post is by an "Unknown Author" initially, but later mentions "Akshay Pachaar" as the author to follow. Key entities include RunPod (on-demand GPU infrastructure), Unsloth (efficient fine-tuning tool), and SGLang (model serving tool). The model mentioned is gpt-oss-20B, and LoRA adapters are used. The inference endpoint is OpenAI-compatible.

    Tools and Technologies: RunPod provides on-demand GPU infrastructure. Unsloth is used for efficient fine-tuning. SGLang is used for model serving. Jupyter is used inside the Pod for code execution. An RTX 4090 GPU is recommended. The standard OpenAI client is used for sending requests.

    Facts and Data: The author has been deploying LLMs in production for 2 years. The workflow enables fine-tuning and inference for a 20B model (gpt-oss-20B). A merged 16-bit checkpoint is saved. RunPod charges by the second.

    Share this

    Want AI summaries like this for everything you read?

    Timeln saves articles, videos, and posts — then summarizes, tags, and connects them so you never lose a good find again.

    Save anything

    one click

    AI summaries

    instant

    Connected ideas

    automatic

    Start saving for free

    Free forever · No credit card · 30 seconds to start