Source: https://www.linkedin.com/feed/update/urn:li:activity:7426560603277783040
LinkedIn Post by Unknown Author
I have been deploying LLMs in production for 2 years! Here's a hands-on guide if you want to do the same:

Local experimentation only takes you so far. At some point, the model needs to leave your machine. It needs to be fine-tuned on proper compute, exported in the right format, and deployed behind an endpoint that can handle real requests.

This is the workflow we use at our company:
- RunPod for on-demand GPU infra
- Unsloth for efficient fine-tuning
- SGLang for model serving

Here's how it works, step by step:
- Spin up a RunPod Pod with a GPU (an RTX 4090 works great). Your laptop just becomes the UI.
- Open Jupyter inside the Pod. All training and deployment code runs directly on the GPU.
- Load gpt-oss-20B with Unsloth. The optimizations kick in at import time, making a 20B model practical to work with.
- Attach LoRA adapters. Instead of updating all 20B parameters, you train a small set of weights while keeping the base model frozen.
- Run supervised fine-tuning. Unsloth's training loop is optimized for large models: training stays fast and memory stays low.
- Export the model. Save a merged 16-bit checkpoint that combines the base model and the LoRA adapters into one artifact.
- Launch the SGLang server. It loads your checkpoint and starts an OpenAI-compatible inference endpoint.
- Send requests with the standard OpenAI client. No custom tooling needed.

This setup takes gpt-oss-20B from fine-tuning to real inference, all on on-demand GPU compute.

Everything above ran on RunPod: fine-tuning, export, and deployment on the same infrastructure. I worked with the team to put this together.

What I appreciate about it is that the infrastructure stays out of the way. You rent the GPU, do your work, and pay by the second. When you're prototyping, you use a cheaper GPU. When you're ready to scale, the higher-end options are there.
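To see why the LoRA step keeps training cheap, here is a back-of-envelope estimate. The dimensions below are illustrative assumptions, not the actual gpt-oss-20B architecture: rank-16 adapters on four attention projections per layer train well under 1% of a 20B-parameter model.

```python
# Back-of-envelope: trainable parameters for rank-16 LoRA adapters
# versus full fine-tuning. d_model, n_layers, and n_proj below are
# assumed, illustrative values, not the real gpt-oss-20B config.

def lora_params(d_model: int, n_layers: int, rank: int, n_proj: int) -> int:
    """Each adapted projection adds two low-rank matrices:
    A (d_model x rank) and B (rank x d_model)."""
    return n_layers * n_proj * 2 * d_model * rank

total_params = 20_000_000_000  # ~20B base weights, all frozen
trainable = lora_params(d_model=6144, n_layers=48, rank=16, n_proj=4)

print(f"trainable LoRA params: {trainable:,}")
print(f"fraction of base model: {trainable / total_params:.4%}")
```

Under these assumed dimensions that is about 38M trainable parameters, roughly 0.19% of the base model, which is why the memory footprint of fine-tuning stays low.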
The flexibility to move between these without dealing with quotas or approvals makes iteration much faster. Infrastructure should disappear into the background. RunPod gets close to that ideal.

To get started, I have shared a link in the first comment.

Share this with your network if you found this insightful ♻️
Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
The LinkedIn post by Akshay Pachaar provides a hands-on guide for deploying Large Language Models (LLMs) in production, based on two years of experience. It outlines a specific workflow used at his company, emphasizing the transition from local experimentation to a robust, scalable production environment using on-demand GPU infrastructure.
The core problem addressed is the limitation of local experimentation for LLMs, necessitating proper compute for fine-tuning, exporting in the correct format, and deploying behind an endpoint capable of handling real requests. The workflow prioritizes infrastructure staying "out of the way," offering flexibility to move between cheaper GPUs for prototyping and higher-end options for scaling without quotas or approvals, thus accelerating iteration. The author appreciates RunPod for achieving this ideal.
Steps and Procedures: Spin up a RunPod Pod with a GPU and open Jupyter inside it; load gpt-oss-20B with Unsloth; attach LoRA adapters so only a small set of weights is trained while the base model stays frozen; run supervised fine-tuning; export a merged 16-bit checkpoint; launch an SGLang server on that checkpoint; and send requests with the standard OpenAI client.
Names and Entities: The post is by Akshay Pachaar. Key entities include RunPod (on-demand GPU infrastructure), Unsloth (efficient fine-tuning), and SGLang (model serving). The model is gpt-oss-20B, fine-tuned with LoRA adapters, and the inference endpoint is OpenAI-compatible.
Tools and Technologies: RunPod provides on-demand GPU infrastructure. Unsloth is used for efficient fine-tuning. SGLang is used for model serving. Jupyter is used inside the Pod for code execution. An RTX 4090 GPU is recommended. The standard OpenAI client is used for sending requests.
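The "launch SGLang server" step from the post comes down to a single command. A minimal sketch, assuming SGLang's standard `launch_server` entry point and an assumed checkpoint directory name (`./merged_model`); check the SGLang docs for model-specific flags:

```shell
# Serve the merged 16-bit checkpoint behind an OpenAI-compatible endpoint.
# ./merged_model is the directory the export step wrote to (assumed name).
python -m sglang.launch_server \
  --model-path ./merged_model \
  --port 30000
```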
Facts and Data: The author has been deploying LLMs in production for 2 years. The workflow enables fine-tuning and inference for a 20B model (gpt-oss-20B). A merged 16-bit checkpoint is saved. RunPod charges by the second.
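Because the server is OpenAI-compatible, a request is just a standard chat-completions payload. A minimal sketch using only the standard library; the localhost URL, port, and model name are assumptions, and in practice you would point the official `openai` client at the same base URL:

```python
import json
import urllib.request

# Assumed local SGLang endpoint (the port is whatever you passed at launch).
BASE_URL = "http://localhost:30000/v1"

def build_chat_request(prompt: str, model: str = "merged_model") -> dict:
    """Standard OpenAI chat-completions payload; works against any
    OpenAI-compatible server, including SGLang."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

payload = build_chat_request("Summarize LoRA in one sentence.")
print(json.dumps(payload, indent=2))

# Uncomment to send against a running server:
# req = urllib.request.Request(
#     f"{BASE_URL}/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```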