Fine-tuning a 600M-parameter Qwen model into a reliable question classifier

A hobbyist building a household RAG chatbot needed a cheap way to route questions into metadata categories (pool, car, HVAC, cooking) so that vector searches scan only the relevant indexed entries. The experiment asked whether a tiny 0.6B-parameter Qwen 3 model could be fine-tuned into a dependable classifier. Out of the box, prompting alone was useless: the base model got just 13 of 131 test questions right (~10%), leaned on a few broad labels, ignored most categories, and invented labels outside the allowed list.
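
The baseline failure modes described above (low accuracy, overuse of a few labels, labels outside the allowed list) can be measured with a small scoring helper. The following is a minimal sketch, not the author's evaluation code: the function name `score_predictions`, the four-entry `ALLOWED_LABELS` set, and the toy predictions are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical allowed label list. The source names pool, car, HVAC and
# cooking as examples; the full list is not given.
ALLOWED_LABELS = {"pool", "car", "hvac", "cooking"}


def score_predictions(predictions, gold_labels, allowed=ALLOWED_LABELS):
    """Return accuracy, label usage, and counts of out-of-list predictions."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    # Normalize whitespace and case so "Cooking " matches "cooking".
    normalized = [p.strip().lower() for p in predictions]
    correct = sum(p == g for p, g in zip(normalized, gold_labels))
    invalid = Counter(p for p in normalized if p not in allowed)
    return {
        "accuracy": correct / len(gold_labels) if gold_labels else 0.0,
        "correct": correct,
        "invalid_labels": dict(invalid),
        "label_usage": dict(Counter(normalized)),
    }


# Toy data, made up for illustration only.
gold = ["pool", "car", "hvac", "cooking"]
preds = ["pool", "vehicle", "ac/air", "Cooking "]
report = score_predictions(preds, gold)
print(report)
```

Tracking `invalid_labels` and `label_usage` separately from accuracy makes it visible whether a model is merely wrong or is ignoring the allowed list altogether.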

Fine-tuning with Unsloth and QLoRA on roughly 850 examples (split 70/15/15 into training, validation, and test sets) raised accuracy to 79%. The author emphasizes that dataset quality mattered more than tweaking hyperparameters, and that holding out test data is essential to catch overfitting. The remaining errors clustered around semantic overlap: the model emitted partial labels like ‘ac/air’ instead of ‘hvac’ and confused water-related categories (pool, fountain, water heater).
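
A 70/15/15 split with a held-out test set could look like the sketch below. It assumes the examples are simple (question, label) pairs and uses a fixed seed for reproducibility; the helper name `split_dataset`, the seed value, and the placeholder data are illustrative and not taken from the author's setup.

```python
import random


def split_dataset(examples, seed=42):
    """Shuffle, then split into 70% train, 15% validation, remainder as test."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # never shown to the model during training
    return train, val, test


# Placeholder examples; real ones would be (question, label) pairs.
examples = [(f"question {i}", "pool") for i in range(850)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

The test slice is only useful for catching overfitting if it is set aside before any training or prompt tuning happens and is scored once the model is final.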

The key insight came from a small prompt change rather than more training data. Instead of asking the model to output descriptive category strings, the author mapped each category to a two-character opaque ID with no semantic overlap. Forcing a fixed, non-overlapping output format pushed accuracy to ~92%. A few stubborn misses remain (water heater questions are still classified as pool), but the fine-tuned tiny model is now a usable, low-cost preprocessing step. The takeaway for anyone running local models: small classifiers can punch well above their parameter count when the label space is engineered to avoid ambiguity.
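
The opaque-ID idea can be sketched as a lookup table plus a strict decoder. The source does not list the actual IDs or the prompt wording, so the ID values, the prompt template, and the names `CATEGORY_IDS`, `build_prompt`, and `decode_label` below are all hypothetical.

```python
# Hypothetical ID assignments; the source does not give the real ones.
# The IDs are chosen so they share no letters or meaning with the categories.
CATEGORY_IDS = {
    "pool": "K7",
    "fountain": "X2",
    "water heater": "M4",
    "hvac": "R9",
    "car": "T3",
    "cooking": "B6",
}
ID_TO_CATEGORY = {cid: name for name, cid in CATEGORY_IDS.items()}


def build_prompt(question):
    """Build a classification prompt that asks for an ID, not a category name."""
    lines = [f"{cid} = {name}" for name, cid in CATEGORY_IDS.items()]
    return (
        "Classify the question. Reply with exactly one two-character ID.\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nID:"
    )


def decode_label(model_output):
    """Map raw model output back to a category, or None if it is not a known ID."""
    candidate = model_output.strip().upper()[:2]
    return ID_TO_CATEGORY.get(candidate)


print(build_prompt("Why is the water cloudy after shocking it?"))
print(decode_label(" k7\n"))   # known ID, decodes to its category
print(decode_label("ac/air"))  # partial descriptive label, rejected
```

Because the decoder accepts only known IDs, a partial descriptive label such as ‘ac/air’ is rejected outright and can be routed to a fallback (for example, an unfiltered search) rather than being silently accepted as a category.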