At Counterpoint’s AI 360 Summit, held recently in Silicon Valley, Mohit Agrawal, Research Director at Counterpoint, sat down with Vinesh Sukumar, Qualcomm’s VP and Head of GenAI/ML, for a comprehensive overview of the evolving mobile AI landscape. The two discussed the challenges and strategies involved in running large language models (LLMs) effectively on mobile devices. Sukumar highlighted approaches ranging from fine-tuning LLMs for specific tasks to model-compression techniques such as quantization. The discussion also explored hybrid approaches that combine on-device processing with cloud-based LLM services.
• Quantization: One way to reduce model size is to represent weights with fewer bits, such as 4 or even 2 bits. With this compression, a model’s footprint can shrink from about 13GB to roughly 3.5-4GB (the arithmetic behind these figures is sketched after this list).
• Smaller models: Another technique is to fine-tune larger models on task-specific datasets, producing small language models (SLMs) of around 2-3 billion parameters.
• Instruction fine-tuning and retrieval-augmented generation (RAG): Smaller models can be paired with longer, instruction-rich prompts to improve accuracy, and with RAG a model can be combined with external knowledge sources such as knowledge graphs, which lets the model itself stay smaller (a minimal RAG sketch also follows this list).
• Multiple small models: One deployment strategy is to ship multiple small, task-specific models, each serving a different application.
• One large model with adapters: Alternatively, a single large base model can be paired with multiple lightweight adapters, one per task (an adapter sketch follows the list as well).
• Hybrid approach: As is already the case on most AI smartphones today, on-device processing is combined with cloud-based LLM services.
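To make the quantization numbers above concrete, here is a minimal back-of-the-envelope sketch in Python. The 7-billion-parameter count matches the example discussed at the summit; ignoring quantization overheads (per-group scales, zero points) and runtime memory (activations, KV cache) is a simplifying assumption, and the point is only how bits per weight translate into model footprint.

```python
# Back-of-the-envelope model-size arithmetic for weight quantization.
# Assumption: size is dominated by the weights; quantization overhead
# (scales, zero points) and runtime memory (activations, KV cache) are ignored.

GIB = 1024 ** 3  # bytes per GiB

def model_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a model at a given bit width."""
    return num_params * bits_per_weight / 8 / GIB

params_7b = 7e9  # a 7-billion-parameter model, as cited in the discussion
for bits in (16, 8, 4, 2):
    print(f"7B params @ {bits:>2}-bit ~= {model_size_gib(params_7b, bits):.1f} GiB")

# Approximate output:
#   7B params @ 16-bit ~= 13.0 GiB   -> the ~13GB figure quoted above
#   7B params @  8-bit ~=  6.5 GiB
#   7B params @  4-bit ~=  3.3 GiB   -> close to the quoted 3.5-4GB once
#   7B params @  2-bit ~=  1.6 GiB      per-group scales are added back
```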
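The RAG idea can also be illustrated with a minimal sketch. Everything here is a toy assumption: the in-memory KNOWLEDGE_BASE stands in for a real knowledge graph or vector store, retrieve() is naive keyword matching rather than a production retriever, and run_on_device_slm() is a hypothetical placeholder for whatever small on-device model runtime is used. The structural point is that the facts live outside the model, so the model can stay small.

```python
# Minimal retrieval-augmented generation (RAG) sketch.

KNOWLEDGE_BASE = {
    "battery": "The phone supports 65W fast charging and a 5,000mAh battery.",
    "camera": "The main camera is a 50MP sensor with optical stabilisation.",
    "display": "The display is a 6.7-inch 120Hz OLED panel.",
}

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Naive keyword retrieval: rank entries by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(query_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def run_on_device_slm(prompt: str) -> str:
    """Placeholder for a 2-3-billion-parameter on-device model call."""
    return f"<model response to: {prompt[:60]}...>"

def answer(question: str) -> str:
    # The retrieved facts are injected into the prompt at run time, so the
    # model does not need to memorise them in its weights.
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return run_on_device_slm(prompt)

print(answer("How fast does the battery charge?"))
```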
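Finally, the "one large model with adapters" pattern can be sketched as a frozen shared weight matrix plus small per-task low-rank adapters that are swapped in at run time, in the spirit of LoRA-style adapters. The hidden size, adapter rank and task names below are illustrative assumptions, not anything Qualcomm described.

```python
# Sketch of "one large model, many adapters": a shared frozen base weight
# plus small per-task low-rank adapters selected at run time.

import numpy as np

HIDDEN = 4096   # hidden size of the shared base layer (assumed)
RANK = 8        # adapter rank: each adapter adds only 2 * HIDDEN * RANK params

base_weight = np.random.randn(HIDDEN, HIDDEN).astype(np.float32)  # frozen, shared

# One tiny adapter per task; together they are far smaller than a second model.
adapters = {
    task: (np.random.randn(HIDDEN, RANK).astype(np.float32),   # A
           np.random.randn(RANK, HIDDEN).astype(np.float32))   # B
    for task in ("summarise", "translate", "reply_suggest")
}

def forward(x: np.ndarray, task: str) -> np.ndarray:
    """Apply the shared base layer plus the selected task-specific adapter."""
    a, b = adapters[task]
    return x @ base_weight + x @ a @ b  # shared path + low-rank task path

x = np.random.randn(1, HIDDEN).astype(np.float32)
print(forward(x, "summarise").shape)  # (1, 4096)

# Each adapter holds 2 * 4096 * 8 ~= 65K parameters versus ~16.8M in the base
# matrix, which is why many task adapters can share one large model cheaply.
```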
Which of these approaches makes sense depends on factors ranging from the use case and the application’s specific requirements to the memory available on the device, along with cost constraints and market-specific considerations.
• Smartphones are among the most prominent devices for running LLMs at the edge. On-device AI is driven by the need for privacy and security, and by the desire to run AI features without an internet connection.
• Smartphones are memory-constrained, as LLMs need significantly more than the 8-12GB of RAM currently available on devices. For example, a 7-billion-parameter model needs about 13GB of memory at 16-bit precision (as the quantization arithmetic sketched above shows). As a result, the industry needs to work on two fronts – make the hardware more capable, and reduce the size of AI models without compromising much on accuracy.
• In 2024, over 100 models with fewer than 10 billion parameters were launched, of which 31 were below 3 billion parameters. However, techniques like quantization and pruning are still needed to bring larger models’ capabilities, such as reasoning, to devices.