At Counterpoint’s AI 360 Summit, held recently in Silicon Valley, Mohit Agrawal, Research Director at Counterpoint, sat down with Vinesh Sukumar, Qualcomm’s VP and Head of GenAI/ML, for a comprehensive overview of the evolving mobile AI landscape. The two discussed the challenges and strategies involved in running large language models (LLMs) effectively on mobile devices. Sukumar highlighted approaches ranging from fine-tuning LLMs for specific tasks to model-compression techniques such as quantization. The discussion also explored hybrid approaches that combine on-device processing with cloud-based LLM services.
• Quantization: One way to reduce model size is to represent weights with fewer bits, such as 4 or even 2 bits. With this compression, a model’s footprint can shrink from about 13GB to roughly 3.5-4GB (the arithmetic behind these figures is sketched after this list).
• Smaller models: Another technique is to fine-tune larger models on task-specific datasets, producing small language models (SLMs) of around 2-3 billion parameters.
• Instruction fine-tuning and retrieval-augmented generation (RAG): Smaller models can be paired with longer, instruction-rich prompts to improve accuracy, and with RAG a model can be combined with external knowledge sources such as knowledge graphs, which lets the model itself stay smaller (a minimal RAG sketch also follows this list).
• Multiple small models: One deployment strategy is to ship multiple small, task-specific models, each serving a different application.
• One large model with adapters: Alternatively, a single large base model can be paired with multiple lightweight adapters, one per task (an adapter sketch follows the list as well).
• Hybrid approach: As is already the case on most AI smartphones today, on-device processing is combined with cloud-based LLM services.
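To make the quantization numbers above concrete, here is a minimal back-of-the-envelope sketch in Python. The 7-billion-parameter count matches the example discussed at the summit; ignoring quantization overheads (per-group scales, zero points) and runtime memory (activations, KV cache) is a simplifying assumption, and the point is only how bits per weight translate into model footprint.

```python
# Back-of-the-envelope model-size arithmetic for weight quantization.
# Assumption: size is dominated by the weights; quantization overhead
# (scales, zero points) and runtime memory (activations, KV cache) are ignored.

GIB = 1024 ** 3  # bytes per GiB

def model_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a model at a given bit width."""
    return num_params * bits_per_weight / 8 / GIB

params_7b = 7e9  # a 7-billion-parameter model, as cited in the discussion
for bits in (16, 8, 4, 2):
    print(f"7B params @ {bits:>2}-bit ~= {model_size_gib(params_7b, bits):.1f} GiB")

# Approximate output:
#   7B params @ 16-bit ~= 13.0 GiB   -> the ~13GB figure quoted above
#   7B params @  8-bit ~=  6.5 GiB
#   7B params @  4-bit ~=  3.3 GiB   -> close to the quoted 3.5-4GB once
#   7B params @  2-bit ~=  1.6 GiB      per-group scales are added back
```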
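The RAG idea can also be illustrated with a minimal sketch. Everything here is a toy assumption: the in-memory KNOWLEDGE_BASE stands in for a real knowledge graph or vector store, retrieve() is naive keyword matching rather than a production retriever, and run_on_device_slm() is a hypothetical placeholder for whatever small on-device model runtime is used. The structural point is that the facts live outside the model, so the model can stay small.

```python
# Minimal retrieval-augmented generation (RAG) sketch.

KNOWLEDGE_BASE = {
    "battery": "The phone supports 65W fast charging and a 5,000mAh battery.",
    "camera": "The main camera is a 50MP sensor with optical stabilisation.",
    "display": "The display is a 6.7-inch 120Hz OLED panel.",
}

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Naive keyword retrieval: rank entries by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda kv: len(query_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:top_k]]

def run_on_device_slm(prompt: str) -> str:
    """Placeholder for a 2-3-billion-parameter on-device model call."""
    return f"<model response to: {prompt[:60]}...>"

def answer(question: str) -> str:
    # The retrieved facts are injected into the prompt at run time, so the
    # model does not need to memorise them in its weights.
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return run_on_device_slm(prompt)

print(answer("How fast does the battery charge?"))
```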
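Finally, the "one large model with adapters" pattern can be sketched as a frozen shared weight matrix plus small per-task low-rank adapters that are swapped in at run time, in the spirit of LoRA-style adapters. The hidden size, adapter rank and task names below are illustrative assumptions, not anything Qualcomm described.

```python
# Sketch of "one large model, many adapters": a shared frozen base weight
# plus small per-task low-rank adapters selected at run time.

import numpy as np

HIDDEN = 4096   # hidden size of the shared base layer (assumed)
RANK = 8        # adapter rank: each adapter adds only 2 * HIDDEN * RANK params

base_weight = np.random.randn(HIDDEN, HIDDEN).astype(np.float32)  # frozen, shared

# One tiny adapter per task; together they are far smaller than a second model.
adapters = {
    task: (np.random.randn(HIDDEN, RANK).astype(np.float32),   # A
           np.random.randn(RANK, HIDDEN).astype(np.float32))   # B
    for task in ("summarise", "translate", "reply_suggest")
}

def forward(x: np.ndarray, task: str) -> np.ndarray:
    """Apply the shared base layer plus the selected task-specific adapter."""
    a, b = adapters[task]
    return x @ base_weight + x @ a @ b  # shared path + low-rank task path

x = np.random.randn(1, HIDDEN).astype(np.float32)
print(forward(x, "summarise").shape)  # (1, 4096)

# Each adapter holds 2 * 4096 * 8 ~= 65K parameters versus ~16.8M in the base
# matrix, which is why many task adapters can share one large model cheaply.
```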
Which of these approaches makes sense depends on factors ranging from the use case and the application’s specific requirements to the memory available on the device, along with cost constraints and market-specific considerations.
• Smartphones are among the most prominent devices for running LLMs at the edge. On-device AI is driven by the need for privacy and security, and by the desire to run AI features without an internet connection.
• Smartphones are memory-constrained, as LLMs need significantly more than the 8-12GB of RAM currently available on devices. For example, a 7-billion-parameter model needs about 13GB of memory at 16-bit precision (as the quantization arithmetic sketched above shows). As a result, the industry needs to work on two fronts – make the hardware more capable, and reduce the size of AI models without compromising much on accuracy.
• In 2024, over 100 models with fewer than 10 billion parameters were launched, of which 31 were below 3 billion parameters. However, techniques like quantization and pruning are still needed to bring larger models’ capabilities, such as reasoning, to devices.