Author: Abdoulaye

  • Small Language Models: Notes from the past couple of weeks 🤖🤯

    The past couple of weeks have brought interesting developments in small language models that could expand applications in mobile computing and low-resource environments.

    Here’s what caught my attention:

    • Microsoft’s Phi-4 was made fully open source (MIT license) and has been further improved by Unsloth AI, who fixed several bugs in the release. 🚀🔓 Blog: https://unsloth.ai/blog/phi4

    • Kyutai Labs, based in Paris 🇫🇷, introduced Helium-1 Preview, a 2B-parameter multilingual base LLM designed for edge and mobile devices (a loading sketch follows this list).

    Model: https://huggingface.co/kyutai/helium-1-preview-2b

    Blog: https://kyutai.org/2025/01/13/helium.html

    • OpenBMB, based in China 🇨🇳, released MiniCPM-o 2.6, an 8B-parameter multimodal model that matches the capabilities of several larger models. Model: https://huggingface.co/openbmb/MiniCPM-o-2_6

    • Moondream2 added gaze 👀 detection functionality, with interesting applications in human-computer interaction and market research.

    Blog: https://moondream.ai/blog/announcing-gaze-detection

    • OuteTTS, a series of small text-to-speech models, expanded to support six languages and punctuation for more natural-sounding speech synthesis. 🗣️

    Model: https://huggingface.co/OuteAI/OuteTTS-0.3-1B
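
    To make the edge-friendly claim concrete, here is a minimal sketch of loading Helium-1 Preview with Hugging Face transformers. It assumes the checkpoint works with the standard causal-LM auto classes on a recent transformers release; the prompt and generation settings are illustrative, so check the model card for exact requirements.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: kyutai/helium-1-preview-2b loads via the standard
    # causal-LM auto classes on a recent transformers release.
    model_id = "kyutai/helium-1-preview-2b"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # ~4 GB of weights for a 2B model in bf16
        device_map="auto",
    )

    # Helium-1 Preview is a base model (no instruction tuning),
    # so prompt it as a plain text continuation.
    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

    MiniCPM-o 2.6 and OuteTTS ship their own usage snippets on their model cards (custom loading code and a dedicated package, respectively), so follow those rather than this generic pattern.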

    These developments suggest continued progress in making language models more efficient and accessible, and we’re likely to see more of this in 2025.

    Note: Views in this post are my own.

  • Pastra – A Practical Guide to the Gemini Multimodal Live API

    Google’s Gemini Multimodal Live API provides developers with tools to build AI applications that process and respond to real-time multimodal input (audio, video, and text). Heiko Hotz, a Gemini expert at Google, has created a project called Pastra, a comprehensive guide to help developers get started with this technology.
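
    For a feel of what the API looks like from code, here is a minimal sketch of a text-only live session using the google-genai Python SDK; the model name and session methods reflect the SDK as of early 2025 and may have changed since, and the guide’s own JavaScript examples remain the authoritative reference.

    import asyncio
    from google import genai  # pip install google-genai

    client = genai.Client(api_key="YOUR_API_KEY")

    async def main():
        # The Live API holds a persistent WebSocket session with the model;
        # response_modalities can also be ["AUDIO"] for spoken replies.
        config = {"response_modalities": ["TEXT"]}
        async with client.aio.live.connect(
            model="gemini-2.0-flash-exp", config=config
        ) as session:
            await session.send(input="Hello, Gemini!", end_of_turn=True)
            # Responses stream back in chunks as they are generated.
            async for response in session.receive():
                if response.text:
                    print(response.text, end="")

    asyncio.run(main())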

    What the guide covers:

    • An introduction to the Gemini Multimodal Live API and its capabilities.
    • Practical code examples and tutorials for building applications.
    • Insights into real-time communication and audio-processing techniques used by Gemini, such as low-latency audio chunking and system Voice Activity Detection (VAD); a chunking sketch follows this list.
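
    As an illustration of the chunking idea (not code from the guide), the sketch below captures microphone audio as 16 kHz, 16-bit mono PCM and cuts it into small fixed-size chunks so each one can be shipped to the model as soon as it is recorded. The chunk size, sample rate, and the send_to_session helper are all illustrative assumptions.

    import pyaudio  # pip install pyaudio

    SAMPLE_RATE = 16_000   # 16 kHz mono PCM, a common speech format
    CHUNK_MS = 100         # small chunks keep end-to-end latency low
    FRAMES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000

    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=1,
                        rate=SAMPLE_RATE, input=True,
                        frames_per_buffer=FRAMES_PER_CHUNK)

    def send_to_session(chunk: bytes) -> None:
        """Hypothetical helper: forward one PCM chunk to the live session."""
        ...

    try:
        while True:
            chunk = stream.read(FRAMES_PER_CHUNK)  # blocks for ~100 ms
            send_to_session(chunk)                 # stream it out immediately
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()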

    Getting started with the guide:

    • Clone the repository:
    git clone https://github.com/heiko-hotz/gemini-multimodal-live-dev-guide
    • Add your API key: Update the index.html files in all the subdirectories manually with your API key, or run the command below after replacing the text “add your key here” with your key (a cross-platform Python alternative follows these steps):
    find . -name index.html -exec sed -i '' 's/const apiKey = '\''<YOUR_API_KEY>'\''/const apiKey = '\''add your key here'\''/g' {} \;
    • Start the server: 
    python server.py
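
    Note that the sed invocation above uses the macOS/BSD -i '' form; GNU sed on Linux takes -i with no argument. As a cross-platform alternative, a few lines of Python perform the same substitution, assuming the placeholder string matches what ships in the repository’s index.html files.

    from pathlib import Path

    API_KEY = "add your key here"  # paste your real Gemini API key here

    # Rewrite every index.html under the repo root, swapping the assumed
    # placeholder for your key.
    for page in Path(".").rglob("index.html"):
        text = page.read_text(encoding="utf-8")
        page.write_text(
            text.replace("const apiKey = '<YOUR_API_KEY>'",
                         f"const apiKey = '{API_KEY}'"),
            encoding="utf-8",
        )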

    The guide offers a practical starting point for developers interested in exploring the potential of the Gemini Multimodal Live API for building interactive AI applications. Have fun!