Category: AI & Machine Learning

  • My Notes on Exploring Google’s Health Foundation Models

    (Note: This post reflects my personal opinions and may not reflect those of my employer)

    Example of the HeAR encoder, which generates a machine learning representation (known as an “embedding”)

    This image is a spectrogram representing my name, “Abdoulaye,” generated from my voice audio by HeAR (Health Acoustic Representations). HeAR is one of the Health AI foundation models recently released by Google. I’ve been captivated by these foundation models lately: digging into them, playing with the demos and notebooks, reading the ML papers behind the models, and learning more about embeddings in general and their usefulness in low-resource environments. All of it started after playing with a couple of the notebooks.

    Embeddings are numerical representations of data. AI models learn to create these compact summaries (vectors) from various inputs like images, sounds, or text, capturing essential features. These information-rich numerical representations are useful because they can serve as a foundation for developing new, specialized AI models, potentially reducing the amount of task-specific data and development time required. This efficiency is especially crucial in settings where large, labeled medical datasets may be scarce.
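
    To make this concrete, here is a minimal Python sketch of the basic idea (the vectors and their dimensionality below are made up for illustration; real encoders produce hundreds or thousands of dimensions): inputs that are semantically similar end up with embeddings that point in similar directions, and that is what downstream models exploit.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine of the angle between two embedding vectors (1.0 = same direction).
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    cough_a = np.array([0.12, -0.48, 0.33, 0.90])   # hypothetical embedding of one cough clip
    cough_b = np.array([0.10, -0.50, 0.30, 0.95])   # hypothetical embedding of a similar cough
    speech = np.array([-0.70, 0.20, 0.60, -0.10])   # hypothetical embedding of a speech clip

    print(cosine_similarity(cough_a, cough_b))  # high: acoustically similar inputs
    print(cosine_similarity(cough_a, speech))   # lower: a different kind of sound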

    • If you would like to read further into what embeddings are, Vicki Boykis’ free essay is a great resource and also an ideal entry point into machine learning; I know many of my former colleagues from the telco and engineering world will love it: https://vickiboykis.com/what_are_embeddings/
    • For a technical perspective on their evolution, check out the word2vec paper: https://arxiv.org/abs/1301.3781

    The HeAR model, which processed my voice audio, was trained on over 300 million audio clips (e.g., coughs, breathing, speech). Its applications can extend to identifying acoustic biomarkers for conditions like TB or COVID-19. It uses a Vision Transformer (ViT) to analyze spectrograms. Below, you can see an example of a sneeze being detected within an audio file, followed by throat clearing detected near the end.
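
    As a rough sketch of that pipeline (audio in, spectrogram out, then an encoder producing an embedding), here is what the preprocessing step might look like in Python. The file name is hypothetical and the encoder call is a placeholder; check the official HeAR model card and notebooks for the real loading and inference code.

    import numpy as np
    import librosa

    audio, sr = librosa.load("my_name_recording.wav", sr=16000)      # hypothetical recording, mono waveform
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)  # mel spectrogram (image-like input)
    log_mel = librosa.power_to_db(mel, ref=np.max)                    # log scale, as typically visualized

    # embedding = hear_encoder(log_mel)  # placeholder: the real model returns a fixed-length vector
    print(log_mel.shape)  # (n_mels, time_frames)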

    Health event detector demo

    This release also includes other open-weight foundation models, each designed to generate high-quality embeddings:

    Derm Foundation (Skin Images) This model processes dermatology images to produce embeddings, aiming to make AI development for skin image analysis more efficient by reducing data and compute needs. It facilitates the development of tools for various tasks, such as classifying clinical conditions or assessing image quality.
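
    The data-efficiency claim boils down to a simple workflow: precompute embeddings once, then fit a small classifier on top of them. Here is a hedged sketch of that pattern; the random arrays stand in for real Derm Foundation embeddings and labels, and the embedding dimension is illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 6144))   # stand-in for precomputed image embeddings
    y_train = rng.integers(0, 2, size=200)   # stand-in binary labels (condition present / absent)

    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)              # the only training you do: a small, cheap model

    X_new = rng.normal(size=(5, 6144))       # embeddings of new, unseen images
    print(probe.predict_proba(X_new)[:, 1])  # predicted probability of the condition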

    Explore the Derm Foundation model site for more information, and use this link to download the model.

    CXR Foundation (Chest X-rays) The CXR Foundation model produces embeddings from chest X-ray images, which can then be used to train models for various chest X-ray related tasks. The models were trained on very large X-ray datasets. What got my attention: some models within the collection, like ELIXR-C, use an approach inspired by CLIP (contrastive language-image pre-training) to link images with text descriptions, enabling powerful zero-shot classification. This means the model might classify an X-ray for a condition it wasn’t specifically trained on, simply by understanding a text description of that condition, which I find fascinating. The embeddings can also be used to train models that detect diseases like tuberculosis with very little data; for instance, “models trained on the embeddings derived from just 45 tuberculosis-positive images were able to achieve diagnostic performance non-inferior to radiologists.” This data efficiency is particularly valuable in regions with limited access to large, labeled datasets. Read the paper for more details.
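
    To illustrate the zero-shot idea, here is a sketch of the CLIP-style mechanism only; this is not the actual ELIXR API, and the embedding functions below are stand-ins that return random vectors. With real paired encoders, the image and each candidate text description are embedded into the same space, and the closest description wins.

    import numpy as np

    rng = np.random.default_rng(0)

    def embed_image(path: str) -> np.ndarray:
        # Stand-in for the image encoder; a real model returns a learned vector.
        return rng.normal(size=128)

    def embed_text(text: str) -> np.ndarray:
        # Stand-in for the paired text encoder trained alongside the image encoder.
        return rng.normal(size=128)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    image_emb = embed_image("chest_xray.png")  # hypothetical file
    labels = [
        "normal chest x-ray",
        "chest x-ray with signs of tuberculosis",
        "chest x-ray with cardiomegaly",
    ]
    scores = [cosine(image_emb, embed_text(t)) for t in labels]
    print(labels[int(np.argmax(scores))])  # with real encoders, the best-matching description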

    Retrieve images by text queries demo

    Path Foundation (Pathology Slides) Google’s Path Foundation model is trained on large-scale digital pathology datasets to produce embeddings from these complex microscopy images. Its primary purpose is to enable more efficient development of AI tools for pathology image analysis. This approach supports tasks like identifying tumor tissue or searching for similar image regions, using significantly less data and compute. See the impressive Path Foundation demos on HuggingFace.
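
    Searching for similar image regions is essentially nearest-neighbor lookup in embedding space. A hedged sketch of that step follows; the random matrix stands in for Path Foundation embeddings of tissue patches, and the dimension is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    patch_embeddings = rng.normal(size=(10_000, 384))   # stand-in: one row per slide patch
    patch_embeddings /= np.linalg.norm(patch_embeddings, axis=1, keepdims=True)  # unit-normalize rows

    query = patch_embeddings[42]              # "find regions that look like this patch"
    similarities = patch_embeddings @ query   # cosine similarity, since rows are unit norm
    top5 = np.argsort(-similarities)[:5]
    print(top5)                               # indices of the most similar patches (includes 42 itself)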

    Path Foundation demos

    Outlier Tissue Detector Demo

    These models are provided as open weights with the goal of enabling developers and researchers to download and adapt them, fostering the creation of localized AI tools. In my opinion, this is particularly exciting for regions like Africa, where such tools could help address unique health challenges and bridge gaps in access to specialist diagnostic capabilities.

    For full acknowledgment of contributions from various institutions, including partners like the Center for Infectious Disease Research in Zambia, please refer to the details in the paper.

    For those interested in the architectural and training methodologies, the papers and concepts linked above (word2vec, ViT, CLIP, and the model-specific papers) are good starting points.

    #AIforHealth #FoundationModels #GlobalHealth #AIinAfrica #ResponsibleAI #MedTech #Innovation #GoogleResearch #Embeddings #MachineLearning #DeepLearning

  • Visualizing equations and functions using Gemini and Three.js (vibe coded)

    Visualizing Machine Learning: An Interactive 3D Guide to Gradient Descent & SVMs

    From Gaussian Curves to the Heat Equation

  • Managing ML Projects: A Guide for Beginners and Professionals

    How do you manage ML projects? 🤔  A question I hear often!
    Working in research over the years, I often got asked about the day-to-day of managing machine learning projects. That’s why I’m excited about Google’s new, FREE “Managing ML Projects” guide, which I can now point to going forward. It’s only about 90 minutes, but it’s a good start!

    It can be useful for:

    * Those entering the ML field 🚀: Providing a clear, structured approach.
    * Professionals seeking to refine their ML project management skills.
    * Individuals preparing for ML-related interviews: Offering practical insights and frameworks.

    This guide covers:

    * ML project lifecycle management.
    * Applying established project management principles to ML.
    * Navigating traditional and generative AI projects.
    * Effective stakeholder collaboration.

    If you’re curious about ML project management, or want to level up your skills, take a look!

    https://developers.google.com/machine-learning/managing-ml-projects

  • SigLIP 2: Multilingual Vision-Language Encoders Released

    Google DeepMind has released SigLIP 2, a family of open-weight (Apache 2.0) vision-language encoders trained on data covering 109 languages, including Swahili. The released models are available in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).



    Why is this important?

    This release offers improved multilingual capabilities, covering 109 languages, which can contribute to more inclusive and accurate AI systems. It also features better image recognition and document understanding. The four model sizes offer flexibility and potentially increased accessibility for resource-constrained environments.
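
    As a quick, hedged illustration of what using such an encoder can look like in practice, here is zero-shot image classification with the Hugging Face transformers pipeline. The model ID and image path below are assumptions; see the links below for the official checkpoints and examples.

    from transformers import pipeline

    # Assumed checkpoint name; confirm on the SigLIP 2 model cards.
    classifier = pipeline(
        task="zero-shot-image-classification",
        model="google/siglip2-base-patch16-224",
    )

    result = classifier(
        "street_scene.jpg",  # hypothetical local image path (a URL also works)
        candidate_labels=["a bus", "a market stall", "a bicycle"],
    )
    print(result)  # list of {'label': ..., 'score': ...}, highest score first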



    Models: https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md

    Paper: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    https://arxiv.org/pdf/2502.14786

    HuggingFace Blog and Demo: https://huggingface.co/blog/siglip2

    Google Colab: https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb

    Credits:  "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features" by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai (2025).
  • SMOL: New Open-Source Dataset for Low-Resource Language Machine Translation

    🎉  My colleagues and members of the language community have released SMOL, a new open-source dataset (CC-BY-4.0) designed for machine translation research. SMOL includes professionally translated parallel text for over 115 low-resource languages, with significant representation of over 50 African languages. This dataset is intended to provide a valuable resource for researchers working on machine translation for under-represented languages.

    Kindly check the paper for more details, including the limitations of this dataset.

    Paper: https://arxiv.org/pdf/2502.12301
    Dataset: https://huggingface.co/datasets/google/smol
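
    If you want to poke at the data, it is on the Hugging Face Hub. A hedged sketch follows: the config name below is hypothetical (the dataset is organized into subsets per language pair), so check the dataset card for the real configuration names and language codes.

    from datasets import load_dataset

    # Hypothetical English-Hausa subset; see https://huggingface.co/datasets/google/smol
    # for the actual config names.
    ds = load_dataset("google/smol", "smolsent__en_ha")

    split = next(iter(ds.values()))      # take whichever split the subset provides
    for row in split.select(range(3)):   # peek at a few parallel sentence pairs
        print(row)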

    List of languages:

    Afar
    Acoli
    Afrikaans
    Alur
    Amharic
    Bambara
    Baoulé
    Bemba (Zambia)
    Berber
    Chiga
    Dinka
    Dombe
    Dyula
    Efik
    Ewe
    Fon
    Fulfulde
    Ga
    Hausa
    Igbo
    Kikuyu
    Kongo
    Kanuri
    Krio
    Kituba (DRC)
    Lingala
    Luo
    Kiluba (Luba-Katanga)
    Malagasy
    Mossi
    North Ndebele
    Ndau
    Nigerian Pidgin
    Oromo
    Rundi
    Kinyarwanda
    Sepedi
    Shona
    Somali
    South Ndebele
    Susu
    Swati
    Swahili
    Tamazight
    Tigrinya
    Tiv
    Tsonga
    Tumbuka
    Tswana
    Twi
    Venda
    Wolof
    Xhosa
    Yoruba
    Zulu

    Credits:

    Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya, Djibrila Diane, and Solo Farabado Cissé. “SMOL: Professionally translated parallel data for 115 under-represented languages.”
  • Small Language Models: Notes from the past couple of weeks 🤖🤯

    The past few days have brought interesting developments in small language models that could expand applications in mobile computing and low-resource environments.

    Here’s what caught my attention:

    • Microsoft’s Phi-4 was made fully open source (MIT license) and has been improved by Unsloth AI. 🚀🔓 Blog: https://unsloth.ai/blog/phi4

    • Kyutai Labs, based in Paris 🇫🇷, introduced Helium-1 Preview, a 2B-parameter multilingual base LLM designed for edge and mobile devices.

    Model: https://huggingface.co/kyutai/helium-1-preview-2b

    Blog: https://kyutai.org/2025/01/13/helium.html

    • OpenBMB from China 🇨🇳 released MiniCPM-o 2.6, an 8B-parameter multimodal model that matches the capabilities of several larger models. Model: https://huggingface.co/openbmb/MiniCPM-o-2_6

    • Moondream2 added gaze 👀 detection functionality, with interesting applications for human-computer interaction and market research.

    Blog: https://moondream.ai/blog/announcing-gaze-detection

    • OuteTTS, a series of small text-to-speech model variants, expanded to support 6 languages and punctuation handling for more natural-sounding speech synthesis. 🗣️

    Model: https://huggingface.co/OuteAI/OuteTTS-0.3-1B

    These developments suggest continued progress in making language models more efficient and accessible, and we’re likely to see more of this in 2025.

    Note: The views in this post are my own.

  • Pastra – A Practical Guide to the Gemini Multimodal Live API

    Google’s Gemini Multimodal Live API provides developers with tools to build AI applications that process and respond to real-time multimodal input (audio, video, and text). Heiko Hotz, a Gemini expert at Google, has created a project called Pastra, a comprehensive guide to help developers get started with this technology.

    What the guide covers:

    • An introduction to the Gemini Multimodal Live API and its capabilities.
    • Practical code examples and tutorials for building applications.
    • Insights into real-time communication and audio processing techniques used by Gemini, such as low-latency audio chunking and system Voice Activity Detection (VAD).

    Getting started with the guide:

    • Clone the repository: 
    git clone https://github.com/heiko-hotz/gemini-multimodal-live-dev-guide
    • Add your API key: update the index.html files manually in all the subdirectories with your API key, or run the command below after replacing the text “add your key here” with your key: 
    find . -name index.html -exec sed -i '' 's/const apiKey = '\''<YOUR_API_KEY>'\''/const apiKey = '\''add your key here'\''/g' {} \;
    • Start the server: 
    python server.py

    The guide offers a practical starting point for developers interested in exploring the potential of the Gemini Multimodal Live API for building interactive AI applications. Have fun!