Author: Abdoulaye

  • My Notes on Exploring Google’s Health Foundation Models

    (Note: This post reflects my personal opinions and may not reflect those of my employer)

    Article content
    Example of the HeAR encoder that generates a machine learning representation (known as “embeddings”)

    This image is a spectrogram representing my name, “Abdoulaye,” generated from my voice audio by HeAR (Health Acoustic Representations). HeAR is one of the recently released Health AI foundation models by Google. I’ve been captivated by these foundation models lately, spending time digging into them, playing with the demos and notebooks, reading ML papers about the models, and also learning more about embeddings in general and their usefulness in low-resource environments. All of this started after playing with a couple of the notebooks.

    Embeddings are numerical representations of data. AI models learn to create these compact summaries (vectors) from various inputs like images, sounds, or text, capturing essential features. These information-rich numerical representations are useful because they can serve as a foundation for developing new, specialized AI models, potentially reducing the amount of task-specific data and development time required. This efficiency is especially crucial in settings where large, labeled medical datasets may be scarce.

    • If you would like to read further into what Embeddings are, Vicki Boykis’ essay is such a great free resource; this essay is also ideal to learn or dive into machine learning. I know many of my previous colleagues from the telco and engineering world will love this: https://vickiboykis.com/what_are_embeddings/
    • For a technical perspective on their evolution, check out the word2vec paper: https://arxiv.org/abs/1301.3781

    The HeAR model, which processed my voice audio, is trained on over 300 million audio clips (e.g., coughs, breathing, speech). Its application can extend to identifying acoustic biomarkers for conditions like TB or COVID-19. It utilizes a Vision Transformer (ViT) to analyze spectrograms. Below, you can see an example of sneezing being detected within an audio file, and later, throat clearing detected at the end.

    Health event detector demo

    Article content
    Health event detector demo

    This release also includes other open-weight foundation models, each designed to generate high-quality embeddings:

    Derm Foundation (Skin Images) This model processes dermatology images to produce embeddings, aiming to make AI development for skin image analysis more efficient by reducing data and compute needs. It facilitates the development of tools for various tasks, such as classifying clinical conditions or assessing image quality.

    Explore the Derm Foundation model site for more information and to download the model use this link.

    CXR Foundation (Chest X-rays) The CXR Foundation model produces embeddings from chest X-ray images, which can then be used to train models for various chest X-ray related tasks. The models were trained on very large X-ray datasets. What got my attention, some models within the collection, like ELIXR-C, use an approach inspired by CLIP (contrastive language-image pre-training) to link images with text descriptions, enabling powerful zero-shot classification. This means the model might classify an X-ray for a condition it wasn’t specifically trained on, simply by understanding a text description of that condition which i find fascinating. The embeddings generated can also be used to train models that can detect diseases like tuberculosis without a large amount of data; for instance, “models trained on the embeddings derived from just 45 tuberculosis-positive images were able to achieve diagnostic performance non-inferior to radiologists.” This data efficiency is particularly valuable in regions with limited access to large, labeled datasets. Read the paper for more details.

    Retrieve images by text queries

    Article content
    Retrieve images by text queries demo

    Path Foundation (Pathology Slides) Google’s Path Foundation model is trained on large-scale digital pathology datasets to produce embeddings from these complex microscopy images. Its primary purpose is to enable more efficient development of AI tools for pathology image analysis. This approach supports tasks like identifying tumor tissue or searching for similar image regions, using significantly less data and compute. See the impressive Path Foundation demos on HuggingFace.

    Article content
    Path foundation demos

    Outlier Tissue Detector Demo

    These models are provided as Open Weight with the goal of enabling developers and researchers to download and adapt them, fostering the creation of localized AI tools. In my opinion, this is particularly exciting for regions like Africa, where such tools could help address unique health challenges and bridge gaps in access to specialist diagnostic capabilities.

    For full acknowledgment of contributions from various institutions, including partners like the Center for Infectious Disease Research in Zambia, please refer to the detailed in the paper.

    For those interested in the architectural and training methodologies, here are some of the pivotal papers and concepts relevant to these foundation models:

    #AIforHealth #FoundationModels #GlobalHealth #AIinAfrica #ResponsibleAI #MedTech #Innovation #GoogleResearch #Embeddings #MachineLearning #DeepLearning

  • Visualizing equations and functions using Gemini and Three.js (Vibe coded )

    Visualizing Machine Learning: An Interactive 3D Guide to Gradient Descent & SVMs

    From Gaussian Curves to the Heat Equation

  • TxGemma Release: AI Models for Therapeutics Development 🧪🔬

    Google DeepMind has released TxGemma, a set of open-weight AI models designed for therapeutic development. These models, based on the Gemma architecture, are trained to analyze and predict characteristics of therapeutic entities during drug discovery. 💊

    The release includes ‘chat’ variants (9B and 27B) that can engage in dialogue and provide explanations for their predictions. Additionally, Agentic-Tx demonstrates the integration of TxGemma into an agentic system for multi-step research questions. 🤖

    A fine-tuning notebook is available for custom task adaptation:

    Execution is possible on a free T4 GPU after license acceptance and Hugging Face token provision:

    If you encounter issues with the provided fine-tuning notebook, you can check my pre-configured Colab notebook:

    Further resources:

    Credit for this release: Shekoofeh Azizi and other contributors. 🎉

  • Gemma 3: Massive Context, 35+ Languages, and Multimodal Capabilities

    🚨 Gemma 3 is out! It’s a family of open AI models (1B-27B parameters) featuring a 128k token context window (can work with very long documents and conversations), multilingual support (35+ languages, trained on 140+), and single GPU/TPU compatibility. I’m excited about its potential to increase accessibility to advanced AI models, especially in resource-constrained settings, and the multimodal capabilities that can enable diverse applications.

    Blog: https://blog.google/technology/developers/gemma-3/

    Technical report: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

    Developer guide: https://developers.googleblog.com/en/introducing-gemma3/

  • Spatial Queries on Hout Bay Data Using Gemini ‘s DataScience Agent

    I tested the Gemini Datascience agent with the Hout Bay (Cape Town, South Africa) building data footprint, asking simple spatial questions, “show me small houses” and “identify crowded areas” “what about large houses with few neighbors”. The agent generates interesting visualizations and can select various algorithms, for example it picked k-Nearest Neighbors (k-NN) to detect houses with adjacent neighbors. I spent wayyy too much time on this, but I really liked the interactive aspect to make refinements iteratively by just making suggestions and asking for alternatives, kind of chatting with a Datascience expert :). I guess you would call this Conversational geospatial data analysis?

  • Colab Updates: Julia Support and Gemini Data Science

    Google Colab has been updated with interesting new features. Julia is now supported natively, so no more need for workarounds! Plus, the Gemini Data Science agent is now more widely accessible. This agent lets you query data through simple prompts, like asking for trend visualizations or model comparisons. It aims to reduce the time spent on tasks like data loading and library imports. This can, for example, contribute to faster prototyping and more efficient data exploration.

    Blog on the Gemini Data Science Agent: https://developers.googleblog.com/en/data-science-agent-in-colab-with-gemini/

  • 10Gbps Over 1km: Taara’s Incredible Silicon Photonics Breakthrough

    I find this simply incredible. This new Taara chip is smaller than a fingernail, yet it can transmit data at 10 gigabits per second over a 1KM DISTANCE! 🤯🤯🤯

    “In tests at the Moonshot Factory labs, our team has successfully transmitted data at 10 Gbps (gigabits per second) over distances of 1 kilometer outdoors using two Taara chips. We believe this is the first time silicon photonics chips have transmitted such high-capacity data outdoors at this distance. And this is just the beginning. We plan to extend both the chip’s range and capacity by creating an iteration with thousands of emitters.”

    The previous version of Taara, the light bridge, steered light beams mechanically using mirrors and sensors. Now, they’ve shrunk it to the size of a coin, replacing much of the hardware with software.
    I’ve been a huge fan of this project for many years, and it’s exciting to see this ‘moonshot’ turning into reality. It can bring high-speed internet to underserved regions, change how data centers operate and so much more. Huge congrats to the Taara team!

    https://x.company/blog/posts/taara-chip

  • Managing ML Projects: A Guide for Beginners and Professionals

    How do you manage ML projects? 🤔  A question I hear often!
    Working in research over the years, I often got asked about the day-to-day of managing machine learning projects. That’s why I’m excited about Google’s new, FREE “Managing ML Projects” guide which I can now point to going forward. it’s only 90 minutes but a good start!

    It can be useful for:

    * Those entering the ML field 🚀: Providing a clear, structured approach.
    * Professionals seeking to refine their ML project management skills.
    * Individuals preparing for ML-related interviews: Offering practical insights and frameworks.

    This guide covers:

    * ML project lifecycle management.
    * Applying established project management principles to ML.
    * Navigating traditional and generative AI projects.
    * Effective stakeholder collaboration.

    If you’re curious about ML project management, or want to level up your skills, take a look!

    https://developers.google.com/machine-learning/managing-ml-projects

  • SigLIP 2: Multilingual Vision-Language Encoders Released

    Google DeepMind has released SigLIP 2, a family of Open-weight (Apache V2) vision-language encoders trained on data covering 109 languages, including Swahili. The released models are available in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).



    Why is this important?

    This release offers improved multilingual capabilities, covering 109 languages, which can contribute to more inclusive and accurate AI systems. It also features better image recognition and document understanding. The four model sizes offer flexibility and potentially increased accessibility for resource-constrained environments.



    Models: https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md

    Paper: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    https://arxiv.org/pdf/2502.14786

    HuggingFace Blog and Demo: https://huggingface.co/blog/siglip2

    Google Colab: https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb

    Credits:  "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features" by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai (2025).
  • SMOL: New Open-Source Dataset for Low-Resource Language Machine Translation

    🎉  My colleagues and members of the language community have released SMOL, a new open-source dataset (CC-BY-4) designed for machine translation research. SMOL includes professionally translated parallel text for over 115 low-resource languages, with a significant representation of over 50 African languages. This dataset is intended to provide a valuable resource for researchers working on machine translation for under-represented languages.

    Kindly check the paper for more details including limitations of this dataset.

    Paper: https://arxiv.org/pdf/2502.12301
    Dataset: https://huggingface.co/datasets/google/smol

    List of languages:

    Afar
    Acoli
    Afrikaans
    Alur
    Amharic
    Bambara
    Baoulé
    Bemba (Zambia)
    Berber
    Chiga
    Dinka
    Dombe
    Dyula
    Efik
    Ewe
    Fon
    Fulfulde
    Ga
    Hausa
    Igbo
    Kikuyu
    Kongo
    Kanuri
    Krio
    Kituba (DRC)
    Lingala
    Luo
    Kiluba (Luba-Katanga)
    Malagasy
    Mossi
    North Ndebele
    Ndau
    Nigerian Pidgin
    Oromo
    Rundi
    Kinyarwanda
    Sepedi
    Shona
    Somali
    South Ndebele
    Susu
    Swati
    Swahili
    Tamazight
    Tigrinya
    Tiv
    Tsonga
    Tumbuka
    Tswana
    Twi
    Venda
    Wolof
    Xhosa
    Yoruba
    Zulu

    Credits:

    Isaac Caswell and Elizabeth Nielsen and Jiaming Luo and Colin Cherry and Geza Kovacs and Hadar Shemtov and Partha Talukdar and Dinesh Tewari and Baba Mamadi Diane and Koulako Moussa Doumbouya and Djibrila Diane and Solo Farabado Cissé. SMOL: Professionally translated parallel data for 115 under-represented languages.