Author: Abdoulaye

  • TxGemma Release: AI Models for Therapeutics Development 🧪🔬

    Google DeepMind has released TxGemma, a set of open-weight AI models designed for therapeutic development. These models, based on the Gemma architecture, are trained to analyze and predict characteristics of therapeutic entities during drug discovery. 💊

    The release includes ‘chat’ variants (9B and 27B) that can engage in dialogue and provide explanations for their predictions. Additionally, Agentic-Tx demonstrates the integration of TxGemma into an agentic system for multi-step research questions. 🤖

    A fine-tuning notebook is available for custom task adaptation:

    Execution is possible on a free T4 GPU after license acceptance and Hugging Face token provision:

    If you encounter issues with the provided fine-tuning notebook, you can check my pre-configured Colab notebook:

    Further resources:

    Credit for this release: Shekoofeh Azizi and other contributors. 🎉

  • Gemma 3: Massive Context, 35+ Languages, and Multimodal Capabilities

    🚨 Gemma 3 is out! It’s a family of open AI models (1B-27B parameters) featuring a 128k token context window (so it can work with very long documents and conversations), multilingual support (35+ languages out of the box, pretrained on 140+), and single GPU/TPU compatibility. I’m excited about its potential to make advanced AI models more accessible, especially in resource-constrained settings, and about the multimodal capabilities that can enable diverse applications.

    Blog: https://blog.google/technology/developers/gemma-3/

    Technical report: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

    Developer guide: https://developers.googleblog.com/en/introducing-gemma3/

  • Spatial Queries on Hout Bay Data Using Gemini’s Data Science Agent

    I tested the Gemini Data Science agent on building footprint data for Hout Bay (Cape Town, South Africa), asking simple spatial questions such as “show me small houses”, “identify crowded areas”, and “what about large houses with few neighbors”. The agent generates interesting visualizations and can choose among various algorithms; for example, it picked k-Nearest Neighbors (k-NN) to detect houses with adjacent neighbors. I spent wayyy too much time on this, but I really liked the interactive aspect: you refine results iteratively just by making suggestions and asking for alternatives, kind of like chatting with a data science expert :). I guess you would call this conversational geospatial data analysis?
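
    To make the k-NN idea concrete, here is a minimal sketch of how distance to the k-th nearest neighbor can separate crowded from isolated buildings. It is not the code the agent generated: the function names, the threshold, and the sample centroids are all hypothetical, and it uses a brute-force search in pure Python rather than a spatial index.

```python
import math


def knn_distances(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor."""
    result = []
    for i, (xi, yi) in enumerate(points):
        # Brute-force distances to every other point, sorted ascending.
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        result.append(dists[k - 1])
    return result


def label_density(points, k=2, crowded_below=15.0):
    """Label a point 'crowded' if its k-th neighbor is closer than the threshold."""
    return [
        "crowded" if d < crowded_below else "isolated"
        for d in knn_distances(points, k)
    ]


# Hypothetical building centroids in metres (NOT real Hout Bay data).
centroids = [(0, 0), (5, 0), (0, 5), (6, 6), (200, 200)]
labels = label_density(centroids, k=2, crowded_below=15.0)
print(labels)
```

    With real footprints you would probably use projected coordinates and a spatial index instead of the quadratic loop, but the labeling logic stays the same.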

  • Colab Updates: Julia Support and Gemini Data Science

    Google Colab has been updated with interesting new features. Julia is now supported natively, so no more need for workarounds! Plus, the Gemini Data Science agent is now more widely accessible. This agent lets you query data through simple prompts, like asking for trend visualizations or model comparisons. It aims to reduce the time spent on tasks like data loading and library imports. This can, for example, contribute to faster prototyping and more efficient data exploration.

    Blog on the Gemini Data Science Agent: https://developers.googleblog.com/en/data-science-agent-in-colab-with-gemini/

  • 10Gbps Over 1km: Taara’s Incredible Silicon Photonics Breakthrough

    I find this simply incredible. This new Taara chip is smaller than a fingernail, yet it can transmit data at 10 gigabits per second over a 1KM DISTANCE! 🤯🤯🤯

    “In tests at the Moonshot Factory labs, our team has successfully transmitted data at 10 Gbps (gigabits per second) over distances of 1 kilometer outdoors using two Taara chips. We believe this is the first time silicon photonics chips have transmitted such high-capacity data outdoors at this distance. And this is just the beginning. We plan to extend both the chip’s range and capacity by creating an iteration with thousands of emitters.”

    The previous version of Taara, the Taara Lightbridge, steered light beams mechanically using mirrors and sensors. Now, they’ve shrunk it to the size of a coin, replacing much of the hardware with software.

    I’ve been a huge fan of this project for many years, and it’s exciting to see this ‘moonshot’ turning into reality. It can bring high-speed internet to underserved regions, change how data centers operate, and so much more. Huge congrats to the Taara team!

    https://x.company/blog/posts/taara-chip

  • Managing ML Projects: A Guide for Beginners and Professionals

    How do you manage ML projects? 🤔 A question I hear often!

    Working in research over the years, I have often been asked about the day-to-day of managing machine learning projects. That’s why I’m excited about Google’s new, FREE “Managing ML Projects” guide, which I can now point to going forward. It’s only 90 minutes, but it's a good start!

    It can be useful for:

    * Those entering the ML field 🚀: Providing a clear, structured approach.
    * Professionals seeking to refine their ML project management skills.
    * Individuals preparing for ML-related interviews: Offering practical insights and frameworks.

    This guide covers:

    * ML project lifecycle management.
    * Applying established project management principles to ML.
    * Navigating traditional and generative AI projects.
    * Effective stakeholder collaboration.

    If you’re curious about ML project management, or want to level up your skills, take a look!

    https://developers.google.com/machine-learning/managing-ml-projects

  • SigLIP 2: Multilingual Vision-Language Encoders Released

    Google DeepMind has released SigLIP 2, a family of open-weight (Apache 2.0) vision-language encoders trained on data covering 109 languages, including Swahili. The released models are available in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).

    Why is this important?

    This release offers improved multilingual capabilities, covering 109 languages, which can contribute to more inclusive and accurate AI systems. It also features better image recognition and document understanding. The four model sizes offer flexibility and potentially increased accessibility for resource-constrained environments.

    Models: https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/README_siglip2.md

    Paper: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    https://arxiv.org/pdf/2502.14786

    HuggingFace Blog and Demo: https://huggingface.co/blog/siglip2

    Google Colab: https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb

    Credits: "SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features" by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai (2025).

  • SMOL: New Open-Source Dataset for Low-Resource Language Machine Translation

    🎉 My colleagues and members of the language community have released SMOL, a new open-source dataset (CC BY 4.0) designed for machine translation research. SMOL includes professionally translated parallel text for 115 low-resource languages, including more than 50 African languages. It is intended as a resource for researchers working on machine translation for under-represented languages.

    Kindly check the paper for more details including limitations of this dataset.

    Paper: https://arxiv.org/pdf/2502.12301
    Dataset: https://huggingface.co/datasets/google/smol

    List of languages:

    Afar
    Acoli
    Afrikaans
    Alur
    Amharic
    Bambara
    Baoulé
    Bemba (Zambia)
    Berber
    Chiga
    Dinka
    Dombe
    Dyula
    Efik
    Ewe
    Fon
    Fulfulde
    Ga
    Hausa
    Igbo
    Kikuyu
    Kongo
    Kanuri
    Krio
    Kituba (DRC)
    Lingala
    Luo
    Kiluba (Luba-Katanga)
    Malagasy
    Mossi
    North Ndebele
    Ndau
    Nigerian Pidgin
    Oromo
    Rundi
    Kinyarwanda
    Sepedi
    Shona
    Somali
    South Ndebele
    Susu
    Swati
    Swahili
    Tamazight
    Tigrinya
    Tiv
    Tsonga
    Tumbuka
    Tswana
    Twi
    Venda
    Wolof
    Xhosa
    Yoruba
    Zulu

    Credits:

    Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry, Geza Kovacs, Hadar Shemtov, Partha Talukdar, Dinesh Tewari, Baba Mamadi Diane, Koulako Moussa Doumbouya, Djibrila Diane, and Solo Farabado Cissé. SMOL: Professionally translated parallel data for 115 under-represented languages.

  • Small Language Models: Notes from the past couple of weeks 🤖🤯

    The past couple of weeks have brought interesting developments in small language models that could expand applications in mobile computing and low-resource environments.

    Here’s what caught my attention:

    • Microsoft’s Phi-4 was made fully open source (MIT license) and has been improved by Unsloth AI. 🚀🔓 Blog: https://unsloth.ai/blog/phi4

    • Kyutai Labs, based in Paris 🇫🇷, introduced Helium-1 Preview, a 2B-parameter multilingual base LLM designed for edge and mobile devices.

    Model: https://huggingface.co/kyutai/helium-1-preview-2b

    Blog: https://kyutai.org/2025/01/13/helium.html

    • OpenBMB from China 🇨🇳 released MiniCPM-o 2.6, an 8B-parameter multimodal model that matches the capabilities of several larger models. Model: https://huggingface.co/openbmb/MiniCPM-o-2_6

    • Moondream2 added gaze 👀 detection functionality, with interesting applications in human-computer interaction and market research.

    Blog: https://moondream.ai/blog/announcing-gaze-detection

    • OuteTTS, a series of small text-to-speech model variants, expanded to support 6 languages and punctuation for more natural-sounding speech synthesis. 🗣️

    Model: https://huggingface.co/OuteAI/OuteTTS-0.3-1B

    These developments suggest continued progress in making language models more efficient and accessible, and we’re likely to see more of this in 2025.

    Note: Views in this post are my own.

  • Pastra – A Practical Guide to the Gemini Multimodal Live API

    Google’s Gemini Multimodal Live API provides developers with tools to build AI applications that process and respond to real-time multimodal input (audio, video, and text). Heiko Hotz, a Gemini expert at Google, has created a project called Pastra, a comprehensive guide to help developers get started with this technology.

    What the guide covers:

    • An introduction to the Gemini Multimodal Live API and its capabilities.
    • Practical code examples and tutorials for building applications.
    • Insights into real-time communication and audio processing techniques used by Gemini, such as low-latency audio chunking and system Voice Activity Detection (VAD).
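
    As a rough illustration of what low-latency audio chunking means, the sketch below splits a raw PCM buffer into short fixed-duration chunks that could be streamed one at a time. It is not taken from the guide or from the Gemini API: the function name, the 16 kHz / 16-bit mono format, and the 20 ms chunk size are all assumptions chosen for the example.

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 16000, sample_width: int = 2,
              chunk_ms: int = 20):
    """Yield fixed-duration chunks of mono PCM audio.

    All defaults are illustrative assumptions, not values from the guide.
    """
    # Bytes per chunk = samples per chunk * bytes per sample.
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        # The final chunk may be shorter than the others.
        yield pcm[start:start + bytes_per_chunk]


# One second of silence: 16-bit mono at 16 kHz.
one_second = bytes(16000 * 2)
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))
```

    Smaller chunks reduce the delay before the first audio reaches the server, at the cost of more messages per second; the guide itself is the place to look for the values it actually uses.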

    Getting started with the guide:

    • Clone the repository: 
    git clone https://github.com/heiko-hotz/gemini-multimodal-live-dev-guide
    • Add your API key: Either edit the index.html files in all the subdirectories by hand, or run the command below after replacing the text “add your key here” with your key (the -i '' form is for macOS/BSD sed; GNU sed on Linux takes -i without the empty argument): 
    find . -name index.html -exec sed -i '' 's/const apiKey = '\''<YOUR_API_KEY>'\''/const apiKey = '\''add your key here'\''/g' {} \;
    • Start the server: 
    python server.py
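
    If sed behaves differently on your platform, the same substitution can be sketched in Python. This is not part of the guide: the function name set_api_key is hypothetical, and the only detail taken from the steps above is the <YOUR_API_KEY> placeholder in the index.html files. Unlike the sed command, it replaces the placeholder wherever it appears in a file, not only in the const apiKey line.

```python
from pathlib import Path

# Placeholder string as it appears in the sed command from the steps above.
PLACEHOLDER = "<YOUR_API_KEY>"


def set_api_key(root: str, api_key: str) -> int:
    """Replace the placeholder in every index.html under root.

    Returns the number of files that were modified.
    """
    changed = 0
    for path in Path(root).rglob("index.html"):
        text = path.read_text(encoding="utf-8")
        if PLACEHOLDER in text:
            path.write_text(text.replace(PLACEHOLDER, api_key), encoding="utf-8")
            changed += 1
    return changed
```

    From the repository root, calling set_api_key(".", "your-actual-key") would update every index.html that still contains the placeholder. Avoid committing the edited files, since they then contain your key.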

    The guide offers a practical starting point for developers interested in exploring the potential of the Gemini Multimodal Live API for building interactive AI applications. Have fun!