This is a pretty niche project, but I wanted to build a proof of concept to see if I could avoid hundreds of Yoto cards ending up all over my kids’ room, along with the endless “I can’t find X” moments and needing to pair or manage Creative Cards from my phone.
So I built a small ESP32-S3 touchscreen device that talks directly to the official Yoto APIs. It boots in a few seconds and gives my six-year-old a simple interface where she can browse our Yoto library, pick what she wants, and tell the Yoto Player to start playing it.
On first boot, the device runs through the Yoto device auth flow and shows a QR code that you can scan with your phone to sign in.
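For anyone curious what that sign-in loop looks like, here is a minimal sketch in Python (easier to read than the on-device C++). It assumes the flow behaves like the standard OAuth 2.0 device authorization grant (RFC 8628); the `poll` callable is a hypothetical stand-in for whatever HTTP request the firmware makes to the token endpoint, and it is not taken from the repo.

```python
import time
from typing import Callable, Dict


def poll_for_token(
    poll: Callable[[], Dict],
    interval: float,
    expires_in: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the user approves the sign-in, following RFC 8628 section 3.5.

    `poll` is a hypothetical callable that performs one token request and
    returns the decoded JSON response as a dict.
    """
    waited = 0.0
    while waited < expires_in:
        sleep(interval)
        waited += interval
        response = poll()
        error = response.get("error")
        if error is None:
            return response  # success: response carries the access token
        if error == "authorization_pending":
            continue  # user has not finished signing in yet
        if error == "slow_down":
            interval += 5  # RFC 8628: back off by 5 seconds and keep polling
            continue
        # access_denied, expired_token, or anything unexpected
        raise RuntimeError(f"device authorization failed: {error}")
    raise TimeoutError("device code expired before the user signed in")
```

In the standard flow, the QR code would typically encode the `verification_uri_complete` value returned with the device code, so scanning it opens the sign-in page with the user code already filled in.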
After that, it retrieves the card/library list from Yoto and stores the data locally on an SD card. It also downloads the cover images from the Yoto CDN, scales them down, converts them for the display, and caches them locally too. You’d be surprised how slow an ESP32-S3 can feel once you start doing image fetching, decoding, scaling, rendering, and SD card access on-device.
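To give a feel for the "scale down and convert" step, here is a rough Python sketch of the idea. It assumes the display wants 16-bit RGB565 pixels (common on small TFT panels, but an assumption here) and uses nearest-neighbour scaling for simplicity; the function names are made up for illustration, and the firmware in the repo may do this differently.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each 0..255


def rgb888_to_rgb565(r: int, g: int, b: int) -> int:
    """Pack 8-bit RGB into 16 bits: 5 bits red, 6 bits green, 5 bits blue."""
    return ((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3)


def downscale_nearest(
    pixels: List[Pixel], src_w: int, src_h: int, dst_w: int, dst_h: int
) -> List[Pixel]:
    """Nearest-neighbour downscale of a row-major pixel list."""
    out = []
    for y in range(dst_h):
        sy = y * src_h // dst_h  # source row this output row samples
        for x in range(dst_w):
            sx = x * src_w // dst_w  # source column this output column samples
            out.append(pixels[sy * src_w + sx])
    return out


def to_rgb565_bytes(pixels: List[Pixel]) -> bytes:
    """Serialise pixels as big-endian RGB565 (byte order is display-dependent)."""
    buf = bytearray()
    for r, g, b in pixels:
        buf += rgb888_to_rgb565(r, g, b).to_bytes(2, "big")
    return bytes(buf)
```

Caching the converted bytes on the SD card means the slow part (fetch, decode, scale) only has to happen once per cover; after that the device can read the raw pixel data straight into a draw buffer.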
Overall I’m really happy with the result, and more importantly, so is my daughter. It provided instant value: she can choose what she wants without needing the physical cards or my phone.
That said, I’m probably going to pivot the project to a Raspberry Pi. The ESP32-S3 can do it, but the memory limits are tight, especially with a touchscreen UI and multiple card covers on screen at once. Rendering eight covers, caching nearby pages, and keeping the interface responsive eats through PSRAM pretty quickly.
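A quick back-of-the-envelope calculation shows why. The numbers below are illustrative assumptions, not measurements from the device: 160x160 pixel covers, 2 bytes per pixel (RGB565), eight covers per page, and three pages held in memory (the current page plus one either side).

```python
def cover_cache_bytes(
    width: int,
    height: int,
    covers_per_page: int,
    pages_cached: int,
    bytes_per_pixel: int = 2,  # assumes 16-bit RGB565 pixels
) -> int:
    """Raw pixel memory needed to keep `pages_cached` pages of covers in RAM."""
    return width * height * bytes_per_pixel * covers_per_page * pages_cached


# Hypothetical sizes: 160x160 covers, 8 per page, 3 pages cached.
total = cover_cache_bytes(160, 160, covers_per_page=8, pages_cached=3)
print(f"{total} bytes (~{total / 1024 / 1024:.2f} MiB)")
```

With those assumed numbers, the covers alone take 1,228,800 bytes, a little under 1.2 MiB, before counting display buffers, image decode buffers, networking, and the UI itself. ESP32-S3 modules commonly ship with 2 MB or 8 MB of PSRAM, so on the smaller parts this is already most of the budget.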
I'm providing the code as a snapshot of the art of the possible. I'll also hold my hands up for (possible) abuse here: I only really develop in Python, and after initial testing it was clear that ESP32 microcontrollers benefit greatly from C++ to perform well. So the code is totally vibe coded, with continuous steering from my engineering background to keep it from going wild.
Code is here:
https://github.com/eperdeme/yoto-touch/tree/main