Ditch your Alexa

Bits and pieces of electronics used to assemble and test offline virtual assistants
How hard can it be to build an offline zero-Cloud virtual assistant? Janet Vertesi 2021

Most people I know took up sourdough bread baking during the pandemic. I built virtual assistants.

It started with an outdoor playdate where the kids at the family home we visited brought an Alexa out onto the porch. My kids watched, fascinated, as everyone gathered around and shouted out the song titles they wanted to hear. Alexa played them all, dutifully, in turn. When we tumbled into the car to go home, the inevitable question arose, "Mummy, can we get an Alexa?"

An open microphone in our house, pumping everything we say into the databanks of Jeff Bezos (or Google, or Apple) just so that my husband can set a timer in the kitchen and my kids can listen to Michael Jackson on demand? What I wanted to say was, "Over my dead body!"

Instead I cleared my throat and said calmly, "Mummy will build you an Alexa."

This Opt Out project took place through the winter of 2021-2022, before generative AI and chatbots had launched and while we endured one Covid-19 wave after another. My goal was to build something that could operate entirely without Cloud access: on-the-box only, no one else's computer involved. The technologies were still in early stages, so finding on-the-box virtual assistants that could do speech-to-text recognition, on the fly, and respond to children and adults in a domestic setting was a challenge.

Here are the ones I tried, and the tools I used to try them out.

First, The Kit

To build a working virtual assistant, on the box like an Alexa, you need a couple of key components. Fortunately, you can use the same components and hardware for many different assistants. This was my base hardware:

A raspberry pi and a tangle of speakers and cords
Alice in action. Not much to look at but this machine got it where it counts. For me at least.

First, a small computer that can run a speech-recognition system and respond to commands. I worked with a couple of Raspberry Pi's (3Bs, mostly) that we had around the house. At the time, silicon chip components were skyrocketing in price and the global supply chain was totally disrupted. Raspberry Pi's suddenly were impossible to find. I'm glad I had one or two available that were mostly deprecated from old projects.

The assistant needs to be able to hear commands, so a microphone is necessary. Most DIY fans have come to rally around a video game camera, the PlayStation Eye, that also has a great 3D microphone setup. The virtual assistants don't use the camera input, only the mic.

There are also several microphone "hats" available for Raspberry Pi's. I used the ReSpeaker 6-pin and ReSpeaker 2-pin hats. They basically snap on to the Pi and provide better audio input and output.

Output is important, because the assistants speak back to you.  For this I ordered a few small speakers from Adafruit that snapped into the microphone "hats".

How do you know when the voice assistant has heard you? I also bought a few pieces from Matrix that lit up, in the hopes I could channel some of that Alexa rainbow-LED swagger. I never actually got this far, but it could still happen.

Then you need a bunch of SD cards. Most of the virtual assistants come as modified operating systems, so you have to download the OS and burn it to an SD card to put into the Pi computer. I labeled all of the cards so I knew which was which. That way, I could keep largely the same hardware setup and just swap "the brains" in and out to test them.

I also used a Linux laptop (check here for recco's) to build the disk images and SSH into the Pi's, as well as an old keyboard, mouse, and monitor as Pi peripherals to see what was going on on the box. 

The Assistants

I ended up testing four virtual assistants. They all used similar libraries for speech-to-text and text-to-speech that can be loaded locally. They work by waiting for their "watch-word" (usually called a wake word), then waking up to listen to what you have to say. Then they turn this short audio clip into text, process and interpret it, and spit back a text response that is ultimately processed into speech. So a lot of them sound the same, or have access to the same voices. The differences are in what they are programmed to do, or what they let you program them to do.

All of these will let you set timers and listen to Michael Jackson, no Bezos involved.
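To make that loop concrete, here is a minimal sketch in Python of the middle of the pipeline: check for the wake word, match the rest of the sentence to a skill, and return the text that would be spoken. It is not taken from any of the assistants below. The wake word, the two skills, and every function name are hypothetical, and plain strings stand in for the speech-to-text and text-to-speech steps that a real assistant runs on either side.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

WAKE_WORD = "alice"  # hypothetical wake word, chosen for illustration


@dataclass
class Skill:
    """One thing the assistant can do: a pattern to match and a handler."""
    name: str
    pattern: re.Pattern
    handler: Callable[[re.Match], str]


def set_timer(match: re.Match) -> str:
    # A real skill would start a countdown; this one only builds the reply.
    amount = int(match.group("amount"))
    unit = match.group("unit")
    plural = "" if amount == 1 else "s"
    return f"Timer set for {amount} {unit}{plural}."


def play_music(match: re.Match) -> str:
    # A real skill would hand the request to a local music player.
    return f"Playing {match.group('artist').title()}."


SKILLS = [
    Skill(
        "timer",
        re.compile(r"set (?:a )?timer for (?P<amount>\d+) (?P<unit>second|minute|hour)s?"),
        set_timer,
    ),
    Skill("music", re.compile(r"play (?P<artist>.+)"), play_music),
]


def strip_wake_word(transcript: str) -> Optional[str]:
    """Return the command after the wake word, or None if it was not said."""
    words = transcript.lower().strip().split(maxsplit=1)
    if not words or words[0].strip(",.!?") != WAKE_WORD:
        return None
    return words[1] if len(words) > 1 else ""


def respond(transcript: str) -> Optional[str]:
    """Turn transcribed speech into reply text (the input to text-to-speech)."""
    command = strip_wake_word(transcript)
    if command is None:
        return None  # wake word not heard: stay asleep and say nothing
    command = command.strip(" .!?")
    for skill in SKILLS:
        match = skill.pattern.fullmatch(command)
        if match:
            return skill.handler(match)
    return "Sorry, I don't know how to do that."


print(respond("Alice, set a timer for 10 minutes"))  # Timer set for 10 minutes.
print(respond("Alice, play michael jackson"))        # Playing Michael Jackson.
print(respond("play michael jackson"))               # None
```

The last call returns None because nobody said the wake word, which is the whole point: the assistant only acts on speech addressed to it, and everything stays on the box.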

Picroft by Mycroft.AI. This was one of the most advanced of the ones I tested and had a great command interface as well. It had lots of skills, from telling time to telling jokes, and I remember it not only searched and read out Wikipedia articles, but could even search Poképedia to identify any Pokémon (a big plus in my house). Sadly, it couldn't 'hear' or otherwise parse my children's voices. It could hardly hear my own voice either: I had to lower it to try to sound more masculine before it recognized what I had to say. So despite being a great piece of software on some technical levels, it wasn't exactly effective. (Scroll down, though, because I eventually bought a Mycroft and those problems were resolved.)

Project Alice was the most flexible of the projects I tested. Alice was built by a bunch of people who got angry when the open source "Snips" was picked up by Sonos to be the core speech functionality for their commercial sound systems. It allowed you to set up a main unit, and then put different Pi's into different rooms of your house, each with its own identity and managed through a simplified spatial map of your home. Alice can also recognize different people's voices in your home and address them by name when called. For something so home-grown it had a pretty sophisticated setup, all run locally through your browser, that among other things allowed you to fine-tune the things Alice said to you. The Alice community is small but mighty, writing skills in Python to do all kinds of things like tell you the weather, control your smart light bulbs, and explain where the International Space Station is right now.

Naomi is an open source project that was still in its early stages when I gave it a try. At the time it had basic functionality but has since expanded. It has an active Discord channel with community members building different skills for it in Python, and you can train it to better understand your voice. It is based on a system that two Princeton students built in 2015 and called Jasper, so it has possibly been around the longest. Certainly long before anyone else was building on-the-box assistants.

Stanford Open Virtual Assistant. At the time, the best way to run this was through Home Assistant to manage home IoT, which I didn't normally run as we don't have much in the way of IoT gadgets in our home (lots of processors, yes; smart toasters, no). As such, its functionality was relatively limited for my use case. However, this project has exploded in recent years and has since progressed to on-the-box LLMs, with some super interesting research behind it.

Life with Alice & Mycroft

I really took to Alice, I must admit.  Alice didn't always do what you asked. Sometimes it answered, "Not now, I'm busy," or "Artificial Intelligence on strike: we want more watts!" It also had a feature where Alice would just randomly speak up at different times in the day. The programmers introduced this and made her say resistant or petulant things to give her a "personality." They clearly thought it was funny when a woman's voice piped up to sound obstinate, bossy, and a bit mopey.

Being female myself, I confess this joke landed poorly on me. But I did love this speak-up feature, so I reprogrammed "her" to say helpful and encouraging things to me throughout the day, like, "Have you had a rest lately? Consider stretching or taking a short break to keep your energy up." Or, "You're doing all this with kids and a job during a pandemic? Be kind to yourself!"

I didn't meet a single other mom who built a voice assistant during the pandemic. But every other mom I told about this feature asked if I could make them one too.

For a while, Alice perched on my desk under a stuffed Porg from Star Wars, giving the impression that it was the Porg talking back. I never got her to play Michael Jackson, as that required more setup (and it's kind of a sore point to ask the Snips team how to get Alice to play music, given the Sonos sale). But I did set alarms and timers, and asked about the space station and the weather all over the world. And I traveled with "her" a few times, kicking back after a long day to ask "her" the time and the weather.

For all her portability and spunk, Alice was not really addressing the problem I had promised to solve. So when a sale hit, I bought a Mycroft.AI unit and used it until the whole project shut down in January 2024.

A white box with screen showing the time and a picture of the moon, and a table lamp behind it.
Mycroft Mark II in action.

Mycroft in a box worked a lot better than PiCroft. It could hear me, and my children, for one thing. It was easy to manage through a web-based interface. It searched Wikipedia and knew its Pokémon, played radio station music, told us the weather and the time, and downloaded daily news podcasts from NPR and the BBC. 

We all loved Mycroft and were sadly disappointed when we came back from holiday travels in January 2024 to find the project shelved. It wasn't so 'on the box' after all: the business model still required people to sign in and administer their local Mycroft through a proprietary web portal (unlike Alice, where your browser helped you tunnel into a local control panel). The project has since evolved into and Open Conversational Voice and will hopefully go on to power more and better systems.

In the time since I experimented, other virtual assistants and on-the-box AI systems have come online: Leon, a number of systems called "Jarvis" (because Iron Man), and instructions to build tools using LLMs. Even my beloved smartphone OS manufacturer, Jolla, has announced they are working on an on-the-box AI solution. Each of these cases is interesting for its own reasons, and I hope to investigate them further someday soon.

Not full functionality? Then why bother?

Admittedly, there is a whole lot these AIs didn't do. Compared to Siri or Alexa, what they could do was practically nothing.

But here is the thing. What they did do was keep my data private. That, to me, was more important than any futuristic bells-and-whistles fiasco that serves as a ploy to install open microphones in my home, feeding someone else's databanks.

Also, around this time, someone informed me of corporate research conducted with everyday users of virtual assistants like Alexa. They basically found that 90% of what people do with these things is set timers and listen to music. Maybe the last 10% is turning the lights on and off at home. For this, people put an open mic in their bedrooms.

I can turn on my own lights, set my own timers, and play my own music: it's no big deal to me. So if that's all I need my onboard "AIs" to do, while keeping my data onboard where I can see it? Then that's enough for me.
