Instead of being the beginning of a world full of digital assistants making our lives easier, Alexa has ushered in doubt and worry. It turns on when it shouldn’t. It listens when we don’t want it to. Alexa’s underlying machine learning models aren’t perfect (and to be fair, nobody expected they would be), but training those models requires data we create in the privacy of our own homes. That data needs to be annotated. And that means, somewhere, someone we don’t know is listening to us talk to a computer we might not have known was recording us in the first place.
The Adam Clark Estes piece we linked above spells out these issues in great detail, but the second paragraph lays out some of the most troubling stuff:
“Privacy advocates have filed a complaint with the Federal Trade Commission (FTC) alleging these devices violate the Federal Wiretap Act. Journalists have investigated the dangers of always-on microphones and artificially intelligent voice assistants. Skeptical tech bloggers like me have argued these things were more powerful than people realized and overloaded with privacy violations. Recent news stories about how Amazon employees review certain Alexa commands suggest the situation is worse than we thought.”
Now, a lot of us who work in tech understand that data is the bedrock for myriad companies and solutions out there. Collecting that data essentially makes those companies money and it makes their solutions smarter. But consumers aren’t thrilled with the idea that their every behavior is being cataloged, stored, analyzed, and sometimes viewed and labeled by a third party somewhere across the world.
Putting aside the privacy argument (not dismissing it, just setting it aside for a moment), Alexa also highlights a problem in data science and machine learning: data hoarding.
Just think about how much data is created by Alexa users. There are over 100 million Alexa devices in the wild today. About 66 million people have a smart speaker and Amazon’s Echo accounts for about 60% of that. That’s a truly staggering amount of voice data on its face. But now think about what we mentioned above: sometimes Alexa devices just turn on. They’re collecting even more data, some of it potentially illegal, some of it simply of poor quality.
Now, sure, some of that data is going to be really valuable, useful, informative data Amazon can use to make Alexa more sophisticated. No one denies that (though again, there are some privacy concerns). The issue is that not all data is created equal. Some of that data is actively bad, in fact.
Collecting all of that data, well, that’s a pervasive habit in our industry. But what you end up with is a scenario where it can become difficult to understand what the helpful data is and what the useless or harmful data looks like. The intent might’ve been to build better models from that galaxy of data but now you’re left with a huge data value problem. And that means you’ll need an advanced and extensive data curation process just to uncover the data that will actually make your models better.
Now, Amazon’s not exactly hurting for capital, though money alone doesn’t solve the problem. For the rest of us, the consequence of hoarding too much data, and assuming it’ll all be helpful or that we can work that out later, is a bit more frightening. Vacuuming up every piece of data you can runs the risk of polluting the models you build on that data.
It’s why we believe that both data collection and data curation are so critical for business success. Understanding which data will be useful for your project and which data won’t is essential. Once you have a data curation strategy, you can build your data collection process around it. Let us explain:
Say, for example, you’re working on autonomous vehicle models. That’s a data-greedy problem, but it’s also one where there’s a ton of harmful and redundant data that isn’t going to help your model. We can help you find what data really improves the behavior of your model now and over time as you continue training it. Understanding that, say, your model is greedy for rural road data means you can send drivers out to collect that data instead of squandering that valuable resource sitting in traffic on I-80. As the data your model needs evolves, your collection strategy can evolve with it.
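To make that a little more concrete, here is a minimal sketch of how curation results might steer collection. It is an illustration, not a description of any particular product: the `SliceStats` type, the `collection_priorities` function, the scenario names, and every number below are hypothetical. The assumption is that you can already measure your model’s error on held-out data for each driving scenario.

```python
from dataclasses import dataclass


@dataclass
class SliceStats:
    name: str           # hypothetical scenario label, e.g. "rural_road"
    train_samples: int  # labeled examples already collected for this scenario
    val_error: float    # model error rate on held-out data from this scenario


def collection_priorities(slices):
    """Rank scenarios for new data collection.

    Scenarios where the model performs worst come first; when two
    scenarios have the same error, the one with less data wins.
    """
    return sorted(slices, key=lambda s: (-s.val_error, s.train_samples))


# Made-up numbers, purely for illustration.
stats = [
    SliceStats("highway_traffic", train_samples=500_000, val_error=0.02),
    SliceStats("rural_road", train_samples=20_000, val_error=0.15),
    SliceStats("urban_intersection", train_samples=150_000, val_error=0.06),
]

for s in collection_priorities(stats):
    print(f"{s.name}: error={s.val_error:.2f}, samples={s.train_samples}")
```

With these made-up numbers, rural roads land at the top of the list and highway traffic at the bottom, which is the I-80 point above. A real curation process would weigh far more than a single error rate, but the loop is the same: measure where the model is weak, collect there, and re-rank as the model improves.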
In other words, having a data collection and curation strategy is just good business. Will that make your company as successful as Amazon? That’s a tall order. But you won’t find yourself wasting time training good models on bad data. And that’s certainly a good thing.