Synthetic data is the safe, low-cost substitute for real data that we require

Image credits: Pixabay

A new way to feed information to AIs

Babies learn to talk by hearing other people—mostly their parents—make sounds. Repetition and pattern discovery help infants associate sounds with meaning. They learn to make human-understandable sounds with practice.

Machine learning algorithms learn in a similar way, but their examples come from data painstakingly categorized by thousands of humans, whose labels teach the machine what each item means.

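That process is flawed: human-labeled data is slow and costly to produce, can expose private information, misses rare situations, and contains labeling mistakes. Now what? How can we generate enough privacy-respecting, non-problematic, accurately labeled data that covers every eventuality? More AI.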

Fake data helps AIs handle real data

Take Alphabet’s Waymo. The company built a completely simulated world in which simulated cars with simulated sensors can drive endlessly, collecting data the whole time. By 2020, Waymo had driven 15 billion simulated miles, compared with 20 million real-world miles: roughly 750 times as many.

This is synthetic data: in AI terms, “data applicable to a given situation that is not obtained by direct measurement.” More simply, AIs are creating fake data to help other AIs learn about the real world faster.

Making fake data

One of the most popular methods for generating synthetic data is the generative adversarial network, or GAN. A GAN pits two AIs against each other. One, the generator, produces synthetic data, while the other, the discriminator, tries to tell that data apart from the real thing. Each time the discriminator catches a fake, its feedback trains the generator to make more convincing fake data. You’ve probably seen one of the many this-X-does-not-exist websites that use GANs to generate images of people, cats, and buildings.
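
The back-and-forth between the AI that generates (the generator) and the AI that verifies (usually called the discriminator) can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any production system or this-X-does-not-exist site is built: the “real” data is just numbers drawn from a bell curve around a made-up value, the generator has a single learnable number, and the discriminator is a simple logistic classifier. Every name here (REAL_MEAN, train_gan, generate) is hypothetical, and real GANs use deep neural networks for both players.

```python
import math
import random

REAL_MEAN = 3.0  # assumption: the hidden "real world" value the generator must discover


def sigmoid(t):
    # Clamp the input so math.exp never overflows.
    t = max(-30.0, min(30.0, t))
    return 1.0 / (1.0 + math.exp(-t))


def train_gan(steps=3000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    shift = 0.0      # generator parameter: fake sample = noise + shift
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(REAL_MEAN, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + shift for _ in range(batch)]

        # Discriminator step: gradient ascent on log D(real) + log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += (1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w -= d * x
            grad_c -= d
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator step: gradient ascent on log D(fake), i.e. try to fool
        # the freshly updated discriminator.
        grad_shift = 0.0
        for x in fake:
            d = sigmoid(w * x + c)
            grad_shift += (1.0 - d) * w
        shift += lr * grad_shift / batch
    return shift, w, c


def generate(shift, n, seed=1):
    # Draw n synthetic samples from the trained generator.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) + shift for _ in range(n)]


if __name__ == "__main__":
    shift, w, c = train_gan()
    print("learned shift:", round(shift, 2), "target:", REAL_MEAN)
    print("synthetic samples:", [round(x, 2) for x in generate(shift, 5)])
```

The generator never sees the real data directly. It only learns from how well it fools the discriminator, yet its learned shift should end up close to REAL_MEAN. The exact number depends on the seed and the step count.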

Fake data is real data without the realness

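Synthetic data has several advantages over real-world data. First, you can produce far more of it without relying on humans to collect it. Second, synthetic data arrives perfectly labeled, because the program that generates each example already knows what it represents; that eliminates labor-intensive manual labeling and the mistakes that come with it. Third, synthetic data protects privacy and copyright, since it need not describe any real person or reproduce any existing work. Finally, it can reduce bias, because underrepresented cases can be generated on demand.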
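
The labeling advantage is easiest to see in code. The sketch below is a hypothetical toy, not a real simulator: it invents noisy color readings for traffic lights, and because the program picks the label before it creates the reading, every sample comes out labeled with no human involved. The names BASE_COLORS and make_labeled_samples, the color values, and the noise level are all assumptions made for illustration.

```python
import random

# Assumed "ideal" RGB reading for each traffic-light state (illustrative values).
BASE_COLORS = {
    "red": (220, 40, 40),
    "yellow": (230, 200, 50),
    "green": (40, 200, 80),
}


def make_labeled_samples(n, noise=20.0, seed=0):
    """Return n (pixel, label) pairs; the label is known by construction."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Choose the ground-truth label first...
        label = rng.choice(sorted(BASE_COLORS))
        # ...then synthesize a noisy reading for it, clipped to the 0-255 range.
        pixel = tuple(
            min(255, max(0, round(value + rng.gauss(0.0, noise))))
            for value in BASE_COLORS[label]
        )
        samples.append((pixel, label))
    return samples


if __name__ == "__main__":
    for pixel, label in make_labeled_samples(5):
        print(label, pixel)
```

Asking for more data is just a matter of raising n, and rare cases can be oversampled on purpose by changing how the label is chosen.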

Overall, the signs point to synthetic data dominating the training of safer, privacy-friendly AI models in the near future, which could mean smarter AIs for all of us.