Data about the real world is messy. Microsoft’s experiment with Tay was an extreme example of what is still a very common challenge in enterprise AI: available real-world training data often leads to substandard model performance. Companies developing computer vision (CV) models know this all too well, and the stakes are high: by some estimates, the CV market will grow from $7B in 2020 to $145B in 2030. Collection and annotation of training data represent 60% of that spend, in large part because getting the data right is still an unsolved problem.
Synthetic data has emerged in response to the shortcomings of real-world datasets. We believe training computer vision models with synthetic data will become the industry standard because of its clear advantages over manual collection.
Today we’re announcing our investment in Datagen, the leader in synthetic images and video for computer vision use cases. Its platform generates photorealistic labeled datasets based on specific criteria and distributions, allowing enterprise customers to train and test their machine learning models more efficiently.
We’re particularly excited to work with Datagen. The company joins a growing list of Scale investments in Cognitive Apps, next-gen enterprise systems that make use of machine learning, connectivity with other systems, and novel forms of data capture and use. They’re also in good company with other Scale investments in Israel like JFrog, WalkMe, and Papaya Global.
The Future of Computer Vision Training Data
There are multiple reasons it’s hard to get good training data for CV. Real-life pictures of people fall under a growing patchwork of local, national, and international privacy regulations. Public datasets get you part of the way there but tend to be too generalized for a given use case. Then, after acquiring and annotating the right data, you’re stuck with a static asset when most models require frequent, dynamic updates to perform optimally.
Datagen solves these problems by enabling granular control over your data, eliminating bias and privacy issues, and perfectly annotating domain-specific data. As a result, companies can easily increase the size and diversity of their datasets and improve their models across all CV use cases.
Datagen’s long-term product thinking really stands out to us. It is the only CV synthetic data platform that is entirely self-serve: customers can do everything through the app without needing to provide seed data to get started. This is possible because Datagen is building a platform whose infrastructure and content library are both highly scalable, allowing the company to expand across CV use cases without needing to work with individual customers first.
Datagen’s product advantages can be measured. The Datagen customers we spoke with measure model performance by comparing results from fully real-world datasets against results from a mix of real-world and Datagen synthetic data. They are consistently seeing increases in model performance with the addition of synthetic data.
Datagen Growing Rapidly
The Datagen product greatly simplifies the creation of custom datasets, hiding very real technical complexity behind a user-friendly UI. It’s a credit to the leadership of the company’s co-founders, CEO Ofir Chakon and CTO Gil Elbaz, who both earned advanced engineering degrees before founding Datagen. That acumen shows in their early customer list, spanning companies in industries as varied as electronics/tech, automotive, and consumer products.
Datagen has made tremendous progress in a highly technical arena in the three years since it was founded in Tel Aviv. We look forward to working with Ofir and Gil, the Datagen team, and the company’s investors in the years ahead.