Automatically Curated Data Improves Training of Machine Learning Models to Produce Better Results

Today, Machine Learning (ML) and Deep Learning (DL) enable organizations to solve problems and perform tasks that, in the past, they haven’t been able to perform easily. ML/DL models learn what to do from operational data, with surprising accuracy, to help us make better decisions.

But, there’s a catch. These models need a lot of data. In the case of supervised learning, they need lots of curated data. Data curation provides information about the collected data. Unfortunately, curating data is a challenge. Although data exists, it probably isn’t curated. It is a monumental, manual effort often involving many subject matter experts (SMEs) to curate data. Without curated data, supervised training of ML/DL models can’t be done.

To address this challenge, we developed an approach (based on a tool from Stanford University) to efficiently automate data curation of large datasets that are needed to train ML/DL models. We were awarded a Small Business Innovation Research (SBIR) grant by the Navy to create and test the concept. Our goal was to significantly reduce the time and cost associated with manually curating data—and reduce the number of SMEs. Additionally, we wanted to demonstrate that our automatically curated dataset is as good as or better than a manually curated dataset.

We started by manually curating a dataset of ship images used in our Machine Learning Ship Recognition Project to establish a baseline. As you would expect, ship recognition and classification are important tasks on Navy platforms. Identifying different ships (e.g., friendly or threat) quickly and accurately helps Commanding Officers make better and faster decisions.

Our next step was to use our data curation tool to label the same ship image dataset for the Ship Recognition ML model. We could then compare results between the automatically curated and the manually curated images. So, how did we do? Our automated labeling of an image took less than 4 seconds, whereas manual labeling typically took over 25 seconds per image. Using this data in our Ship Classification model, we correctly identified about 85% of the images.

Initial testing shows that the automated data curation tool can label much larger image datasets significantly faster than the manual approach. Over six times faster!* With supervised ML model training, more data should improve the ML model’s results. In other words, more data should yield better results. We verified this, as well. We trained our Ship Recognition ML model using small and large image datasets. As expected, the ship classification accuracy improved when we used the larger dataset.

The takeaway from our SBIR is that automatically curated data will allow us to use larger datasets to train ML models to yield more accurate results. Plus, automation significantly reduced cost and time and will allow us to utilize new operational data faster to better support the warfighter. Going forward, we will extend our approach to curate other data types to support a wider variety of ML models for other problems and Navy challenges.

*Results are based on using a mid-range laptop. Results will vary depending on the hardware computing resources used to perform automated data curation.