The reliable data needed to train machine learning (ML) models is frequently difficult to obtain. Nearly all ML models and artificial intelligence (AI) applications are developed using actual or representative data. Within highly sensitive and classified operating environments (i.e., the defense and intelligence communities), the amount of available data is often insufficient to create performant ML models. Synthetic data generation tools have the potential to fill gaps in actual data sources, thereby improving the performance of ML models and AI-enabled scenario simulations. In the digital battlespace of the future, synthetic data generation techniques may help visualize unknown operating environments, generate data to train algorithmic models that analyze the operational picture, and enable improved AI capabilities.
Patriot Labs is interested in exploring customer and end-user opportunities requiring large-scale synthetic data generation for sequential, time-series data (e.g., data from sensors, radars, networks, communications, and imagery). Synthetic data inputs can reflect high-impact, low-probability events and scenarios where data is especially hard to collect, allowing AI to learn from both real and simulated experiences.
Requirement evaluation criteria could be based on the quality of synthetic data measured along three dimensions: (1) fidelity (a measure of similarity between the synthetic and the original data), (2) utility (the performance of the synthetic data in downstream applications compared to the original dataset), and (3) privacy (the extent to which statistical components of the original data are replicated and the degree of leaked information). Additionally, synthesis accuracy may be evaluated based on how closely the marginal distributions of the generated data match those of the original data, the granularity of sampling error, and the preservation of constraints and attribute relationships. Requirements may include methods for effectively expanding the corpus of data available to the defense and intelligence community (and industry partners) for training and testing ML models and AI applications. Synthetic data should be labeled at generation time, thereby reducing additional data labeling efforts.
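As an illustration of how the fidelity and privacy dimensions above might be quantified, the following minimal Python sketch compares marginal distributions and checks how close synthetic records sit to real ones. The choice of metrics is an assumption made for illustration, not a criterion stated in this topic: the two-sample Kolmogorov-Smirnov statistic stands in for marginal fidelity, and the distance from each synthetic record to its closest real record stands in as a rough leakage indicator. The function names (`ks_statistic`, `marginal_fidelity`, `min_distance_to_real`) are hypothetical.

```python
import numpy as np


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([real, synth])
    cdf_real = np.searchsorted(real, grid, side="right") / real.size
    cdf_synth = np.searchsorted(synth, grid, side="right") / synth.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def marginal_fidelity(real, synth):
    """Per-feature KS statistic for arrays of shape (n_samples, n_features).
    Lower values mean the synthetic marginal is closer to the real one."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    if real.ndim != 2 or synth.ndim != 2 or real.shape[1] != synth.shape[1]:
        raise ValueError("expected 2-D arrays with the same number of features")
    return np.array(
        [ks_statistic(real[:, j], synth[:, j]) for j in range(real.shape[1])]
    )


def min_distance_to_real(real, synth):
    """Euclidean distance from each synthetic row to its closest real row.
    Distances at or near zero suggest real records were copied (a leak)."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    diffs = synth[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 3))       # stand-in for real sensor features
    synth = rng.normal(size=(500, 3))      # stand-in for generator output
    print("per-feature KS:", marginal_fidelity(real, synth))
    print("closest-record distance (min):", min_distance_to_real(real, synth).min())
```

Marginal checks of this kind do not capture temporal structure or relationships between attributes, so for time-series data they would be one input alongside utility tests on downstream models and constraint checks.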
Approaches sought could include methods of applying synthetic data generation to train deep neural networks for accurate classification of objects and/or events of interest. Special consideration will be given to requirements that lead to the development of a workbench for accelerated synthetic data generation and data profiling of otherwise limited mission-focused sensor datasets.