Using Python with Gretel.ai to Generate Synthetic Location Data
Header Photo Credit: sylv1rob1 via Shutterstock
How Gretel.ai trained a FastCUT GAN using Python to generate realistic synthetic location data for any city in the world.
At Gretel.ai, our mission is to make it fast and easy for developers and data scientists to create production-grade synthetic data. To achieve this, we’ve designed a series of APIs that let anyone get up and running within minutes, so they can identify, transform, and generate the data needed to fuel the testing of modern software applications and AI/ML models. Python is the engine that powers much of Gretel’s research, development, and deployment of our APIs and toolkit. From a user experience perspective, Python’s extensive libraries and frameworks (e.g., scikit-learn and TensorFlow for machine learning, spaCy for text processing, and NumPy for data exploration), its ability to handle complex data structures, and its turnkey integrations help us ensure Gretel’s platform is easy to use and extensible to any workflow or project.
In this post, we highlight how, with the support of Python, we created a GAN-based location generator that uses map images and geolocation data to create new synthetic training data, helping the model predict where a human (or, in this case, an e-bike) might be, for any location in the world, with a high degree of statistical accuracy. This proof of concept for making better predictions by combining and contextualizing different types of data has applications across industries, such as improving medical diagnoses and financial market forecasts, and even building realistic simulations in the metaverse.
If you want to try this experiment yourself, all the tools, code, and data are open-sourced and available on GitHub.
An Overview of the Process
Generating realistic user location data for testing or modeling simulations is a hard problem. Current approaches simply create random locations inside a bounding box, placing users in waterways or on top of buildings. This inability to create accurate synthetic location data stifles many innovative projects that need diverse, complex datasets to fuel their work.
Gretel’s approach is to model this problem by encoding e-bike locations as pixels in an image, then treating the task as an image-to-image translation problem similar to CycleGAN, pix2pix, and StyleGAN. For this study, we used the newer contrastive unpaired translation (FastCUT) model created by the authors of pix2pix and CycleGAN, as it’s memory-efficient, fast to train (i.e., useful for higher-resolution locations), and generalizes well with minimal parameter tuning.
For this case study, we wanted to test whether we could accurately predict e-bike locations in one city by training a GAN model on publicly available e-bike data from other cities.
To do this, we first fed our model image data of different city maps, including Washington, DC, Denver, and San Diego, then separately trained the model on tabular data of e-bike locations throughout those cities, including time-series data capturing the flow of e-bike traffic. Here is an example of what the raw data looks like before and after it was combined:
Creating contextual learning by combining time-series and image data
There were three essential steps for training our model. First, we created the training data: Domain A, a corpus of map images annotated with precise e-bike locations, and Domain B, the same maps without locations.
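To make the "locations as pixels" idea concrete, here is a minimal sketch of the kind of linear lat/lon-to-pixel mapping that places each e-bike on a map tile. The bounding box, tile size, and function name are illustrative assumptions, not Gretel's actual implementation (which is in the linked GitHub repo):

```python
def latlon_to_pixel(lat, lon, bbox, size):
    """Map a (lat, lon) point to (x, y) pixel coordinates on a map tile.

    bbox is (south, west, north, east) in degrees; size is (width, height).
    A plain linear mapping is adequate at city scale; note that y grows
    downward in image coordinates, so latitude is measured from the north edge.
    """
    south, west, north, east = bbox
    width, height = size
    x = int((lon - west) / (east - west) * (width - 1))
    y = int((north - lat) / (north - south) * (height - 1))
    return x, y

# Hypothetical tile covering a slice of Washington, DC
bbox = (38.88, -77.05, 38.92, -77.00)
size = (256, 256)

# Each e-bike location becomes one marked pixel in a Domain A image
print(latlon_to_pixel(38.90, -77.025, bbox, size))  # → (127, 127)
```

Painting these pixels onto the map tiles yields Domain A; the untouched tiles are Domain B.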
Next, we trained our FastCUT model on this new training data (which included both the labeled and unlabeled map images) by teaching it to translate Domain B → Domain A.
Lastly, once our initial model was trained, we generated our synthetic dataset, which we then used to further test and optimize the model’s predictions of realistic user locations for a new city map. This generative process required downloading maps for a new target location (Domain C), running inference on the FastCUT model to predict e-bike locations (in other words, translating Domain C → Domain A), and finally processing those images with OpenCV-Python to find the predicted e-bike locations and convert them to geolocation (i.e., latitude/longitude) data points. With this information, we had built our synthetic location dataset and were ready for testing.
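The post-processing step can be sketched without OpenCV: scan the generated image for marker-colored pixels and invert the linear mapping back to coordinates. The marker color, bounding box, and image representation (a row-major grid of RGB tuples) are stand-ins for illustration; the real pipeline in the GitHub repo uses OpenCV-Python for the image processing:

```python
def pixel_to_latlon(x, y, bbox, size):
    """Invert a linear lat/lon-to-pixel mapping back to (lat, lon)."""
    south, west, north, east = bbox
    width, height = size
    lon = west + x / (width - 1) * (east - west)
    lat = north - y / (height - 1) * (north - south)
    return lat, lon

def extract_locations(image, bbox, marker=(255, 0, 0)):
    """Find marker-colored pixels in a grid of RGB tuples and convert
    each hit to a geolocation data point."""
    height, width = len(image), len(image[0])
    hits = []
    for y, row in enumerate(image):
        for x, rgb in enumerate(row):
            if rgb == marker:
                hits.append(pixel_to_latlon(x, y, bbox, (width, height)))
    return hits

# A tiny stand-in for a generated Domain A image: one red marker pixel
bbox = (38.88, -77.05, 38.92, -77.00)
image = [[(240, 240, 240)] * 256 for _ in range(256)]
image[127][127] = (255, 0, 0)
locations = extract_locations(image, bbox)  # one (lat, lon) near the center
```

Collecting these (lat, lon) points across all generated tiles produces the synthetic location dataset.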
The Results: San Diego → San Francisco → Tokyo
With our model trained on real-world San Diego e-bike data, we fed it map images from other U.S. cities, such as San Francisco, and asked it to predict the missing e-bike location data. The output was predictions that were 90% statistically correlated with the real-world locations of the actual e-bikes!
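One simple way to quantify that kind of agreement is to bin the real and synthetic locations into a coarse grid and correlate the two density histograms. The grid resolution, bounding box, and point sets below are made up for illustration, and the post does not specify the exact metric Gretel used; this is just a sketch of the idea:

```python
import math

def grid_counts(points, bbox, n=8):
    """Bin (lat, lon) points into an n x n spatial grid; return flat counts."""
    south, west, north, east = bbox
    counts = [0] * (n * n)
    for lat, lon in points:
        row = min(int((lat - south) / (north - south) * n), n - 1)
        col = min(int((lon - west) / (east - west) * n), n - 1)
        counts[row * n + col] += 1
    return counts

def pearson(a, b):
    """Pearson correlation between two equal-length count vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

bbox = (32.70, -117.18, 32.76, -117.12)  # hypothetical San Diego window
real = [(32.71, -117.17), (32.71, -117.17), (32.75, -117.13)]
synthetic = [(32.712, -117.171), (32.711, -117.169), (32.749, -117.131)]
score = pearson(grid_counts(real, bbox), grid_counts(synthetic, bbox))
```

A score near 1.0 means the synthetic points occupy the same neighborhoods, at the same densities, as the real ones.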
In other words, we successfully projected the relationships of those unique data attributes in a completely new environment, at a very different time.

Comparison of real-world and synthetic e-bike locations
We tested the same process and model on the city of Tokyo, too, with similarly positive results. Note that depending on where a city sits on the globe, the physical distance spanned by each degree of latitude or longitude can vary significantly, so you’ll need an ellipsoid-based model to calculate precise offsets when mapping pixels to locations. Fortunately, the geopy Python library makes this easy.

E-bike locations predicted across downtown Tokyo
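A quick back-of-the-envelope calculation shows why the degree-to-distance conversion matters. This spherical approximation (mean Earth radius and the Tokyo latitude below are illustrative) is good enough to see the effect; for precise offsets you would use an ellipsoid-based calculation such as geopy's geodesic distances:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def meters_per_degree(lat_deg):
    """Approximate ground distance of one degree of latitude and longitude
    at a given latitude. One degree of longitude shrinks with cos(latitude)."""
    m_per_deg_lat = math.pi * EARTH_RADIUS_M / 180
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat_deg))
    return m_per_deg_lat, m_per_deg_lon

# One degree of longitude near Tokyo (~35.7° N) vs. at the equator
lat_m, lon_tokyo = meters_per_degree(35.68)
_, lon_equator = meters_per_degree(0.0)
```

At Tokyo's latitude a degree of longitude covers roughly 90 km versus about 111 km at the equator, so a naive constant conversion would shift every predicted location by kilometers.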
There were some definite false positives when viewing the data across Tokyo, particularly locations generated in waterways. Further model tuning, or providing more negative examples of waterways in the training data (Domain A or Domain B), would likely reduce these false positives. Still, the results are encouraging given how little model or dataset tuning was done, with the model seemingly able to mimic the distributions and locations of the e-bike dataset it was trained on, using maps from a different part of the world.
The important takeaway is that this proof of concept shows you can make accurate predictions with synthetic location data from anywhere, at any time, because the underlying attributes are interchangeable (at least when dealing with e-bikes).
In this post, we experimented with combining context from the visual domain (e.g., city map data) with tabular data to create realistic location data for cities around the globe. This exciting case study has far-reaching implications for the development of advanced software applications and powerful AI/ML models across industries. For example, using similar machine learning techniques, healthcare practitioners could someday process the text of written doctor’s notes along with X-ray images, then synthesize secure versions of that data to better diagnose and treat patients. With quality synthetic data (and a little Python code), the possibilities are endless.
Try the experiment yourself and let us know what you think would be an exciting use case for the FastCUT GAN location generator.
About the Author Alex Watson is a co-founder and CPO of Gretel.ai, the developer stack for synthetic data.