CIFAR-10 Convolutional Neural Network Parallel Training on Google Cloud
This application demonstrates parallel training of a convolutional neural network (CNN) on the CIFAR-10 dataset on Google Cloud. It uses wfl together with the Google Cloud Batch API to train multiple model instances concurrently and stores the output in a Google Cloud Storage bucket. It runs on Spot instances to reduce cloud costs.
Prerequisites
- A Google Cloud account with billing and APIs enabled
- A Google Cloud Storage bucket
- Google Cloud SDK installed and configured on your local machine
- Go installed on your local machine
Setup
- Clone this repository:
git clone https://github.com/dgruber/wfl.git
cd wfl/examples/convolutionalnn_googlebatch
- Set the required environment variables:
export GOOGLE_PROJECT="your-google-project-id"
export GOOGLE_BUCKET="your-google-bucket-name"
Replace your-google-project-id and your-google-bucket-name with your Google Cloud project ID and Google Cloud Storage bucket name, respectively.
- Build the container image and push it to Google Container Registry:
make build
make push
Run
Execute the wfl application:
make run
The application will perform the following steps:
- Create a Google Batch context for running training jobs using your specified Google Cloud project and bucket.
- Run a data preparation job that splits the CIFAR-10 dataset into multiple parts for parallel training.
- Submit multiple parallel training jobs using different parts of the dataset.
- Wait for all training jobs to complete.
- Print the accuracy and runtime of each training job.
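The fan-out/wait pattern behind the submit, wait, and report steps above can be sketched in plain Go. Goroutines stand in here for the Google Batch jobs that wfl submits, and trainPart is a hypothetical placeholder for one training job; the real orchestration lives in cifar.go:

```go
package main

import (
	"fmt"
	"sync"
)

// trainPart is a hypothetical stand-in for one training job that
// processes a single part of the split CIFAR-10 dataset.
func trainPart(part int) string {
	return fmt.Sprintf("part %d trained", part)
}

func main() {
	const parts = 4 // number of parallel training jobs

	results := make([]string, parts)
	var wg sync.WaitGroup

	// Submit one job per dataset part.
	for i := 0; i < parts; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = trainPart(i)
		}(i)
	}

	// Wait for all training jobs to complete.
	wg.Wait()

	// Report the result of each job.
	for _, r := range results {
		fmt.Println(r)
	}
}
```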
Output
The application prints the progress, status, and results of each job to the console. The trained model files, accuracy results, and logs are stored in the specified Google Cloud Storage bucket.
Customization
You can customize the number of parallel training jobs, machine types, and other job parameters by modifying the cifar.go file and rebuilding the application.
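As an illustration of the kind of knobs involved, such parameters might be grouped as constants along these lines (the names and values are hypothetical; check cifar.go for the actual ones):

```go
package main

import "fmt"

// Hypothetical tuning parameters; the real definitions live in cifar.go.
const (
	parallelJobs = 4               // number of concurrent training jobs
	machineType  = "e2-standard-4" // Google Cloud machine type per job
)

func main() {
	fmt.Printf("running %d jobs on %s machines\n", parallelJobs, machineType)
}
```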