Cloud Solutions for GSE Research and Data Analysis

Project Type: Data Science
Topics: Data Infrastructure, Data Analysis
Client: Various Research Labs
Duration: 6 months
Contact: Wilson Wang
Overview

Aligned with the creation of the Education Data Science academic pathways at the Graduate School of Education in 2022, GSE IT has renewed efforts to support faculty, students, and staff with tools and infrastructure for data storage and analysis.

Then & Now
History

Since 2012, Stanford and GSE IT have supported the use of MySQL and Windows SQL Servers hosted on campus to store research data. Recent advances in virtualization and cloud infrastructure have allowed GSE IT to explore newer alternatives. With the help of the Stanford Research Computing Center (SRCC) and University IT (UIT), researchers can now move their workflows from the Stanford campus to the cloud using offerings such as BigQuery and Colab on Google Cloud Platform.

Benefits of the cloud

Faculty researchers no longer need to worry about purchasing hardware or maintaining physical servers. Cloud-based, fully managed data warehouses offer instant scalability, high-performance querying, and a serverless architecture, billed on a pay-per-use basis. Cloud resources can be allocated instantaneously, so researchers can run their analyses with minimal startup cost. For example, Jupyter Notebook environments on Google Colab and Sherlock allow convenient analysis of data and provide a collaboration space for researchers.

Pricing

The price of storage depends on several factors, such as the size of the data stored and the frequency of access. Prices can be estimated using the Google Cloud Pricing Calculator. For example, storing 200 GB of data on BigQuery would cost $5 per month, and data analysis would cost an additional $5 per TB of data scanned by each query. A stand-alone cloud compute instance, e.g. for private Jupyter Notebook use, costs around $252 per month with an NVIDIA T4 GPU on Google Cloud. Prices are as of July 2023. Various startup grants at the University are available to cover cloud computing costs for faculty research groups.
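
To make the arithmetic concrete, the figures quoted above can be combined into a rough monthly estimate. The sketch below is illustrative only: the function name and the example workload are hypothetical, the rates are the July 2023 numbers from this page, and it assumes costs scale linearly with usage while ignoring free tiers, discounts, and network charges. The Google Cloud Pricing Calculator remains the place to get current numbers.

```python
# Rates taken from the text above (as of July 2023), not from a live price list.
STORAGE_USD_PER_GB_MONTH = 5 / 200  # $5 to store 200 GB for one month
QUERY_USD_PER_TB = 5.0              # $5 per TB of data scanned by queries
T4_VM_USD_PER_MONTH = 252.0         # stand-alone instance with an NVIDIA T4 GPU


def estimate_monthly_cost(stored_gb, scanned_tb, gpu_vms=0):
    """Return a rough monthly cost breakdown in USD.

    Assumption: every cost scales linearly with usage. Free tiers,
    discounts, and network charges are deliberately left out.
    """
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_tb * QUERY_USD_PER_TB
    compute = gpu_vms * T4_VM_USD_PER_MONTH
    return {
        "storage": storage,
        "queries": queries,
        "compute": compute,
        "total": storage + queries + compute,
    }


if __name__ == "__main__":
    # Hypothetical lab: 200 GB stored, 2 TB scanned by queries, one GPU VM.
    for item, usd in estimate_monthly_cost(200, 2, gpu_vms=1).items():
        print(f"{item:>8}: ${usd:,.2f}")
```

For the hypothetical workload in the example, storage comes to $5, queries to $10, and the GPU instance to $252, so compute dominates the bill. A lab that only stores and queries data would pay a small fraction of that.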

Outcome

GSE IT has completed successful implementations with three groups: Mitchell Stevens’ Pathways Lab, Dan McFarland’s MIMIR Lab, and Ben Domingue’s PACE Lab. All three groups have moved their data warehousing and data analysis infrastructure into the cloud. For high-risk data, the projects are created within SRCC’s Nero environment, with additional firewalls and IP restrictions in place to protect the data. For low- and medium-risk data, University IT provisions regular cloud accounts. Payments are processed through faculty PTA accounts.

GSE IT has also procured a dedicated Sherlock node for GSE community use as of September 2023.

Options for Cloud Compute/Data Analysis
  • Sherlock is a shared, high-performance computing cluster for the Stanford research community. In 2023, GSE IT procured a dedicated Sherlock node for GSE researchers to use, available until 2027. Access is granted through a faculty member or PI. Researchers may use Sherlock OnDemand to access interactive Jupyter Notebooks and schedule jobs.
  • Nero GCP allows faculty researchers with high-risk data to request instances of high-powered virtual machines for their data analysis and storage. Additional documentation on data safety must be provided.
  • Google Colab (Private VM runtime on Google Cloud): a private VM runtime can be created within each Google Cloud project and provisioned for private Colab use. This is distinct from the publicly available Google Colab and the Google Colab Pro subscription model. Researchers pay for their own custom GCE VM runtime and connect Colab to it.
  • Hugging Face is an online platform where you can build, train, and deploy models and pay hourly for the compute power used. Hugging Face can also be used to share models and datasets.