Case Studies

A Jupyter Notebook to Start (and Finish) Data Science Projects Faster

Jupyter notebooks on a laptop

"Time-to-delivery" is one of the most important yet unsung metrics in data science. It's tricky to track, but thinking about it conceptually can help us identify ways to (a) help data scientists work faster and (b) allow business stakeholders to get the information they need faster.

There are many ways you can reduce your analytics team's time to insight. Today, we'll focus on reducing the time to complete simple data science tasks.

The way we've approached it is by creating a Google Colab Quickstart Notebook. It provides a simple notebook setup and code snippets for common data science tasks. Specifically, we cover:

  • Notebook Setup

  • Importing Data

  • Profiling Data

  • Visualizing Data

  • Exporting Data

  • Resources

→ Open the Quickstart Notebook in Google Colab


1. Setting Up the Notebook

The snippets below are commonly used in the header of a Google Colab notebook — import your libraries, authenticate, and declare global variables here.

Authenticating Google Colab

This snippet connects the Colab instance to your Google profile, giving the instance the same authentication as your personal account.

# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()
# Provide your credentials to the runtime
from google.colab import auth
auth.authenticate_user()

Loading Colab Settings

You can customize your dataframe displays, use cell magic, make bash calls, and more with a simple set of API calls.

# Enable interactive dataframe display for large pandas dataframes
%load_ext google.colab.data_table

# To revert to classic dataframe view:
%unload_ext google.colab.data_table
# Enable interactive dataframe display for large pandas dataframes
%load_ext google.colab.data_table

# To revert to classic dataframe view:
%unload_ext google.colab.data_table
# Enable interactive dataframe display for large pandas dataframes
%load_ext google.colab.data_table

# To revert to classic dataframe view:
%unload_ext google.colab.data_table

Installing nbextensions

Notebook extensions (nbextensions) add helpful functionality to Jupyter Notebooks — particularly useful when trying to visualize HTML or interactive elements inside Colab.

!jupyter contrib nbextension install --user
!jupyter contrib nbextension install --user
!jupyter contrib nbextension install --user


2. Loading Data

Inputting and outputting data is one of the most common tasks for a data scientist, but setting up the connections can be time-consuming. The Quickstart Notebook includes snippets for:

  • From BigQuery → DataFrame

  • From DataFrame → BigQuery

  • From DataFrame → Google Sheets

  • From Colab → Local directory (upload and download)

  • From Colab → Google Cloud Storage

For more examples, see Google's official notebook: External data: Local Files, Drive, Sheets, and Cloud Storage.

From BigQuery to a DataFrame

In March 2023, the BigQuery Client API was updated to allow users to output a query directly as a DataFrame — removing the need for pandas-gbq.

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project=GOOGLE_PROJECT_ID)

sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""

# Execute query and load directly to a DataFrame
df_names = client.query(sql).to_dataframe()
df_names.head()
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project=GOOGLE_PROJECT_ID)

sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""

# Execute query and load directly to a DataFrame
df_names = client.query(sql).to_dataframe()
df_names.head()
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project=GOOGLE_PROJECT_ID)

sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""

# Execute query and load directly to a DataFrame
df_names = client.query(sql).to_dataframe()
df_names.head()

From a DataFrame to BigQuery

It's often helpful to write a DataFrame back to BigQuery — for example, recording model outputs back to your database. Use bigquery.LoadJobConfig to define schema details including field types, modes, and descriptions.

import pandas as pd
from google.cloud import bigquery

# Sample dataset available in Google Colab
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

# Specify the destination table ID
table_id = "my-project-id.test.california_housing_test"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(
            "households",
            bigquery.enums.SqlTypeNames.INTEGER,
            description="The number of households in the geographic area"
        ),
        bigquery.SchemaField(
            "median_income",
            bigquery.enums.SqlTypeNames.FLOAT,
            description="The log of the median income"
        ),
    ],
    # WRITE_TRUNCATE replaces the table; default behavior appends rows
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
    table.num_rows, len(table.schema), table_id
))
import pandas as pd
from google.cloud import bigquery

# Sample dataset available in Google Colab
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

# Specify the destination table ID
table_id = "my-project-id.test.california_housing_test"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(
            "households",
            bigquery.enums.SqlTypeNames.INTEGER,
            description="The number of households in the geographic area"
        ),
        bigquery.SchemaField(
            "median_income",
            bigquery.enums.SqlTypeNames.FLOAT,
            description="The log of the median income"
        ),
    ],
    # WRITE_TRUNCATE replaces the table; default behavior appends rows
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
    table.num_rows, len(table.schema), table_id
))
import pandas as pd
from google.cloud import bigquery

# Sample dataset available in Google Colab
df = pd.read_csv('/content/sample_data/california_housing_test.csv')

# Specify the destination table ID
table_id = "my-project-id.test.california_housing_test"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(
            "households",
            bigquery.enums.SqlTypeNames.INTEGER,
            description="The number of households in the geographic area"
        ),
        bigquery.SchemaField(
            "median_income",
            bigquery.enums.SqlTypeNames.FLOAT,
            description="The log of the median income"
        ),
    ],
    # WRITE_TRUNCATE replaces the table; default behavior appends rows
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
    table.num_rows, len(table.schema), table_id
))

Tip: Field descriptions defined in SchemaField are stored and visible in the BigQuery console — a small habit that pays dividends when others use your tables later.

From a DataFrame to Google Sheets

Loading results to Google Sheets is particularly valuable in a business context. Many stakeholders are comfortable analyzing data in Sheets or Excel but aren't comfortable with Python. Bridge that gap by completing the heavy lifting in Python and surfacing results in Sheets.

import gspread
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)
import gspread
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)
import gspread
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)

Get the spreadsheet object by ID, URL, or name:

# By ID
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)

# By URL
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)

# By name
spreadsheet = gc.open('My Analysis Spreadsheet')
# By ID
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)

# By URL
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)

# By name
spreadsheet = gc.open('My Analysis Spreadsheet')
# By ID
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)

# By URL
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)

# By name
spreadsheet = gc.open('My Analysis Spreadsheet')

Then select a worksheet and load it into a DataFrame:

# Select worksheet by index (0-based) or by name
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')

# Load all rows to a DataFrame
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])
# Select worksheet by index (0-based) or by name
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')

# Load all rows to a DataFrame
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])
# Select worksheet by index (0-based) or by name
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')

# Load all rows to a DataFrame
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])

Uploading Files to Colab from Your Local Computer

This generates a pop-up that lets you browse and select files to upload. Once uploaded, the file will appear in the /content/ folder.

from google.colab import files

uploaded = files.upload()
from google.colab import files

uploaded = files.upload()
from google.colab import files

uploaded = files.upload()

Downloading Files from Colab to Your Local Computer

The inverse operation — download any file from the Colab instance to your machine:

from google.colab import files

files.download('/content/sample_data/california_housing_test.csv')
from google.colab import files

files.download('/content/sample_data/california_housing_test.csv')
from google.colab import files

files.download('/content/sample_data/california_housing_test.csv')

From Colab to Google Cloud Storage

Google Cloud Storage (GCS) is useful for persisting unstructured data (text, images, audio) on a cloud server. It saves local disk space and makes collaboration easier through a centralized data source. You can interact with GCS using gsutil or the Google Python API.

Step 1: Authenticate and create a bucket

from google.colab import auth
auth.authenticate_user()

project_id = 'my-google-project-id'

import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
from google.colab import auth
auth.authenticate_user()

project_id = 'my-google-project-id'

import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
from google.colab import auth
auth.authenticate_user()

project_id = 'my-google-project-id'

import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)

Step 2: Upload files using gsutil

# Set the project
!gcloud config set project {project_id}

# Create the bucket
!gsutil mb gs://{bucket_name}

# Upload a file
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/

# Verify the upload
!gsutil cat

# Set the project
!gcloud config set project {project_id}

# Create the bucket
!gsutil mb gs://{bucket_name}

# Upload a file
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/

# Verify the upload
!gsutil cat

# Set the project
!gcloud config set project {project_id}

# Create the bucket
!gsutil mb gs://{bucket_name}

# Upload a file
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/

# Verify the upload
!gsutil cat

Step 3: Download files from GCS

!gsutil cp
!gsutil cp
!gsutil cp


3. Profiling Data

Once data is loaded, you need a high-level understanding of its shape, quality, and quirks — accounting for missing, inaccurate, or outlying values. This is where data profiling comes in.

A "one size fits all" approach to EDA rarely works well. That said, the following libraries are genuinely useful for quickly building a base understanding of any dataset:

  • ydata-profiling (formerly pandas-profiling)

  • sweetviz

  • AutoViz

ydata-profiling (formerly pandas-profiling)

Note (2024 update): pandas-profiling has been renamed to ydata-profiling. Update your install accordingly.

# Install ydata-profiling (the modern replacement for pandas-profiling)
!pip install ydata-profiling

# Enable widget display
!jupyter nbextension enable --py widgetsnbextension

import pandas as pd
from ydata_profiling import ProfileReport

# Generate profiling report
profile = ProfileReport(
    df_ohlc,
    title="SPY Data",
    html={'style': {'full_width': True}}
)

# Save to file
profile.to_file(output_file="profiling_report.html")

# Display inside Colab
import IPython
IPython.display.HTML(filename='profiling_report.html')
# Install ydata-profiling (the modern replacement for pandas-profiling)
!pip install ydata-profiling

# Enable widget display
!jupyter nbextension enable --py widgetsnbextension

import pandas as pd
from ydata_profiling import ProfileReport

# Generate profiling report
profile = ProfileReport(
    df_ohlc,
    title="SPY Data",
    html={'style': {'full_width': True}}
)

# Save to file
profile.to_file(output_file="profiling_report.html")

# Display inside Colab
import IPython
IPython.display.HTML(filename='profiling_report.html')
# Install ydata-profiling (the modern replacement for pandas-profiling)
!pip install ydata-profiling

# Enable widget display
!jupyter nbextension enable --py widgetsnbextension

import pandas as pd
from ydata_profiling import ProfileReport

# Generate profiling report
profile = ProfileReport(
    df_ohlc,
    title="SPY Data",
    html={'style': {'full_width': True}}
)

# Save to file
profile.to_file(output_file="profiling_report.html")

# Display inside Colab
import IPython
IPython.display.HTML(filename='profiling_report.html')

sweetviz

Sweetviz generates comprehensive HTML reports covering structure, distributions, and relationships within a dataset. It's especially powerful for comparing two datasets (e.g., train vs. test splits).

!pip install sweetviz

import sweetviz as sv

# Basic analysis
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')

import IPython
IPython.display.HTML(filename='sweetviz_report.html')
!pip install sweetviz

import sweetviz as sv

# Basic analysis
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')

import IPython
IPython.display.HTML(filename='sweetviz_report.html')
!pip install sweetviz

import sweetviz as sv

# Basic analysis
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')

import IPython
IPython.display.HTML(filename='sweetviz_report.html')

Comparing two datasets (useful for train/test split analysis or A/B comparisons):

import pandas_datareader.data as web

df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')

import sweetviz as sv

# Compare two datasets — name them and specify an optional target variable
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')

# Train/test split example:
# my_report = sv.compare([df_train, 'Train'], [df_test, 'Test'], 'target_var')

import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')
import pandas_datareader.data as web

df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')

import sweetviz as sv

# Compare two datasets — name them and specify an optional target variable
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')

# Train/test split example:
# my_report = sv.compare([df_train, 'Train'], [df_test, 'Test'], 'target_var')

import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')
import pandas_datareader.data as web

df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')

import sweetviz as sv

# Compare two datasets — name them and specify an optional target variable
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')

# Train/test split example:
# my_report = sv.compare([df_train, 'Train'], [df_test, 'Test'], 'target_var')

import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')


4. Visualizing Data

One of the main challenges with plotting in Jupyter is that customizing charts takes a lot of code and time. Creating interactive charts is even harder.

We focus on Pareto efficiencies here: Colab's Chart Suggestions feature is our go-to for quick plots. For more sophisticated visualizations, the following notebooks are excellent starting points:

Colab Chart Suggestions

Colab Chart Suggestions is a fast way to build charts. It uses matplotlib and seaborn under the hood but abstracts away the complex configuration. The workflow:

  1. Create a Pandas DataFrame

  2. Click the Chart Suggestions icon (chart icon in the DataFrame output)

  3. Select the chart type you want

  4. Click Accept — Colab generates a ready-to-run code block

This alone can save 20–30 minutes per analysis session.


5. Resources

Using templated code lets your analytics team start and finish work faster across the board. Below are the resources referenced in this article and the Quickstart Notebook.

Importing & Exporting Data

Profiling Data

Visualizing Data

Useful Notebooks


Wrapping Up

The modules above form a foundation for quickly starting any data science project. From there, we layer on method-specific templates depending on the analysis at hand — whether that's a statistical model, a time series forecast, or an ML pipeline.

The principle is simple: the less time your team spends on setup, the more time they spend on insight.

→ Get the Quickstart Notebook on Google Colab



The Data Strategist helps startups and scaleups become data-driven. Get a data scientist on-demand, or advice on building your analytical data stack.