"Time-to-delivery" is one of the most important yet unsung metrics in data science. It's tricky to track, but thinking about it conceptually can help us identify ways to (a) help data scientists work faster and (b) allow business stakeholders to get the information they need faster.
There are many ways you can reduce your analytics team's time to insight. Today, we'll focus on reducing the time to complete simple data science tasks.
The way we've approached it is by creating a Google Colab Quickstart Notebook. It provides a simple notebook setup and code snippets for common data science tasks. Specifically, we cover:
Notebook Setup
Importing Data
Profiling Data
Visualizing Data
Exporting Data
Resources
→ Open the Quickstart Notebook in Google Colab
1. Setting Up the Notebook
The snippets below are commonly used in the header of a Google Colab notebook — import your libraries, authenticate, and declare global variables here.
Authenticating Google Colab
This snippet connects the Colab instance to your Google profile, giving the instance the same authentication as your personal account.
from google.colab import auth
auth.authenticate_user()
from google.colab import auth
auth.authenticate_user()
from google.colab import auth
auth.authenticate_user()
Loading Colab Settings
You can customize your dataframe displays, use cell magic, make bash calls, and more with a simple set of API calls.
%load_ext google.colab.data_table
%unload_ext google.colab.data_table
%load_ext google.colab.data_table
%unload_ext google.colab.data_table
%load_ext google.colab.data_table
%unload_ext google.colab.data_table
Installing nbextensions
Notebook extensions (nbextensions) add helpful functionality to Jupyter Notebooks — particularly useful when trying to visualize HTML or interactive elements inside Colab.
!jupyter contrib nbextension install --user
!jupyter contrib nbextension install --user
!jupyter contrib nbextension install --user
2. Loading Data
Inputting and outputting data is one of the most common tasks for a data scientist, but setting up the connections can be time-consuming. The Quickstart Notebook includes snippets for:
From BigQuery → DataFrame
From DataFrame → BigQuery
From DataFrame → Google Sheets
From Colab → Local directory (upload and download)
From Colab → Google Cloud Storage
For more examples, see Google's official notebook: External data: Local Files, Drive, Sheets, and Cloud Storage.
From BigQuery to a DataFrame
In March 2023, the BigQuery Client API was updated to allow users to output a query directly as a DataFrame — removing the need for pandas-gbq.
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client(project=GOOGLE_PROJECT_ID)
sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""
df_names = client.query(sql).to_dataframe()
df_names.head()
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client(project=GOOGLE_PROJECT_ID)
sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""
df_names = client.query(sql).to_dataframe()
df_names.head()
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client(project=GOOGLE_PROJECT_ID)
sql = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY count DESC
LIMIT 10
"""
df_names = client.query(sql).to_dataframe()
df_names.head()
From a DataFrame to BigQuery
It's often helpful to write a DataFrame back to BigQuery — for example, recording model outputs back to your database. Use bigquery.LoadJobConfig to define schema details including field types, modes, and descriptions.
import pandas as pd
from google.cloud import bigquery
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
table_id = "my-project-id.test.california_housing_test"
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField(
"households",
bigquery.enums.SqlTypeNames.INTEGER,
description="The number of households in the geographic area"
),
bigquery.SchemaField(
"median_income",
bigquery.enums.SqlTypeNames.FLOAT,
description="The log of the median income"
),
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
))import pandas as pd
from google.cloud import bigquery
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
table_id = "my-project-id.test.california_housing_test"
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField(
"households",
bigquery.enums.SqlTypeNames.INTEGER,
description="The number of households in the geographic area"
),
bigquery.SchemaField(
"median_income",
bigquery.enums.SqlTypeNames.FLOAT,
description="The log of the median income"
),
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
))import pandas as pd
from google.cloud import bigquery
df = pd.read_csv('/content/sample_data/california_housing_test.csv')
table_id = "my-project-id.test.california_housing_test"
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField(
"households",
bigquery.enums.SqlTypeNames.INTEGER,
description="The number of households in the geographic area"
),
bigquery.SchemaField(
"median_income",
bigquery.enums.SqlTypeNames.FLOAT,
description="The log of the median income"
),
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(
table.num_rows, len(table.schema), table_id
))Tip: Field descriptions defined in SchemaField are stored and visible in the BigQuery console — a small habit that pays dividends when others use your tables later.
From a DataFrame to Google Sheets
Loading results to Google Sheets is particularly valuable in a business context. Many stakeholders are comfortable analyzing data in Sheets or Excel but aren't comfortable with Python. Bridge that gap by completing the heavy lifting in Python and surfacing results in Sheets.
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
import gspread
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
Get the spreadsheet object by ID, URL, or name:
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)
spreadsheet = gc.open('My Analysis Spreadsheet')
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)
spreadsheet = gc.open('My Analysis Spreadsheet')
spreadsheet_id = '1VlIdq73MUBPSJdVcpZMGD1Dj-VuksY_CNH3r5DIe0cs'
spreadsheet = gc.open_by_key(spreadsheet_id)
spreadsheet_url = "https://docs.google.com/spreadsheets/d/{}".format(spreadsheet_id)
spreadsheet = gc.open_by_url(spreadsheet_url)
spreadsheet = gc.open('My Analysis Spreadsheet')Then select a worksheet and load it into a DataFrame:
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])
worksheet = spreadsheet.get_worksheet(3)
worksheet = spreadsheet.worksheet('product_information')
rows = worksheet.get_all_values()
df_products = pd.DataFrame(rows[1:], columns=rows[0])Uploading Files to Colab from Your Local Computer
This generates a pop-up that lets you browse and select files to upload. Once uploaded, the file will appear in the /content/ folder.
from google.colab import files
uploaded = files.upload()
from google.colab import files
uploaded = files.upload()
from google.colab import files
uploaded = files.upload()
Downloading Files from Colab to Your Local Computer
The inverse operation — download any file from the Colab instance to your machine:
from google.colab import files
files.download('/content/sample_data/california_housing_test.csv')from google.colab import files
files.download('/content/sample_data/california_housing_test.csv')from google.colab import files
files.download('/content/sample_data/california_housing_test.csv')From Colab to Google Cloud Storage
Google Cloud Storage (GCS) is useful for persisting unstructured data (text, images, audio) on a cloud server. It saves local disk space and makes collaboration easier through a centralized data source. You can interact with GCS using gsutil or the Google Python API.
Step 1: Authenticate and create a bucket
from google.colab import auth
auth.authenticate_user()
project_id = 'my-google-project-id'
import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
from google.colab import auth
auth.authenticate_user()
project_id = 'my-google-project-id'
import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
from google.colab import auth
auth.authenticate_user()
project_id = 'my-google-project-id'
import uuid
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())
print(bucket_name)
Step 2: Upload files using gsutil
!gcloud config set project {project_id}
!gsutil mb gs://{bucket_name}
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/
!gsutil cat
!gcloud config set project {project_id}
!gsutil mb gs://{bucket_name}
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/
!gsutil cat
!gcloud config set project {project_id}
!gsutil mb gs://{bucket_name}
!gsutil cp /content/sample_data/mnist_test.csv gs://{bucket_name}/
!gsutil cat
Step 3: Download files from GCS
3. Profiling Data
Once data is loaded, you need a high-level understanding of its shape, quality, and quirks — accounting for missing, inaccurate, or outlying values. This is where data profiling comes in.
A "one size fits all" approach to EDA rarely works well. That said, the following libraries are genuinely useful for quickly building a base understanding of any dataset:
ydata-profiling (formerly pandas-profiling)
sweetviz
AutoViz
ydata-profiling (formerly pandas-profiling)
Note (2024 update): pandas-profiling has been renamed to ydata-profiling. Update your install accordingly.
!pip install ydata-profiling
!jupyter nbextension enable --py widgetsnbextension
import pandas as pd
from ydata_profiling import ProfileReport
profile = ProfileReport(
df_ohlc,
title="SPY Data",
html={'style': {'full_width': True}}
)
profile.to_file(output_file="profiling_report.html")
import IPython
IPython.display.HTML(filename='profiling_report.html')
!pip install ydata-profiling
!jupyter nbextension enable --py widgetsnbextension
import pandas as pd
from ydata_profiling import ProfileReport
profile = ProfileReport(
df_ohlc,
title="SPY Data",
html={'style': {'full_width': True}}
)
profile.to_file(output_file="profiling_report.html")
import IPython
IPython.display.HTML(filename='profiling_report.html')
!pip install ydata-profiling
!jupyter nbextension enable --py widgetsnbextension
import pandas as pd
from ydata_profiling import ProfileReport
profile = ProfileReport(
df_ohlc,
title="SPY Data",
html={'style': {'full_width': True}}
)
profile.to_file(output_file="profiling_report.html")
import IPython
IPython.display.HTML(filename='profiling_report.html')sweetviz
Sweetviz generates comprehensive HTML reports covering structure, distributions, and relationships within a dataset. It's especially powerful for comparing two datasets (e.g., train vs. test splits).
!pip install sweetviz
import sweetviz as sv
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')
import IPython
IPython.display.HTML(filename='sweetviz_report.html')!pip install sweetviz
import sweetviz as sv
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')
import IPython
IPython.display.HTML(filename='sweetviz_report.html')!pip install sweetviz
import sweetviz as sv
my_report = sv.analyze(df_ohlc)
my_report.show_html('sweetviz_report.html')
import IPython
IPython.display.HTML(filename='sweetviz_report.html')Comparing two datasets (useful for train/test split analysis or A/B comparisons):
import pandas_datareader.data as web
df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')
import sweetviz as sv
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')
import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')import pandas_datareader.data as web
df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')
import sweetviz as sv
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')
import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')import pandas_datareader.data as web
df_aapl = web.DataReader('AAPL', 'stooq', start='01/01/2023', end='06/01/2023')
df_amzn = web.DataReader('AMZN', 'stooq', start='01/01/2023', end='06/01/2023')
import sweetviz as sv
my_report = sv.compare([df_aapl, 'Apple'], [df_amzn, 'Amazon'], 'Close')
my_report.show_html(filepath='sweetviz_comparison_report.html')
import IPython
IPython.display.HTML(filename='/content/sweetviz_comparison_report.html')
4. Visualizing Data
One of the main challenges with plotting in Jupyter is that customizing charts takes a lot of code and time. Creating interactive charts is even harder.
We focus on Pareto efficiencies here: Colab's Chart Suggestions feature is our go-to for quick plots. For more sophisticated visualizations, the following notebooks are excellent starting points:
Colab Chart Suggestions
Colab Chart Suggestions is a fast way to build charts. It uses matplotlib and seaborn under the hood but abstracts away the complex configuration. The workflow:
Create a Pandas DataFrame
Click the Chart Suggestions icon (chart icon in the DataFrame output)
Select the chart type you want
Click Accept — Colab generates a ready-to-run code block
This alone can save 20–30 minutes per analysis session.
5. Resources
Using templated code lets your analytics team start and finish work faster across the board. Below are the resources referenced in this article and the Quickstart Notebook.
Importing & Exporting Data
Profiling Data
Visualizing Data
Useful Notebooks
Wrapping Up
The modules above form a foundation for quickly starting any data science project. From there, we layer on method-specific templates depending on the analysis at hand — whether that's a statistical model, a time series forecast, or an ML pipeline.
The principle is simple: the less time your team spends on setup, the more time they spend on insight.
→ Get the Quickstart Notebook on Google Colab
The Data Strategist helps startups and scaleups become data-driven. Get a data scientist on-demand, or advice on building your analytical data stack.