When starting a new data science project, it is good practice to implement a few simple data readiness steps. In this context, data readiness refers to the acquisition, light preprocessing, and storage of the data that will be used for the project.

A good process can be automated and should provide traceability, which helps when quality issues or questions about provenance arise. It should also give all team members a consistent view of the data sources.

The following is a simple standard for data readiness: a directory structure for storing data and logic (i.e. scripts), plus conventions for the workflow. The workflow follows a general extract, transform, load (ETL) pattern. It is intended for small-scale projects with static data requirements; projects that require dynamic or near real-time data should use different methods.

Directory Structure

The following directory structure is used to store logic and data. To make things easier, I wrote a Python script to generate this skeleton template; a sketch of such a script follows the list.

  • etl\extract: Scripts to move src files to stage/src
  • etl\load: Scripts to load tables from stage/tbl to the production system
  • etl\local: Local storage for source files obtained manually
  • etl\stage\src: Landing zone for raw source files
  • etl\stage\tbl: Landing zone for final data tables
  • etl\stage\tmp: Landing zone for intermediate processed data tables
  • etl\transform\build: Scripts to build final tables (stage/tbl)
  • etl\transform\prep: Scripts to preprocess src files (stage/tmp)
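
A minimal sketch of such a generator might look like this. The script name (gen_skeleton.py) and the .gitkeep placeholder for keeping empty directories under version control are my own choices, not part of the standard.

    # gen_skeleton.py - create the ETL directory skeleton (illustrative sketch)
    from pathlib import Path

    DIRS = [
        "etl/extract",
        "etl/load",
        "etl/local",
        "etl/stage/src",
        "etl/stage/tbl",
        "etl/stage/tmp",
        "etl/transform/build",
        "etl/transform/prep",
    ]

    def make_skeleton(root="."):
        for d in DIRS:
            path = Path(root, d)
            path.mkdir(parents=True, exist_ok=True)
            # touch a placeholder so empty directories can be committed
            (path / ".gitkeep").touch()

    if __name__ == "__main__":
        make_skeleton()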

Workflow

The workflow is listed in order of task execution [extract, transform-prep, transform-build, load]; a sketch of a driver script that runs the tasks in this order follows the tables below. Dependencies should exist between tasks, but not within a task. For multi-person teams, work can be divided by task and/or by system. For example, with systems {x, y} and staff {a, b, c}:

Table 1: divided by system

  System  E  T  L
  x       a  a  a
  y       b  b  b

Table 2: divided by task

  System  E  T  L
  x       a  b  c
  y       a  b  c
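
As an illustration of the task sequence, a small driver can run each task's scripts in dependency order. The sketch below is an assumption on my part (the name run_etl.py is not part of the standard); it relies on the rule that scripts within a task are independent of each other.

    # run_etl.py - run all ETL scripts in task order (illustrative sketch)
    import subprocess
    import sys
    from pathlib import Path

    # tasks run in sequence; scripts within a task have no dependencies on each other
    TASKS = ["etl/extract", "etl/transform/prep", "etl/transform/build", "etl/load"]

    def run_task(task_dir):
        for script in sorted(Path(task_dir).glob("*.py")):
            print(f"running {script}")
            subprocess.run([sys.executable, str(script)], check=True)

    if __name__ == "__main__":
        for task in TASKS:
            run_task(task)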

Data

  • The stage is a transient area used to store files transferred from external systems, intermediate tables, and final tables
  • The stage is designed to be deleted or overwritten as ETL scripts re-create the content at runtime
  • Source files obtained manually (e.g. email attachment) should be stored in the etl\local directory
  • The etl\local directory should not be deleted at runtime

Extract

Script Repository: etl\extract
Data Source: external or etl\local
Data Destination: etl\stage\src
  • An extract script transfers source files from an external source or the etl\local directory to the etl\stage\src directory on the local machine
  • Generally, one extract script is created for each system or file
  • Name scripts after the external system, or after the system and file (e.g. uci.py or uci_iris.py), as sketched below
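
For example, an extract script for the UCI iris data might look like the following sketch; the URL and destination file name are illustrative assumptions.

    # etl/extract/uci_iris.py - copy the iris source file to stage/src (sketch)
    import urllib.request
    from pathlib import Path

    SRC_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"  # illustrative
    DEST = Path("etl/stage/src/iris.csv")

    def extract():
        DEST.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(SRC_URL, str(DEST))

    if __name__ == "__main__":
        extract()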

Transform

  • Transform scripts prepare intermediate tables and build final tables

Transform - Prep

Script Repository: etl\transform\prep
Data Source: etl\stage\src
Data Destination: etl\stage\tmp
  • A prep script performs common pre-build steps on a source file such as format conversion and column name standardization
  • Prep scripts are useful to limit repetition in build scripts (e.g. when a single source file is used to build multiple tables)
  • Name scripts after the source file (e.g. iris.py), as sketched below
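
A prep script for the iris source file might standardize column names and write an intermediate table. The column names below are assumptions about the raw file, and pandas is used for convenience.

    # etl/transform/prep/iris.py - standardize the raw iris file (sketch)
    import pandas as pd

    SRC = "etl/stage/src/iris.csv"
    DEST = "etl/stage/tmp/iris.csv"

    # the raw file has no header row; assign standardized column names
    COLS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

    def prep():
        df = pd.read_csv(SRC, header=None, names=COLS)
        df.to_csv(DEST, index=False)

    if __name__ == "__main__":
        prep()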

Transform - Build

Script Repository: etl\transform\build
Data Source: etl\stage\src or etl\stage\tmp
Data Destination: etl\stage\tbl
  • A build script creates the final table and contains steps such as joins, reshaping, conversion, and casting
  • A single build script should exist for each table
  • Name scripts after the final table name (e.g. iris.py), as sketched below
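
A build script might look like the sketch below. The final table (iris_summary) and the aggregation it performs are examples only, chosen to show a typical reshape step.

    # etl/transform/build/iris_summary.py - build a final table in stage/tbl (sketch)
    import pandas as pd

    SRC = "etl/stage/tmp/iris.csv"
    DEST = "etl/stage/tbl/iris_summary.csv"

    def build():
        df = pd.read_csv(SRC)
        # example build step: aggregate measurements per species
        out = (
            df.groupby("species", as_index=False)
              .agg(mean_sepal_length=("sepal_length", "mean"),
                   mean_petal_length=("petal_length", "mean"))
        )
        out.to_csv(DEST, index=False)

    if __name__ == "__main__":
        build()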

Load

Script Repository: etl\load
Data Source: etl\stage\tbl
Data Destination: data repository
  • A load script transfers final tables from etl\stage\tbl to the project data repository, ideally at a location all team members can access, as sketched below
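
The sketch below assumes a SQLite file on a shared path as the project data repository; the path and table name are illustrative.

    # etl/load/iris_summary.py - load a final table into the data repository (sketch)
    import sqlite3
    import pandas as pd

    SRC = "etl/stage/tbl/iris_summary.csv"
    DB = "/shared/project/data.db"  # assumed team-accessible location

    def load():
        df = pd.read_csv(SRC)
        with sqlite3.connect(DB) as con:
            df.to_sql("iris_summary", con, if_exists="replace", index=False)

    if __name__ == "__main__":
        load()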