The SciPy Stack

This seems like a pretty good place to start my journey to becoming a deep learning practitioner. As you will see, this stack is a core part of the typical workflow in this space. Please consider this a work in progress.

The SciPy Stack is at the core of a lot of work in the scientific computing ecosystem. It consists of the following packages:

  • NumPy
  • SciPy Library
  • Matplotlib
  • pandas
  • SymPy

This post will be updated over time with more information and resources on each of the components. Come learn with me.

Last update: April 5, 2020

What’s it for

According to the SciPy website:

The SciPy ecosystem includes general and specialised tools for data management and computation, productive experimentation and high-performance computing.

Scipy.org

The SciPy Stack is a dependency for many different packages used in the Machine Learning and Data Science disciplines. While it may not be necessary to learn the ins and outs of the SciPy Stack, it is very helpful to have at least a cursory understanding of what it provides.

I’ll detail that next.

Components

NumPy

NumPy is the base package upon which all of this is built. It provides data structures, functions and routines for highly optimized operations on those data structures.

The core object that NumPy provides is the ndarray, an n-dimensional array. Unlike Python lists, ndarrays have a fixed size once created and cannot grow dynamically, and all elements must be of the same type.

Its speed comes from a pre-compiled C back-end.
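Here is a minimal sketch of those ndarray properties. The variable names are just ones I picked for illustration.

```python
import numpy as np

# Every element of an ndarray shares a single dtype.
a = np.array([1, 2, 3, 4])
print(a.dtype)  # an integer dtype; the exact width depends on the platform
print(a.shape)  # (4,)

# Mixing ints and floats upcasts the whole array to one float dtype.
b = np.array([1, 2.5, 3])
print(b.dtype)  # float64

# Element-wise arithmetic runs in compiled code, not in a Python loop.
squares = a * a

# Arrays do not grow in place: np.append returns a new, larger array
# and leaves the original untouched.
c = np.append(a, 5)
print(a.shape, c.shape)  # (4,) (5,)
```

The last two lines are the part that surprised me coming from Python lists: "appending" always means building a new array.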

SciPy Library

The SciPy Library itself provides functions for numerical integration, interpolation, optimization, linear algebra, and statistics. It is a collection of mathematical algorithms and functions built on top of the NumPy library.
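To get a feel for it, here is a small sketch that touches two of those areas, numerical integration and optimization. The functions being integrated and minimized are toy examples I chose because the exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi. The exact answer is 2.
# quad returns the estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)
print(area)

# Minimize the toy function f(x) = (x - 3)^2, whose minimum is at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)
print(result.x)
```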

When used interactively, your Python session becomes a data processing environment on par with R.

Documentation for both NumPy and SciPy can be found at https://docs.scipy.org/.

Matplotlib

From the Matplotlib website:

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.

matplotlib.org

It can be used to create a plethora of different types of visualization from simple bar and line graphs to scatter plots and contour plots, even animations and 3D plots.
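As a rough sketch, this is how a line plot and a bar chart can sit side by side in one figure. The data is made up, and I select the `Agg` backend only so the script also runs on a machine without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; omit this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# One figure with two side-by-side axes.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_title("Line plot")
ax_line.legend()

# Made-up categories and values, purely for illustration.
ax_bar.bar(["a", "b", "c"], [3, 7, 5])
ax_bar.set_title("Bar chart")

fig.tight_layout()

# Save to an in-memory buffer; pass a filename such as "plot.png"
# to write the image to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook or interactive session you would normally skip the backend line and the buffer and just call `plt.show()`.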

There is a strong and inclusive community built around this library as well.

pandas

pandas is a critical library for data science and machine learning. It provides a lot of useful features such as:

  • DataFrame objects that enable easier data manipulation
  • Tools to read and write data between in-memory data structures and a variety of file formats (CSV, Excel, SQL databases, and HDF5)
  • Tools to align data and handle missing data
  • Transformation of data sets
  • Slicing, indexing and sub-setting of data sets
  • Merging and joining of data sets
  • Efficient handling of time series data
  • High performance, with critical code paths written in Cython or C

Documentation can be found at https://pandas.pydata.org/docs/.
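To make a few of the features listed above concrete, here is a minimal sketch that reads CSV data, handles a missing value, filters rows, and merges two tables. The city and temperature data is invented for the example.

```python
import io

import pandas as pd

# Made-up CSV data; the 2020 reading for Oslo is deliberately missing.
csv_text = """city,year,temp_c
Oslo,2019,5.5
Oslo,2020,
Lima,2019,19.0
Lima,2020,19.5
"""

# Read CSV text into a DataFrame (a file path works the same way).
temps = pd.read_csv(io.StringIO(csv_text))

# Handle missing data: count the gaps, then drop incomplete rows.
missing = int(temps["temp_c"].isna().sum())
complete = temps.dropna(subset=["temp_c"])

# Sub-setting with a boolean mask.
lima = complete[complete["city"] == "Lima"]

# Merge with a second, equally made-up table on the shared "city" column.
countries = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "country": ["Norway", "Peru"]}
)
merged = complete.merge(countries, on="city", how="left")

# Transform: average temperature per country.
mean_by_country = merged.groupby("country")["temp_c"].mean()
print(mean_by_country)
```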

SymPy

While SymPy is not necessarily used in the machine learning space, it is included here because it is identified as a core component of the SciPy Stack.

It aims to become a full-featured computer algebra system. It’s written completely in Python.
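Here is a short sketch of what "computer algebra" means in practice: SymPy works with symbols rather than numbers. The expression is an arbitrary one I picked.

```python
import sympy as sp

# Declare a symbol, then build an expression from it.
x = sp.symbols("x")
expr = x**2 + 2*x + 1

factored = sp.factor(expr)                   # (x + 1)**2
derivative = sp.diff(expr, x)                # 2*x + 2
roots = sp.solve(sp.Eq(expr, 0), x)          # [-1]
antiderivative = sp.integrate(sp.sin(x), x)  # -cos(x)

print(factored, derivative, roots, antiderivative)
```

The results are exact symbolic expressions, not floating-point approximations, which is the main difference from the numerical routines in NumPy and SciPy.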

Documentation and tutorials can be found at https://www.sympy.org/en/index.html.

Tying it all together

This post is a very high-level overview of the SciPy Stack. As I learn more about these libraries, I will add more resources and hopefully show how each of these pieces supports data science and machine learning in my own personal projects.

If you have something that you feel would be of interest to me in this process, please leave a comment below and I’ll be sure to incorporate it here.