This is an in-depth introduction to pandas fundamentals for those of you who are getting started with machine learning. Most online courses on machine learning do not list these topics as a prerequisite. But believe me, it becomes quite challenging to follow those courses if you are not clear on these fundamentals.
I will walk you through the pandas fundamentals the way I wish I had been taught when I got started.
By the end of this series, you will have a solid understanding of pandas fundamentals and how to use them to supercharge your machine learning journey.
Alright. Let’s get started with two questions that any machine learning beginner might have.
What is Pandas and why should I even care to learn it?
pandas is an open-source library, written by Wes McKinney, that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This is why pandas is the number one choice for data pre-processing in the machine learning community.
The first step in any machine learning workflow is to prepare your dataset for processing.
The bulk of your work as a machine learning engineer goes into preparing the data (load, cleanse, transform) before you apply any machine learning to it.
Pandas is one of the most popular tools for this, which is why most machine learning courses make heavy use of it for processing data.
Is pandas fast enough? Why not use numpy or pyspark?
This is a fair question from anyone who has been in the Python ecosystem for a while. Let’s try to demystify it.
Yes. Pandas is very fast. It is fast enough to get your job done unless your datasets run into tens or hundreds of gigabytes. Pandas can comfortably handle a few gigabytes of data wrangling on your laptop. The only things that will limit you are the RAM and processing power of your computer.
Now, let’s take the case of numpy. Yes, numpy can be faster than pandas. In fact, most of pandas is built on top of numpy.
Duhh! Then why not use numpy?
The truth is that numpy and pandas are aimed at different objectives. Numpy is a high-performance numerical library that brings the power of linear algebra to Python. This means you can use vectors and matrices as native objects in your Python code. They are much faster than native lists for numerical work, and they come with additional functionality.
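To make this concrete, here is a minimal sketch of vectors and matrices used as ordinary objects. The variable names and numbers are made up purely for illustration.

```python
import numpy as np

# Two small vectors (illustrative values, not real data)
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1.0, 2.0, 3.0])

# Element-wise multiplication happens in one expression, no loop needed
totals = prices * quantities

# Dot product of the two vectors
grand_total = prices @ quantities

# A 2x2 matrix multiplied by a vector
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector

# The same element-wise multiplication with plain Python lists needs an explicit loop
totals_list = [p * q for p, q in zip([10.0, 20.0, 30.0], [1.0, 2.0, 3.0])]

print(totals)
print(grand_total)
print(product)
```

With plain lists you have to write the loop yourself, while numpy applies the operation to the whole array at once.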
On the other hand, the pandas library is designed to be your right hand when working with data. It therefore provides functionality such as handling missing data and importing data into a format suited to data analysis. It also offers statistical tools to help you understand or clean your data.
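As a rough sketch of what handling missing data looks like in practice, the snippet below uses a small made-up column of ages with two missing values. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# A made-up column of ages where two values are missing (NaN)
ages = pd.Series([25.0, np.nan, 31.0, np.nan, 40.0])

# Count the missing values
missing_count = ages.isna().sum()

# Statistics skip missing values by default
average_age = ages.mean()

# Two common ways to clean: fill the gaps, or drop them
filled = ages.fillna(average_age)
dropped = ages.dropna()

print(missing_count)
print(average_age)
print(filled)
print(dropped)

# A quick statistical summary of the column
print(ages.describe())
```

Whether you fill or drop missing values depends on your dataset; this only shows that both are one-liners in pandas.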
Here is a benchmark between numpy and pandas for those of you who are interested in the details. For 15 million records of the iris dataset (source):
- numpy consumes roughly 1/3 less memory than pandas
- numpy generally performs better than pandas for 50K rows or less
- pandas generally performs better than numpy for 500K rows or more
- for 50K to 500K rows, it is a toss-up between pandas and numpy depending on the kind of operation
The rule of thumb: Use NumPy if you are solving numerical problems. Use Pandas if you are working on analysing and cleansing data.
Now let's talk about pyspark. Spark and PySpark are aimed at processing big data. Spark shines at processing and running computations on large data files that run into gigabytes or even terabytes in size. According to the benchmarks here, Spark is not recommended for any dataset with fewer than 10 million rows (<5 GB file). Of course, this depends on the type of data and the computations you want to run on it.
Okay. Now that you are clear on why to use pandas, let’s dig deep into pandas data structures.
Intro to pandas data structures
We will cover these topics in the first part of this post
- Pandas Series
I recommend that you try out these exercises in a Google Colab notebook. Go for it. Trying it yourself is the best way to understand any subject.
- A pandas series is a one-dimensional array with labels, capable of holding data of any type (integer, string, float, Python objects, etc.). Let's now create a simple pandas series. Think of it as a Python list whose elements each carry a label, and which can also have a name.
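Here is a minimal sketch of creating a series from a plain Python list. The values and the variable name are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Build a series from an ordinary Python list (illustrative values)
numbers = pd.Series([10, 20, 30])

# Printing shows the labels on the left and the values on the right
print(numbers)

# The values and the labels can be pulled out separately
print(numbers.tolist())
print(numbers.index.tolist())
```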
Notice it's pd.Series and not pd.series.
Where do you use a pandas series in real life? Let's say you want to list 3 Jim Carrey movies. We can use a series to represent this. Let's see how.
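A sketch of how this might look is below. The three titles are my own picks for illustration; any three movies would work the same way.

```python
import pandas as pd

# Three Jim Carrey movies (titles chosen for illustration)
movies = pd.Series(["Ace Ventura: Pet Detective", "The Mask", "Dumb and Dumber"])

print(movies)

# Items are looked up by their label, which here is just the position
print(movies[0])
print(movies.index.tolist())
```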
Since we didn’t specify an index, the series is assigned the default index 0, 1, 2, 3, etc.
But there is more to it. We can give a series a custom index and even a name.
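The snippet below is one way to do this, using the `index` and `name` arguments of `pd.Series`. The labels, the name, and the movie titles are hypothetical values chosen for illustration.

```python
import pandas as pd

movies = pd.Series(
    ["Ace Ventura: Pet Detective", "The Mask", "Dumb and Dumber"],
    index=["first", "second", "third"],  # custom labels (illustrative)
    name="jim_carrey_movies",            # name of the series (illustrative)
)

print(movies)

# Look up an item by its custom label
print(movies["second"])

# The index and the name are available as attributes
print(movies.index.tolist())
print(movies.name)
```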
Take special note of the index and name of a series, because these are very useful when you are manipulating data.
For those of you who are coming from the Python world, you can also create a series from a dictionary.
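As a small sketch, the dictionary keys become the index and the dictionary values become the data. The movies and the series name below are picked for illustration.

```python
import pandas as pd

# Movie title -> release year (entries chosen for illustration)
release_years = {
    "The Mask": 1994,
    "The Truman Show": 1998,
    "Bruce Almighty": 2003,
}

# Keys become the index, values become the data
years = pd.Series(release_years, name="release_year")

print(years)

# Look up a value by its key, just like in the dictionary
print(years["The Truman Show"])
print(years.index.tolist())
```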
I will cover the pandas DataFrame in the next post. Stay tuned! Also, let me know your comments below.