Introduction To Pandas For Data Manipulation

Introduction to Pandas for Data Manipulation in Python

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s a must-know tool for anyone serious about using Python for data analysis or data science. This article introduces the core features of pandas, from installation and basic data structures through to grouping, aggregating, and combining data.


Along the way, you’ll see short, easy-to-remember commands for tasks that often need considerably more code in other programming languages. So, let’s get started!

Table of Contents:

  • What is Pandas?
  • Installation of Pandas
  • Data Structures in Pandas
  • Data Manipulation using Pandas
  • Important Functions in Pandas
  • Working with Missing Data
  • Grouping and Aggregation in Pandas
  • Merging, Joining and Concatenating DataFrames
  • Conclusion

What is Pandas?

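The name pandas is derived from ‘panel data’, an econometrics term for structured, multidimensional datasets, and is also a play on ‘Python data analysis’. The library is built on top of NumPy and provides the data structures and functions needed to manipulate structured data. The main structures provided by pandas are Series and DataFrames, which we will cover later in this article. The examples that follow assume the two conventional imports below: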

import pandas as pd
import numpy as np

Installation of Pandas

If pandas is not already installed on your system, you can install it easily using pip:

pip install pandas

Or with conda, if you’re using the Anaconda distribution:

conda install pandas

Data Structures in Pandas

Pandas provides two main data structures: Series and DataFrame.

Series

A Series is a one-dimensional labeled array capable of holding any data type. You can create one by passing a list of values, in which case pandas assigns a default integer index:

s = pd.Series([1,3,5,np.nan,6,8])
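To make the idea of a labeled array more concrete, here is a small sketch that inspects the Series above and builds a second one with custom labels. The `prices` Series and its fruit labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the example above; np.nan marks a missing entry.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s.dtype)   # float64, because NaN forces a floating-point dtype
print(s.index)   # default integer labels 0 through 5

# A Series can also carry custom labels (hypothetical data).
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "pear", "plum"])
print(prices["pear"])  # look up a value by its label
```

The label-based lookup is what separates a Series from a plain NumPy array: values travel with their index.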

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. DataFrames are generally the most commonly used pandas object. You can think of it as a spreadsheet or SQL table, or a dictionary of Series objects.

df = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
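As a rough sketch of what that constructor call produces, the block below rebuilds the same DataFrame and inspects its shape and column types. It only uses the values from the example above; the point is that scalar values such as 1. and 'foo' are repeated for every row, while each column keeps its own dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1.0,                                    # scalar, repeated for each row
    "B": pd.Timestamp("20130102"),               # scalar timestamp, repeated
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                  # scalar string, repeated
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # one dtype per column
```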

Data Manipulation using Pandas

Loading, editing, and viewing data in pandas DataFrame is a core part of data analysis in Python. Let’s discuss some of these functionalities.

Loading Data

If you have a CSV file, it’s simple to load this into a DataFrame:

df = pd.read_csv('mydata.csv')
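If you don’t have a CSV file at hand, you can try the same call on in-memory text. The sketch below is only an illustration: the column names and values are invented, and io.StringIO stands in for a real file such as 'mydata.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content; the empty field in the last row is a missing score.
csv_text = """name,score,joined
Ana,91.5,2021-03-01
Ben,78.0,2021-04-15
Cy,,2021-05-20
"""

# parse_dates asks pandas to convert the 'joined' column to datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["joined"])

print(df)
print(df.dtypes)
```

Empty fields are read as missing values, which is why the score column ends up as a floating-point column.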

Viewing Data

To see the top rows of the DataFrame, use the ‘head’ method, which returns the first five rows unless you pass a different number:

df.head()

To see the last rows, use the ‘tail’ method, here limited to the last three:

df.tail(3)
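A quick way to check what these two calls return is to run them on a small frame whose contents you already know. The ten-row DataFrame below is a made-up example.

```python
import pandas as pd

# Hypothetical ten-row frame with values 0..9 in a single column.
df = pd.DataFrame({"n": range(10)})

top = df.head()     # no argument: first five rows
last = df.tail(3)   # last three rows

print(top)
print(last)
```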

Selection of Data

You can select data from a DataFrame in multiple ways. Here are some examples:

# Selecting a single column 
df['A']

# Selecting via [], which slices the rows.
df[0:3]

# Selecting on a multi-axis by label
df.loc[:,['A','B']]
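The selection styles above can be compared side by side on a tiny frame. This is a sketch with invented data; it also shows two related tools that the examples above do not cover, position-based selection with iloc and filtering with a boolean condition.

```python
import pandas as pd

# Hypothetical data with string row labels so labels and positions differ.
df = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

col = df["A"]                            # a single column, returned as a Series
first_rows = df[0:2]                     # row slice by position: rows 'w' and 'x'
by_label = df.loc["x":"y", ["A", "B"]]   # label slice, both ends included
by_pos = df.iloc[1:3, 0]                 # position slice, end excluded
filtered = df[df["A"] > 20]              # keep rows where the condition is True

print(by_label)
print(filtered)
```

Note the difference in slicing rules: loc includes the end label, while iloc and plain [] slices exclude the end position.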

Setting Data

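Setting a new column aligns the data by index: each value of the Series goes to the row with the same index label, and rows without a matching label get NaN. The example below assumes df is indexed by dates that overlap the range used for s1:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
df['F'] = s1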
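Index alignment is easy to miss, so here is a minimal sketch that makes it visible. The dates and values are invented: the DataFrame starts one day earlier than the Series, so the first row has no matching label.

```python
import pandas as pd

# Frame indexed by 2013-01-01 .. 2013-01-04 (hypothetical data).
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("20130101", periods=4),
)

# Series indexed by 2013-01-02 .. 2013-01-05, shifted by one day.
s1 = pd.Series([10, 20, 30, 40], index=pd.date_range("20130102", periods=4))

# Values are matched by date, not by position.
df["F"] = s1
print(df)
```

The row for 2013-01-01 gets NaN, and the Series value for 2013-01-05 is dropped because the frame has no such row. If the two indexes share no labels at all, the whole new column is NaN.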

Important Functions in Pandas

There are many functions provided by pandas for different purposes. The most commonly used functions are:

  • mean(): calculates the arithmetic mean of each column.
  • sum(): calculates the sum of each column.
  • max(): finds the highest value in each column.
  • min(): finds the smallest value in each column.
  • std(): calculates the standard deviation of each column.

# numeric_only=True skips non-numeric columns such as text or categoricals,
# which can otherwise raise a TypeError in recent pandas versions
df.mean(numeric_only=True)
df.sum(numeric_only=True)
df.max(numeric_only=True)
df.min(numeric_only=True)
df.std(numeric_only=True)
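Here is a small sketch of these functions on made-up data. It assumes a frame with two numeric columns and one text column, and passes numeric_only=True so that the text column is left out of the calculation.

```python
import pandas as pd

# Hypothetical frame: two numeric columns and one text column.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2.0, 4.0, 6.0, 8.0],
    "label": ["a", "b", "c", "d"],
})

means = df.mean(numeric_only=True)   # one value per numeric column
stds = df.std(numeric_only=True)     # sample standard deviation (ddof=1)
highest = df["y"].max()              # the same functions work on one column
row_totals = df[["x", "y"]].sum(axis=1)  # axis=1 aggregates across each row

print(means)
print(row_totals)
```

By default these functions work column by column and return a Series indexed by column name; axis=1 switches to row-wise aggregation.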

Working with Missing Data

Pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations such as sum() and mean().

df.dropna(how='any')   # Drops rows with missing data
df.fillna(value=5)    # Fills missing data with a value
pd.isna(df)          # Indicates whether values are missing
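The sketch below runs the three calls on a small invented frame so you can see what each one returns. Note that dropna and fillna return new DataFrames and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

dropped = df.dropna(how="any")   # keeps only rows with no missing values
filled = df.fillna(value=5)      # replaces every NaN with 5
mask = pd.isna(df)               # True where a value is missing
total_a = df["a"].sum()          # NaN is skipped, so this is 1.0 + 3.0

print(dropped)
print(filled)
print(mask)
```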

Grouping and Aggregation in Pandas

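Grouping and aggregation are an important part of data analysis: rows are split into groups by the values in one or more columns, and a summary function is then applied to each group. Pandas provides powerful tools to perform these operations.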

df.groupby('A').sum(numeric_only=True)  # Groups by column 'A' and sums the other numeric columns
df.groupby(['A','B']).mean(numeric_only=True)  # Groups by multiple columns and averages the numeric columns
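To see what a grouped result looks like, here is a minimal sketch on invented data. The column names 'A' and 'B' follow the example above; the 'val' column is hypothetical.

```python
import pandas as pd

# Hypothetical data: two grouping columns and one numeric column.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": ["p", "q", "p", "p"],
    "val": [1, 2, 3, 4],
})

# One sum per distinct value of 'A'.
sums = df.groupby("A")["val"].sum()

# One mean per distinct ('A', 'B') pair; the result has a two-level index.
means = df.groupby(["A", "B"])["val"].mean()

# Several aggregations at once with agg.
summary = df.groupby("A")["val"].agg(["min", "max"])

print(sums)
print(means)
print(summary)
```

Selecting the column before aggregating, as in ["val"], keeps the result focused on the values you care about and avoids errors from non-numeric columns.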

Merging, Joining and Concatenating DataFrames

Pandas provides various ways to combine DataFrames including merge, join, and concat.

# Concatenation
result = pd.concat([df1, df2])

# Merge
result = pd.merge(left, right, on='key')

# Join (matches the 'key' column of left against the index of right)
result = left.join(right.set_index('key'), on='key')
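The three operations behave quite differently, which is easiest to see on two small frames. The sketch below uses invented data; the names left and right follow the snippets above, and the columns 'l' and 'r' are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share a 'key' column.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["k0", "k1", "k3"], "r": [10, 20, 30]})

# concat stacks frames on top of each other; ignore_index renumbers the rows.
stacked = pd.concat([left, left], ignore_index=True)

# merge matches rows on the 'key' column; the default is an inner join,
# so only keys present in both frames survive.
merged = pd.merge(left, right, on="key")

# join matches left's 'key' column against right's index and defaults to
# a left join, so every row of left is kept.
joined = left.join(right.set_index("key"), on="key")

print(stacked)
print(merged)
print(joined)
```

In the joined result, the row for 'k2' has NaN in column 'r' because right has no such key. pd.merge accepts a how argument ('left', 'right', 'outer', 'inner') if you need a different kind of join.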

Conclusion

We have covered a lot here, but pandas is an extensive library that has a multitude of functions available to make our jobs easier. The best way to become proficient is to practice, take some data, and play around with pandas functions. You’ll be surprised how much you can achieve.

Pandas provides you with the necessary tools to clean, process, manipulate, aggregate, and visualize data, while being easy to use and versatile enough to handle most tasks on datasets that fit in memory. Whether you’re a new data analyst or planning to pivot into this exciting field, getting familiar with Python pandas is a good first step on your journey.
