Python Data Cleansing by Pandas & Numpy | Python Data Operations


1. Objective

In our last Python tutorial, we studied Aggregation and Data Wrangling with Python. Today’s Python Data Cleansing tutorial aims to deliver a brief introduction to the operations of data cleansing and how to carry them out on your data in Python. For this purpose, we will use two libraries- pandas and numpy. Moreover, we will discuss different ways to cleanse missing data.

So, let’s start the Python Data Cleansing.

2. Python Data Cleansing – Prerequisites

As mentioned earlier, we will need two libraries for Python Data Cleansing – Python pandas and Python numpy.

a. Pandas

Python pandas is an excellent software library for manipulating data and analyzing it. It will let us manipulate numerical tables and time series using data structures and operations.

You can install it using pip-

C:\Users\lifei>pip install pandas

b. Numpy

Python numpy is another library we will use here. It lets us handle arrays and matrices, especially multidimensional ones. It also provides several high-level mathematical functions to help us operate on these.

Use the following command in the command prompt to install Python numpy on your machine-

C:\Users\lifei>pip install numpy

3. Operations on Data using NumPy

Using Python NumPy, let’s create an array (an n-dimensional array).

>>> import numpy as np
>>> np.array(['a','b','c','d','e'],ndmin=2)
array([['a', 'b', 'c', 'd', 'e']], dtype='<U1')
>>> np.array([['a','b'],['c','d','e']],dtype=object)  # rows of unequal length need dtype=object in NumPy 1.24+
array([list(['a', 'b']), list(['c', 'd', 'e'])], dtype=object)
>>> np.array(['a','b','c','d','e'],ndmin=1)
array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')
>>> np.array([1,2,7,9,8],dtype=complex)
array([1.+0.j, 2.+0.j, 7.+0.j, 9.+0.j, 8.+0.j])

While dtype tells NumPy which data type to use for the elements, ndmin lets us define the minimum number of dimensions.
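
As a minimal sketch of what these two parameters do, the snippet below builds a few small arrays and inspects their shape and dtype. The variable names (row, as_float, mixed) are ours and only serve the illustration.

```python
import numpy as np

# ndmin prepends axes of length 1 until the array has at least that many dimensions.
row = np.array([1, 2, 3], ndmin=2)
print(row.shape)          # (1, 3)

# dtype forces the element type; here the integers are stored as floats.
as_float = np.array([1, 2, 3], dtype=float)
print(as_float.dtype)     # float64

# Without dtype, NumPy picks a common type; mixing str and int yields strings.
mixed = np.array(["a", 2])
print(mixed.dtype.kind)   # U
```

This is also why the array a in the next example ends up with a string dtype even though one of its items was written as the integer 2.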

The following attributes will give us information about the array-

>>> a=np.array(['a','b',2,'3.0'])
>>> a
array(['a', 'b', '2', '3.0'], dtype='<U3')
>>> type(a)
<class 'numpy.ndarray'>
>>> a.ndim
1
>>> a.shape
(4,)
>>> a.size
4
>>> a.dtype
dtype('<U3')

We can also perform operations like:

>>> b=np.array([[1,2,3],[4,5,6]])
>>> b
array([[1, 2, 3],
       [4, 5, 6]])
>>> b.flatten()
array([1, 2, 3, 4, 5, 6])
>>> b.reshape(3,2)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> b[:2,::2]
array([[1, 3],
       [4, 6]])
>>> b-4
array([[-3, -2, -1],
       [ 0,  1,  2]])
>>> b.sum()
21
>>> b-2*b
array([[-1, -2, -3],
       [-4, -5, -6]])
>>> np.sort(np.array([[3,2,1],[5,2,4]]))
array([[1, 2, 3],
       [2, 4, 5]])
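
Several of the operations above also accept an axis argument, which the original examples do not show. The following is a small sketch, reusing the same two arrays, of how axis changes the result. The names col_sums, row_sums and m are ours.

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one sum per column.
col_sums = b.sum(axis=0)
print(col_sums)   # [5 7 9]

# axis=1 collapses the columns, giving one sum per row.
row_sums = b.sum(axis=1)
print(row_sums)   # [ 6 15]

# np.sort works along the last axis by default (each row separately);
# axis=None flattens the array first and sorts all the values together.
m = np.array([[3, 2, 1], [5, 2, 4]])
print(np.sort(m, axis=None))   # [1 2 2 3 4 5]
```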

4. Operations on Data Using pandas

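pandas has traditionally offered three types to hold data- DataFrame, Panel, and Series. Of these, Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so current versions provide only Series and DataFrame.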

a. DataFrame

Pandas DataFrame is a data structure that holds data in two dimensions- as rows and columns. We have the following syntax-

pandas.DataFrame(data, index, columns, dtype, copy)

Now let’s try an example-

>>> import pandas as pd
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
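
To illustrate the index and columns arguments from the syntax above, here is a short sketch on a two-row version of the same data. It only assumes the standard pandas.DataFrame constructor; the reduced dataset is ours.

```python
import pandas as pd

data = {"Element": ["Silver", "Gold"], "Atomic Number": [47, 79]}

# columns selects and orders the dict keys; index supplies the row labels.
frame = pd.DataFrame(
    data,
    index=["element 1", "element 2"],
    columns=["Atomic Number", "Element"],
)

print(list(frame.columns))                # ['Atomic Number', 'Element']
print(frame.shape)                        # (2, 2)
print(frame.loc["element 2", "Element"])  # Gold
```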

b. Panel

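Pandas panel holds data in three dimensions. Etymologically, the term “panel data” is one source of the name pandas. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the code in this section runs only on older versions. A panel has the following syntax:

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

Let’s take an example.
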
>>> data={'Red':pd.DataFrame(np.random.randn(4,2)),
'Blue':pd.DataFrame(np.random.randn(4,3))}
>>> pd.Panel(data)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Blue to Red
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
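
Since Panel is not available in current pandas, one common substitute is a DataFrame with a MultiIndex. The sketch below is our own adaptation of the example above, not part of the original tutorial: it stacks the same two frames with pd.concat, and the level names item and row are chosen by us.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
data = {
    "Red": pd.DataFrame(rng.standard_normal((4, 2))),
    "Blue": pd.DataFrame(rng.standard_normal((4, 3))),
}

# Passing a dict to concat uses its keys as the outer index level.
stacked = pd.concat(data, names=["item", "row"])

print(stacked.shape)        # (8, 3)
print(stacked.loc["Red"])   # column 2 is NaN here, because 'Red' has only two columns
```

Note that the missing third column of 'Red' shows up as NaN, just as the Panel example pads the shorter frame to 3 columns on the minor axis.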

c. Series

Pandas Series holds data in one dimension, in a labeled format. The index is the set of axis labels we use.

It has the following syntax-

pandas.Series(data, index, dtype, copy)

Let’s take an example.

>>> data=np.array([1,2,3,3,4])
>>> pd.Series(data)

0    1
1    2
2    3
3    3
4    4
dtype: int32

Let’s take another example.

>>> pd.Series(np.array(['a','c','b']))

0    a
1    c
2    b
dtype: object
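
Both examples above rely on the default integer index. As a small sketch of the index argument from the syntax, the snippet below labels the values itself and then reads them back by label and by position. The data is borrowed from the DataFrame example, and the name atomic is ours.

```python
import pandas as pd

atomic = pd.Series([47, 79, 78], index=["Silver", "Gold", "Platinum"])

print(atomic["Gold"])      # 79, looked up by label
print(atomic.iloc[0])      # 47, looked up by position
print(list(atomic.index))  # ['Silver', 'Gold', 'Platinum']
```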

Using these data structures, we can manipulate data in many ways-

>>> frame.iloc[0:2,:]
>>> frame.describe()
>>> frame.rank()
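
For readers who want to check the three calls above without the screenshots, here is a self-contained sketch on the elements frame from the DataFrame section. The intermediate names (first_two, stats, ranks) are ours.

```python
import pandas as pd

data = {"Element": ["Silver", "Gold", "Platinum", "Copper"],
        "Atomic Number": [47, 79, 78, 29]}
frame = pd.DataFrame(data, index=["element 1", "element 2", "element 3", "element 4"])

# iloc slices by position: the first two rows, all columns.
first_two = frame.iloc[0:2, :]
print(first_two.index.tolist())             # ['element 1', 'element 2']

# describe() summarises the numeric columns only, by default.
stats = frame.describe()
print(stats.loc["mean", "Atomic Number"])   # 58.25

# rank() assigns 1 to the smallest value, 2 to the next, and so on.
ranks = frame["Atomic Number"].rank()
print(ranks.tolist())                       # [2.0, 4.0, 3.0, 1.0]
```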

This is all for now; we will learn about the libraries pandas and numpy in their own tutorials.

5. Python Data Cleansing

When some part of our data is missing, for whatever reason, the accuracy of our predictions can suffer. In our article on data wrangling and aggregation, we discussed missing data and how to drop it. Let’s see how else we can deal with this issue.

Consider a real-world situation like the comment section of our website. The name and email are mandatory, but the input for ‘website’ can be left empty, since some users may not run a website at all. In ways like this and others, we may end up with missing data in some places. How should we go about this? Let’s find out.

Python pandas depicts a missing value as NaN, which is short for Not a Number. When we call the reindex() method with labels that are not in the original index, pandas adds rows for those labels and fills them with NaN.

>>> frame=pd.DataFrame(np.random.randn(4,3),index=[1,2,4,7],columns=['A','B','C'])
>>> frame.reindex([1,2,3,4,5,6,7])
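
Because the frame above is filled with random numbers, its output changes on every run. The sketch below repeats the same reindex() call on fixed values (our substitution, purely for repeatability) and lists which row labels ended up as NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                     index=[1, 2, 4, 7], columns=["A", "B", "C"])
expanded = frame.reindex([1, 2, 3, 4, 5, 6, 7])
print(expanded)

# Labels 3, 5 and 6 were not in the original index, so their rows are all NaN.
missing_rows = expanded.index[expanded.isnull().all(axis=1)].tolist()
print(missing_rows)   # [3, 5, 6]
```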

a. Finding which columns have missing values

In the tutorial on wrangling, we saw how to find out which columns have missing values-

>>> frame=frame.reindex([1,2,3,4,5,6,7])
>>> frame['B'].isnull()
1    False
2    False
3     True
4    False
5     True
6     True
7    False
Name: B, dtype: bool
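
The call above inspects a single column. To answer the heading’s question for the whole frame at once, one possible approach is to combine isnull() with any() or sum(). The small frame below is made up for the illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                      "B": [np.nan, 5.0, np.nan],
                      "C": [7.0, np.nan, 9.0]})

has_missing = frame.isnull().any()     # one boolean per column
missing_count = frame.isnull().sum()   # number of NaN values per column

print(has_missing[has_missing].index.tolist())   # ['B', 'C']
print(missing_count.to_dict())                   # {'A': 0, 'B': 2, 'C': 1}
```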

6. Ways to Cleanse Missing Data in Python

To perform a Python data cleansing, you can drop the missing values, replace them, replace each NaN with a scalar value, or fill forward or backward.

a. Dropping Missing Values

You can exclude missing values from your dataset using the dropna() method.

>>> frame.dropna()

By default, dropna() works along axis=0, which means it drops every row that contains at least one NaN value.
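
Dropping whole rows is not always what we want. As a sketch on a small made-up frame, the axis and subset arguments of dropna() give finer control:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                      "B": [4.0, np.nan, np.nan],
                      "C": [7.0, np.nan, 9.0]})

rows_any = frame.dropna()               # default: drop rows with any NaN
cols_any = frame.dropna(axis=1)         # drop columns with any NaN instead
rows_c = frame.dropna(subset=["C"])     # only look at column C when deciding

print(rows_any.index.tolist())    # [0]
print(cols_any.columns.tolist())  # ['A']
print(rows_c.index.tolist())      # [0, 2]
```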

b. Replacing Missing Values

To replace each NaN we have in the dataset, we can use the replace() method.

>>> from numpy import nan  # the NaN alias was removed in NumPy 2.0
>>> frame.replace({nan:0.00})

In the same way, replace() can substitute any other value that appears in the dataset, not just NaN.
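
A typical case is a placeholder value that stands in for missing data. In the sketch below, the sentinel -999 and the frame itself are assumptions made for the example. We first turn the sentinel into NaN and then replace every NaN with 0.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, -999.0],
                      "B": [np.nan, 5.0, 6.0]})

# Step 1: treat the sentinel as missing data.
with_nan = frame.replace(-999.0, np.nan)
print(int(with_nan.isnull().sum().sum()))   # 3

# Step 2: replace every NaN with 0, as in the example above.
zeroed = with_nan.replace({np.nan: 0.0})
print(zeroed["A"].tolist())                 # [1.0, 0.0, 0.0]
```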

c. Replacing with a Scalar Value

We can use the fillna() method for this.

>>> frame.fillna(7)
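
A single scalar is not the only option: fillna() also accepts a dict or a Series keyed by column name, which allows a different fill value per column. The following sketch uses a made-up frame and fills once with fixed values and once with each column’s mean.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                      "B": [np.nan, 4.0, 8.0]})

# A dict maps each column name to its own fill value.
by_column = frame.fillna({"A": 0.0, "B": -1.0})
print(by_column["A"].tolist())   # [1.0, 0.0, 3.0]
print(by_column["B"].tolist())   # [-1.0, 4.0, 8.0]

# frame.mean() is a Series indexed by column, so each NaN gets its column's mean.
by_mean = frame.fillna(frame.mean())
print(by_mean["A"].tolist())     # [1.0, 2.0, 3.0]
print(by_mean["B"].tolist())     # [6.0, 4.0, 8.0]
```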

d. Filling Forward or Backward

If we supply a method parameter to the fillna() method, we can fill forward or backward as we need. To fill forward, use the method pad or ffill, and to fill backward, use bfill or backfill.

>>> frame.fillna(method='pad')
>>> frame.fillna(method='backfill')
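
To our knowledge, recent pandas releases (2.1 and later) deprecate the method argument of fillna() in favour of the dedicated ffill() and bfill() methods. The sketch below uses those on a made-up Series so the effect of each direction is easy to see. The limit argument caps how many consecutive NaN values get filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()           # carry the last seen value forward
backward = s.bfill()          # pull the next seen value backward
limited = s.ffill(limit=1)    # fill at most one NaN in a row

print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())   # the last item stays NaN: nothing follows it
print(limited.tolist())    # the second NaN of the first gap stays NaN
```

A leading NaN cannot be filled forward and a trailing NaN cannot be filled backward, so the two are sometimes chained when no gap may remain.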

7. Python Data Cleansing – Other Operations

While cleaning data, we may also need to find out more about it and manipulate it. Below, we make use of some of these operations.

>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
>>> frame['Atomic Number'].is_unique
True

>>> frame.head()
>>> frame.head(2)
>>> frame.tail(3)
>>> frame.loc['element 2']
>>> frame.get_dtype_counts()  # removed in pandas 1.0; use frame.dtypes.value_counts() there
int64     1
object    1
dtype: int64

a. Renaming Columns

To rename a column, you can use the rename() method.

>>> frame.rename(columns={'Atomic Number':'Number','Element':'Name'},inplace=True)
>>> frame

b. Making Changes Stay

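Also, throughout this “Python Data Cleansing” tutorial, most of the methods we called (such as dropna(), replace() and fillna()) returned a new object and did not modify the original frame. To make a change stay, either assign the result back to the variable, as we did with reindex(), or pass inplace=True where the method supports it, as we did with rename().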
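
The difference is easy to verify. In this minimal sketch on a made-up one-column frame, the first fillna() call has no lasting effect, while reassignment and inplace=True both do.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

frame.fillna(0.0)                    # the result is discarded; frame is unchanged
print(frame["A"].isnull().sum())     # 1

frame = frame.fillna(0.0)            # assigning the result back makes it stay
print(frame["A"].isnull().sum())     # 0

frame.rename(columns={"A": "Value"}, inplace=True)   # in-place alternative
print(list(frame.columns))           # ['Value']
```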

So, this was all about Python Data Cleansing Tutorial. Hope you like our explanation.

8. Conclusion

Hence, in this Python Data Cleansing tutorial, we learned how data is cleansed in the Python programming language. For this purpose, we used two libraries- pandas and numpy. Data scientists are often said to spend about 80% of their time cleaning and manipulating data, which makes this an essential skill to learn for data science. Tell us what you think in the comments below.
