# Python Data Cleansing by Pandas & Numpy | Python Data Operations


## 1. Python Data Cleansing – Objective

In our last Python tutorial, we studied Aggregation and Data Wrangling with Python. Today’s Python Data Cleansing tutorial gives a brief introduction to data cleansing operations and to handling your data in Python. For this purpose, we will use two libraries: pandas and numpy. We will also discuss different ways to cleanse missing data.
So, let’s start the Python Data Cleansing tutorial.

## 2. Python Data Cleansing – Prerequisites

As mentioned earlier, we will need two libraries for Python Data Cleansing – Python pandas and Python numpy.

### a. Pandas

Python pandas is an excellent software library for manipulating data and analyzing it. It will let us manipulate numerical tables and time series using data structures and operations.

You can install it using pip-

`C:\Users\lifei>pip install pandas`


### b. Numpy

Python numpy is another library we will use here. It lets us handle arrays and matrices, especially multidimensional ones. It also provides several high-level mathematical functions to help us operate on these.

Use the following command in the command prompt to install Python numpy on your machine-

`C:\Users\lifei>pip install numpy`

## 3. Python Data Cleansing Operations on Data using NumPy

Using Python NumPy, let’s create an array (an n-dimensional array).

```
>>> import numpy as np
>>> np.array(['a','b','c','d','e'],ndmin=2)
array([['a', 'b', 'c', 'd', 'e']], dtype='<U1')
>>> # Rows of unequal length need dtype=object in current NumPy releases
>>> np.array([['a','b'],['c','d','e']],dtype=object)
array([list(['a', 'b']), list(['c', 'd', 'e'])], dtype=object)
>>> np.array(['a','b','c','d','e'],ndmin=1)
array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')
```

```
>>> np.array([1,2,7,9,8],dtype=complex)
array([1.+0.j, 2.+0.j, 7.+0.j, 9.+0.j, 8.+0.j])
```

While dtype tells NumPy which data type to use for the elements, ndmin lets us define the minimum number of dimensions.
The following attributes give us information about the array-

```
>>> a=np.array(['a','b',2,'3.0'])
>>> a
array(['a', 'b', '2', '3.0'], dtype='<U3')
>>> type(a)
<class 'numpy.ndarray'>
>>> a.ndim
1
>>> a.shape
(4,)
>>> a.size
4
>>> a.dtype
dtype('<U3')
```

We can also perform operations like:

```
>>> b=np.array([[1,2,3],[4,5,6]])
>>> b
array([[1, 2, 3],
       [4, 5, 6]])
>>> b.flatten()
array([1, 2, 3, 4, 5, 6])
>>> b.reshape(3,2)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> b[:2,::2]
array([[1, 3],
       [4, 6]])
>>> b-4
array([[-3, -2, -1],
       [ 0,  1,  2]])
>>> b.sum()
21
>>> b-2*b
array([[-1, -2, -3],
       [-4, -5, -6]])
>>> np.sort(np.array([[3,2,1],[5,2,4]]))
array([[1, 2, 3],
       [2, 4, 5]])
```
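Since this tutorial is about cleansing, it may help to see how NumPy itself marks and skips missing values. The following is a minimal sketch; the `readings` array is made-up data used only for illustration.

```python
import numpy as np

# Hypothetical readings; np.nan marks the entries that are missing.
readings = np.array([2.5, np.nan, 4.0, np.nan, 3.5])

# np.isnan returns a boolean mask that is True where a value is missing.
missing_mask = np.isnan(readings)

# Boolean indexing with the inverted mask keeps only the present entries.
present = readings[~missing_mask]

# np.nanmean ignores NaN entries, whereas a plain mean() would return nan.
mean_ignoring_nan = np.nanmean(readings)

print(int(missing_mask.sum()))  # number of missing entries
print(present)
print(mean_ignoring_nan)
```

A mask like this is also what pandas builds internally when we call isnull() later in the tutorial.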

## 4. Python Data Cleansing Operations on Data Using pandas

Pandas has traditionally offered three types to hold data- DataFrame, Panel, and Series. Of these, Panel is no longer available in current pandas releases.

### a. DataFrame

Pandas DataFrame is a data structure that holds data in two dimensions- as rows and columns. We have the following syntax-

`pandas.DataFrame(data, index, columns, dtype, copy)`

Now let’s try an example-

```
>>> import pandas as pd
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
```
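To check what the constructor produced, we can inspect the frame and look up a single cell. This is a small sketch built on the same data as above; the variable names `labels` and `gold_number` are ours.

```python
import pandas as pd

data = {
    "Element": ["Silver", "Gold", "Platinum", "Copper"],
    "Atomic Number": [47, 79, 78, 29],
}
labels = ["element 1", "element 2", "element 3", "element 4"]
frame = pd.DataFrame(data, index=labels)

print(frame.shape)          # (number of rows, number of columns)
print(list(frame.columns))  # column names come from the dictionary keys

# .loc selects by row label and column name.
gold_number = frame.loc["element 2", "Atomic Number"]
print(gold_number)
```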


### b. Panel

Pandas Panel holds data in three dimensions. Etymologically, the term "panel data" is one source of the name pandas. Panel was deprecated in pandas 0.20 and removed in 0.25, so the example below runs only on older releases. A panel has the following syntax:

`pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)`

```
>>> data={'Red':pd.DataFrame(np.random.randn(4,2)),
... 'Blue':pd.DataFrame(np.random.randn(4,3))}
>>> pd.Panel(data)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Blue to Red
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
```
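On pandas versions without Panel, one possible substitute (our assumption, not the only option) is to stack the same dictionary of frames with pd.concat(), which uses the dictionary keys as an outer index level. The seed and the level names `item` and `row` are arbitrary choices for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
data = {
    "Red": pd.DataFrame(rng.standard_normal((4, 2))),
    "Blue": pd.DataFrame(rng.standard_normal((4, 3))),
}

# Concatenating a dict of frames turns the keys into the outer index level.
stacked = pd.concat(data, names=["item", "row"])

print(stacked.index.nlevels)      # two levels: item and row
print(stacked.loc["Blue"].shape)  # one item can be pulled back out by key

# "Red" has only two columns, so its third column is filled with NaN.
print(stacked.loc["Red"])
```

As with the Panel output above, the smaller frame is padded to the common width, which gives us missing values to cleanse.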

### c. Series

Pandas Series holds data in one dimension, in a labeled format. The index is the set of axis labels we use.
It has the following syntax-

`pandas.Series(data, index, dtype, copy)`

Let’s take an example.

```
>>> data=np.array([1,2,3,3,4])
>>> pd.Series(data)
0    1
1    2
2    3
3    3
4    4
dtype: int32
```

Let’s take another example.

```
>>> pd.Series(np.array(['a','c','b']))
0    a
1    c
2    b
dtype: object
```
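The index parameter from the syntax above lets us replace the default 0, 1, 2 labels with our own. Here is a brief sketch; the element names used as labels are illustrative.

```python
import pandas as pd

# Custom axis labels instead of the default integer positions.
atomic_numbers = pd.Series(
    [47, 79, 78],
    index=["Silver", "Gold", "Platinum"],
    dtype="int64",
)

# Look up a value by its label.
gold = atomic_numbers["Gold"]

# Filter with a boolean condition; the labels travel with the values.
heavy = atomic_numbers[atomic_numbers > 50]

print(gold)
print(heavy)
```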
Using these data structures, we can manipulate data in many ways-

`>>> frame.iloc[0:2,:]`
`>>> frame.describe()`
`>>> frame.rank()`
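To make these three calls concrete, here is a self-contained sketch that rebuilds the elements frame and stores each result. The names `first_two`, `summary`, and `ranks` are ours, and we rank a single column to keep the result easy to read.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Element": ["Silver", "Gold", "Platinum", "Copper"],
        "Atomic Number": [47, 79, 78, 29],
    },
    index=["element 1", "element 2", "element 3", "element 4"],
)

# iloc selects by position: rows 0 and 1, all columns.
first_two = frame.iloc[0:2, :]

# describe() summarizes numeric columns only by default.
summary = frame.describe()

# rank() assigns 1.0 to the smallest value in the column.
ranks = frame["Atomic Number"].rank()

print(first_two)
print(summary)
print(ranks)
```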

This is all for now; we will learn about the libraries pandas and numpy in their own tutorials.

## 5. Python Data Cleansing

When some part of our data is missing, for whatever reason, the accuracy of our predictions drops. In our article on data wrangling and aggregation, we discussed missing data and how to drop it. Let’s see how we can deal with this issue.

Consider a real-world situation like the comment section of our website. The name and email are mandatory, but the input for ‘website’ can be left empty, since some users do not run a website and have nothing to fill in. In ways like this and others, we may end up with missing data in some places. How should we go about this? Let’s find out.
Python pandas depicts a missing value as NaN, which is short for Not a Number. For example, calling the reindex() method with labels that are not in the original index creates rows filled with NaN.

```
>>> frame=pd.DataFrame(np.random.randn(4,3),index=[1,2,4,7],columns=['A','B','C'])
>>> frame.reindex([1,2,3,4,5,6,7])
```
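The sketch below repeats the reindex() call with a seeded random generator so the run is repeatable, and then checks which rows were created. The seed and the names `reindexed`, `new_labels`, and `zero_filled` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for repeatability
frame = pd.DataFrame(
    rng.standard_normal((4, 3)),
    index=[1, 2, 4, 7],
    columns=["A", "B", "C"],
)

# Labels 3, 5 and 6 do not exist yet, so their rows come back as NaN.
reindexed = frame.reindex([1, 2, 3, 4, 5, 6, 7])
new_labels = reindexed.index.difference(frame.index)
print(reindexed.loc[new_labels])

# fill_value supplies a default for the new rows instead of NaN.
zero_filled = frame.reindex(range(1, 8), fill_value=0.0)
print(zero_filled.loc[3])
```

Note that reindex() returns a new frame; the original `frame` still has four rows.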

### a. Finding which columns have missing values

In the tutorial on wrangling, we saw how to find out which columns have missing values-

```
>>> frame=frame.reindex([1,2,3,4,5,6,7])
>>> frame['B'].isnull()
1    False
2    False
3     True
4    False
5     True
6     True
7    False
Name: B, dtype: bool
```
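The call above checks one column at a time. To survey every column at once, we can combine isnull() with any() or sum(). This is a minimal sketch on a small made-up frame; the variable names are ours.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: column B is complete, A and C have gaps.
frame = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0],
        "B": [4.0, 5.0, 6.0],
        "C": [np.nan, np.nan, 9.0],
    }
)

# True for each column that contains at least one missing value.
has_missing = frame.isnull().any()

# Number of missing values per column.
missing_counts = frame.isnull().sum()

# Just the names of the affected columns.
columns_with_missing = has_missing[has_missing].index.tolist()

print(missing_counts)
print(columns_with_missing)
```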

## 6. Ways to Cleanse Missing Data in Python

To perform a Python data cleansing, you can drop the missing values, replace them, replace each NaN with a scalar value, or fill forward or backward.

### a. Dropping Missing Values

You can exclude missing values from your dataset using the dropna() method.

`>>> frame.dropna()`

This defaults to axis=0, which drops every row that contains at least one NaN value.
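Dropping every row with any NaN can discard a lot of data, so it may be worth knowing the other dropna() parameters. The following sketch compares the default with `how`, `thresh`, and `subset` on a made-up frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0, np.nan, np.nan],
        "B": [4.0, 5.0, np.nan, np.nan, np.nan],
        "C": [7.0, 8.0, 9.0, 10.0, np.nan],
    }
)

# Default: drop a row if any value in it is missing.
rows_complete = frame.dropna()

# how="all": drop a row only if every value in it is missing.
rows_not_empty = frame.dropna(how="all")

# thresh=2: keep rows that have at least two non-missing values.
rows_two_values = frame.dropna(thresh=2)

# subset: only look at column A when deciding what to drop.
rows_a_present = frame.dropna(subset=["A"])

print(rows_complete.index.tolist())
print(rows_not_empty.index.tolist())
print(rows_two_values.index.tolist())
print(rows_a_present.index.tolist())
```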

### b. Replacing Missing Values

To replace each NaN we have in the dataset, we can use the replace() method.

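```
>>> import numpy as np
>>> # np.nan is the supported spelling; the numpy.NaN alias was removed in NumPy 2.0
>>> frame.replace({np.nan:0.00})
```

This way, we can also replace any other value that should not stay in the dataset, such as a placeholder that appears many times.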
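A common case is a dataset that uses a placeholder instead of leaving a cell empty. The sketch below assumes a hypothetical comment-form export in which -999 and "N/A" stand for "no answer", and converts both to NaN so the other cleansing methods can see them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: -999 and "N/A" are placeholders for missing answers.
frame = pd.DataFrame(
    {
        "age": [34, -999, 51],
        "website": ["example.org", "N/A", "N/A"],
    }
)

# A nested dict limits each replacement to the named column.
cleaned = frame.replace(
    {
        "age": {-999: np.nan},
        "website": {"N/A": np.nan},
    }
)

print(cleaned)
print(cleaned.isnull().sum())
```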

### c. Replacing with a Scalar Value

We can use the fillna() method for this.

`>>> frame.fillna(7)`
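A single scalar is rarely right for every column. As a sketch of two alternatives, fillna() also accepts a dictionary with one value per column, or a Series such as the column means. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0],
        "B": [np.nan, 4.0, 8.0],
    }
)

# One fill value per column.
filled_constant = frame.fillna({"A": 0.0, "B": -1.0})

# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NaN values are skipped when averaging).
filled_mean = frame.fillna(frame.mean())

print(filled_constant)
print(filled_mean)
```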

### d. Filling Forward or Backward

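If we supply a method parameter to the fillna() method, we can fill forward or backward as we need. To fill forward, use pad or ffill, and to fill backward, use bfill or backfill. Recent pandas releases deprecate the method parameter in favor of calling frame.ffill() and frame.bfill() directly.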

`>>> frame.fillna(method='pad')`
`>>> frame.fillna(method='backfill')`
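The same fills can be written with the dedicated ffill() and bfill() methods. Here is a small sketch on a one-column frame of made-up values, including the `limit` parameter that caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"B": [1.0, np.nan, np.nan, 4.0, np.nan]},
    index=[1, 2, 3, 4, 5],
)

# Forward fill: each gap takes the last value seen above it.
forward = frame.ffill()

# Backward fill: each gap takes the next value below it.
# The last row stays NaN because nothing follows it.
backward = frame.bfill()

# limit=1 fills at most one consecutive gap per run.
limited = frame.ffill(limit=1)

print(forward)
print(backward)
print(limited)
```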


## 7. Python Data Cleansing – Other Operations

While cleaning data, we may also need to find out more about it and manipulate it. Below, we make use of some of these operations.

```
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
```

`>>> frame.head()`

`>>> frame.tail(3)`

### a. Renaming Columns

To rename a column, you can use the rename() method.

```
>>> frame.rename(columns={'Atomic Number':'Number','Element':'Name'},inplace=True)
>>> frame
```

### b. Making Changes Stay

Apart from the rename() call above, which passed inplace=True, the methods we used throughout this “Python Data Cleansing” tutorial returned new objects and did not modify the original frames. To make a change stay, assign the result back to a variable or set the inplace=True parameter.
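The difference is easy to miss, so here is a short sketch of both options on made-up data. The names `missing_before`, `missing_after`, and `renamed` are ours.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# The return value is discarded here, so frame keeps its NaN.
frame.fillna(0)
missing_before = int(frame["A"].isna().sum())

# Option 1: assign the result back to the name.
frame = frame.fillna(0)
missing_after = int(frame["A"].isna().sum())

# Option 2: pass inplace=True where the method supports it.
renamed = pd.DataFrame({"Atomic Number": [47, 79]})
renamed.rename(columns={"Atomic Number": "Number"}, inplace=True)

print(missing_before, missing_after)
print(list(renamed.columns))
```

Assigning the result back is often the easier style to follow, because each step produces a frame we can inspect.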
So, this was all about the Python Data Cleansing tutorial. We hope you liked our explanation.

## 8. Conclusion

Hence, in this Python Data Cleansing tutorial, we learned how data is cleansed in the Python programming language. For this purpose, we used two libraries: pandas and numpy. Data scientists are often said to spend around 80% of their time cleaning and manipulating data, which makes this an essential skill to learn for data science. Tell us what you think in the comments below.