Python Data Cleansing by Pandas & Numpy | Python Data Operations


1. Python Data Cleansing – Objective

In our last Python tutorial, we studied Aggregation and Data Wrangling with Python. Today's Python Data Cleansing tutorial gives a brief introduction to data cleansing operations and to handling your data in Python. For this purpose, we will use two libraries: pandas and numpy. We will also discuss different ways to cleanse missing data.
So, let's start with Python Data Cleansing.


2. Python Data Cleansing – Prerequisites

As mentioned earlier, we will need two libraries for Python Data Cleansing – Python pandas and Python numpy.

a. Pandas

Python pandas is a software library for manipulating and analyzing data. It provides data structures and operations for working with numerical tables and time series.


You can install it using pip-

C:\Users\lifei>pip install pandas


b. Numpy

Python numpy is another library we will use here. It lets us handle arrays and matrices, especially multidimensional ones. It also provides several high-level mathematical functions to operate on them.


Use the following command in the command prompt to install Python numpy on your machine-

C:\Users\lifei>pip install numpy

3. Python Data Cleansing Operations on Data using NumPy

Using Python NumPy, let’s create an array (an n-dimensional array).

>>> import numpy as np
>>> np.array(['a','b','c','d','e'],ndmin=2)

array([['a', 'b', 'c', 'd', 'e']], dtype='<U1')

>>> np.array([['a','b'],['c','d','e']],dtype=object)  # rows of unequal length need dtype=object in NumPy 1.24+

array([list(['a', 'b']), list(['c', 'd', 'e'])], dtype=object)

>>> np.array(['a','b','c','d','e'],ndmin=1)

array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')

>>> np.array([1,2,7,9,8],dtype=complex)

array([1.+0.j, 2.+0.j, 7.+0.j, 9.+0.j, 8.+0.j])
While dtype tells NumPy which data type to use for the elements, ndmin lets us define the minimum number of dimensions.
The following attributes give us information about an array-

>>> a=np.array(['a','b',2,'3.0'])
>>> a

array(['a', 'b', '2', '3.0'], dtype='<U3')

>>> type(a)
<class 'numpy.ndarray'>
>>> a.ndim
1
>>> a.shape
(4,)
>>> a.size
4
>>> a.dtype
dtype('<U3')
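
As a minimal sketch of how dtype and ndmin show up in these attributes (the variable names row and z are illustrative, not from the tutorial):

```python
import numpy as np

# ndmin pads the shape with leading axes until the array has that many dimensions
row = np.array([1, 2, 7, 9, 8], ndmin=2)
# dtype forces the element type instead of letting NumPy infer it
z = np.array([1, 2, 7, 9, 8], dtype=complex)

print(row.ndim, row.shape, row.size)  # 2 (1, 5) 5
print(z.dtype)                        # complex128
```

Here ndim counts the axes, shape gives the length of each axis, and size is the total number of elements.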

We can also perform operations like:

>>> b=np.array([[1,2,3],[4,5,6]])
>>> b

array([[1, 2, 3],
       [4, 5, 6]])

>>> b.flatten()

array([1, 2, 3, 4, 5, 6])

>>> b.reshape(3,2)

array([[1, 2],
       [3, 4],
       [5, 6]])

>>> b[:2,::2]

array([[1, 3],
       [4, 6]])

>>> b-4

array([[-3, -2, -1],
       [ 0,  1,  2]])

>>> b.sum()
21

>>> b-2*b

array([[-1, -2, -3],
       [-4, -5, -6]])

>>> np.sort(np.array([[3,2,1],[5,2,4]]))

array([[1, 2, 3],
       [2, 4, 5]])
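
Several of these operations also accept an axis argument, which the examples above do not show. The following sketch, with illustrative variable names, reuses the same array to show what each call does and how axis changes the result:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

flat = b.flatten()          # copy collapsed to one dimension
reshaped = b.reshape(3, 2)  # same six values, laid out as 3 rows x 2 columns
picked = b[:2, ::2]         # first two rows, every second column
shifted = b - 4             # the scalar is broadcast to every element

total = b.sum()             # sum over all elements
col_sums = b.sum(axis=0)    # one sum per column
row_sums = b.sum(axis=1)    # one sum per row

# np.sort works along the last axis by default, so each row is sorted on its own
row_sorted = np.sort(np.array([[3, 2, 1], [5, 2, 4]]))

print(total, col_sums, row_sums)  # 21 [5 7 9] [ 6 15]
```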

4. Python Data Cleansing Operations on Data Using pandas

pandas has historically offered three types to hold data- DataFrame, Panel, and Series. Of these, only DataFrame and Series exist in current versions.


a. DataFrame

Pandas DataFrame is a data structure that holds data in two dimensions- as rows and columns. We have the following syntax-

pandas.DataFrame(data, index, columns, dtype, copy)

Now let’s try an example-

>>> import pandas as pd
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame

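
As a small sketch of the columns parameter from the syntax above, the same data can be built with its columns in a chosen order and then queried by label (the variable name labels is illustrative):

```python
import pandas as pd

data = {'Element': ['Silver', 'Gold', 'Platinum', 'Copper'],
        'Atomic Number': [47, 79, 78, 29]}
labels = ['element 1', 'element 2', 'element 3', 'element 4']

# columns= selects and orders the dictionary keys to keep
frame = pd.DataFrame(data, index=labels, columns=['Atomic Number', 'Element'])

print(frame.shape)                        # (4, 2)
print(frame.loc['element 2', 'Element'])  # Gold
print(frame['Atomic Number'].max())       # 79
```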


b. Panel

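A pandas Panel holds data in three dimensions. The term panel data is one source of the name pandas. Panel was deprecated in pandas 0.20 and removed in 0.25, so the example below runs only on older versions; the pandas documentation recommends a DataFrame with a MultiIndex, or the xarray package, as a replacement. A Panel had the following syntax: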

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

>>> data={'Red':pd.DataFrame(np.random.randn(4,2)),
	'Blue':pd.DataFrame(np.random.randn(4,3))}
>>> pd.Panel(data)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Blue to Red
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
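
On pandas versions that no longer include Panel, one way to hold the same kind of data is a DataFrame with a MultiIndex. This is only a sketch of that approach, not the tutorial's own method; the level names item and row and the seeded generator are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
frames = {'Red': pd.DataFrame(rng.standard_normal((4, 2))),
          'Blue': pd.DataFrame(rng.standard_normal((4, 3)))}

# Stack the frames; the dictionary keys become the outer index level
stacked = pd.concat(frames, names=['item', 'row'])

print(stacked.shape)             # (8, 3)
print(stacked.loc['Red'].shape)  # (4, 3); column 2 is all NaN for 'Red'
```

Because 'Red' has only two columns, its third column is filled with NaN, just as the Panel padded the shorter frame to three minor-axis entries.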

c. Series

Pandas Series holds data in one dimension, in a labeled format. The index is the set of axis labels we use.
It has the following syntax-

pandas.Series(data, index, dtype, copy)

Let’s take an example.

>>> data=np.array([1,2,3,3,4])
>>> pd.Series(data)

0    1
1    2
2    3
3    3
4    4
dtype: int32
Let’s take another example.

>>> pd.Series(np.array(['a','c','b']))

0    a
1    c
2    b
dtype: object
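
Both examples rely on the default index 0, 1, 2, and so on. As a brief sketch of the index parameter from the syntax above (the element names are reused from the DataFrame example purely for illustration):

```python
import pandas as pd

# index= supplies the axis labels; without it pandas uses 0..n-1
s = pd.Series([47, 79, 78], index=['Silver', 'Gold', 'Platinum'])

print(s['Gold'])      # 79
print(list(s.index))  # ['Silver', 'Gold', 'Platinum']
```

The integer dtype that pandas reports for such a Series (int32 in the output above) depends on the platform and the NumPy version, so int64 is equally likely to appear.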
Using these data structures, we can manipulate data in many ways-

>>> frame.iloc[0:2,:]


>>> frame.describe()


>>> frame.rank()

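
The three calls above can be tried on the element table from the DataFrame section. This sketch rebuilds that frame so it runs on its own, and ranks a single column to keep the result purely numeric:

```python
import pandas as pd

frame = pd.DataFrame({'Element': ['Silver', 'Gold', 'Platinum', 'Copper'],
                      'Atomic Number': [47, 79, 78, 29]},
                     index=['element 1', 'element 2', 'element 3', 'element 4'])

first_two = frame.iloc[0:2, :]         # rows by position, end excluded
stats = frame.describe()               # summarizes numeric columns by default
ranks = frame['Atomic Number'].rank()  # 1.0 marks the smallest value

print(first_two.index.tolist())            # ['element 1', 'element 2']
print(stats.loc['mean', 'Atomic Number'])  # 58.25
print(ranks.tolist())                      # [2.0, 4.0, 3.0, 1.0]
```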

This is all for now; we will learn about the libraries pandas and numpy in their own tutorials.

5. Python Data Cleansing

When some part of our data is missing, for whatever reason, the accuracy of our predictions can drop sharply. In our article on data wrangling and aggregation, we discussed missing data and how to drop it. Let's see what else we can do about it.

Consider a real-world situation like the comment section of our website: the name and email are mandatory, but the input for 'website' can be left empty, since some users do not run a website. In ways like this and others, we may end up with missing data in some places. How should we deal with this? Let's find out.
pandas depicts a missing value as NaN, which is short for Not a Number. For example, calling reindex() with labels that are not in the original index creates rows for them filled with NaN.

>>> frame=pd.DataFrame(np.random.randn(4,3),index=[1,2,4,7],columns=['A','B','C'])
>>> frame.reindex([1,2,3,4,5,6,7])

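
To make the effect of reindex() easy to see, this sketch uses a frame of ones instead of random numbers (an assumption made only so the result is predictable) and lists the labels that end up as NaN rows:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.ones((4, 3)), index=[1, 2, 4, 7], columns=['A', 'B', 'C'])
wider = frame.reindex([1, 2, 3, 4, 5, 6, 7])  # returns a new frame

# Labels 3, 5 and 6 were not in the original index, so their rows are all NaN
print(wider.index[wider['A'].isna()].tolist())  # [3, 5, 6]
print(len(frame), len(wider))                   # 4 7
```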

a. Finding which columns have missing values

In the tutorial on wrangling, we saw how to check a column for missing values-

>>> frame=frame.reindex([1,2,3,4,5,6,7])
>>> frame['B'].isnull()

1    False
2    False
3     True
4    False
5     True
6     True
7    False
Name: B, dtype: bool
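
The call above checks one column row by row. To see which columns contain missing values at all, isnull() can be combined with any() or sum(). A minimal sketch on a small hand-made frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [np.nan, np.nan, 6.0],
                      'C': [7.0, 8.0, 9.0]})

has_missing = frame.isnull().any()    # True for each column with at least one NaN
missing_count = frame.isnull().sum()  # number of NaN values per column

print(has_missing[has_missing].index.tolist())  # ['A', 'B']
print(missing_count.tolist())                   # [1, 2, 0]
```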

6. Ways to Cleanse Missing Data in Python

To cleanse missing data in Python, you can drop the missing values, replace them with replace(), fill each NaN with a scalar value, or fill forward or backward.


a. Dropping Missing Values

You can exclude missing values from your dataset using the dropna() method.

>>> frame.dropna()


By default, this works on axis=0 and drops every row that contains at least one NaN value.
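
dropna() takes a few parameters that change what gets dropped. The sketch below, on a small illustrative frame, compares the default with axis=1, how='all', and subset:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [np.nan, np.nan, 6.0],
                      'C': [7.0, 8.0, 9.0]})

rows_complete = frame.dropna()          # keep rows with no NaN at all
cols_complete = frame.dropna(axis=1)    # keep columns with no NaN at all
rows_not_empty = frame.dropna(how='all')  # drop a row only if every value is NaN
rows_by_a = frame.dropna(subset=['A'])  # look only at column A

print(rows_complete.index.tolist())  # [2]
print(list(cols_complete.columns))   # ['C']
print(len(rows_not_empty))           # 3
print(rows_by_a.index.tolist())      # [0, 2]
```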

b. Replacing Missing Values

To replace each NaN we have in the dataset, we can use the replace() method.

>>> from numpy import nan  # the NaN alias was removed in NumPy 2.0
>>> frame.replace({nan:0.00})


In the same way, replace() can substitute any other value in the dataset, not just NaN.
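
A short sketch of both directions: turning NaN into a number, and turning a placeholder value into NaN. The sentinel -999.0 is a made-up example of such a placeholder, not something from the tutorial's data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, -999.0], 'B': [np.nan, 5.0, 6.0]})

zeroed = frame.replace({np.nan: 0.0})    # every NaN becomes 0.0
cleaned = frame.replace(-999.0, np.nan)  # the hypothetical sentinel becomes NaN

print(zeroed['A'].tolist())        # [1.0, 0.0, -999.0]
print(cleaned['A'].isna().sum())   # 2
```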

c. Replacing with a Scalar Value

To replace every NaN with a single scalar value, we can use the fillna() method.

>>> frame.fillna(7)

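
Besides a single scalar, fillna() also accepts a dictionary of per-column values or a Series such as the column means. A minimal sketch on illustrative data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 4.0, 8.0]})

same = frame.fillna(7)                      # one value for every NaN
per_col = frame.fillna({'A': 0, 'B': -1})   # a different value per column
by_mean = frame.fillna(frame.mean())        # each column's own mean

print(same['B'].tolist())     # [7.0, 4.0, 8.0]
print(per_col['A'].tolist())  # [1.0, 0.0, 3.0]
print(by_mean['B'].tolist())  # [6.0, 4.0, 8.0]
```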

d. Filling Forward or Backward

If we supply a method parameter to the fillna() method, we can fill forward or backward as we need. To fill forward, use the methods pad or ffill, and to fill backward, use bfill or backfill. From pandas 2.1 onward the method parameter of fillna() is deprecated; call frame.ffill() or frame.bfill() instead.

>>> frame.ffill()  # older pandas: frame.fillna(method='pad')

>>> frame.bfill()  # older pandas: frame.fillna(method='backfill')
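
Filling forward copies the last valid value downward, and filling backward copies the next valid value upward. The sketch below uses a small illustrative frame to show both, along with the limit parameter that caps how many consecutive NaN values are filled:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, np.nan, 4.0],
                      'B': [np.nan, 2.0, np.nan, np.nan]})

forward = frame.ffill()          # a leading NaN stays, nothing precedes it
backward = frame.bfill()         # trailing NaN values stay, nothing follows them
limited = frame.ffill(limit=1)   # fill at most one NaN per gap

print(forward['A'].tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward['A'].tolist())  # [1.0, 4.0, 4.0, 4.0]
print(forward['B'].isna().tolist())   # [True, False, False, False]
print(backward['B'].isna().tolist())  # [False, False, True, True]
print(limited['A'].isna().tolist())   # [False, False, True, False]
```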


7. Python Data Cleansing – Other Operations

While cleaning data, we may also need to find out more about it and manipulate it. Below, we make use of some of these operations.

>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame


>>> frame.head()


>>> frame.tail(3)

a. Renaming Columns

To rename a column, you can use the rename() method.

>>> frame.rename(columns={'Atomic Number':'Number','Element':'Name'},inplace=True)
>>> frame


b. Making Changes Stay

Also, note that most of the methods used in this tutorial, such as dropna(), fillna() and replace(), return a new object and leave the original frame unchanged. To make a change stay, either assign the result back to a variable or pass inplace=True, as we did with rename() above.
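
A minimal sketch of the difference, using rename() on a cut-down version of the element table (the variable name renamed is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'Element': ['Silver', 'Gold'], 'Atomic Number': [47, 79]})

# Without inplace=True, rename() returns a new frame and leaves the original alone
renamed = frame.rename(columns={'Atomic Number': 'Number'})
print(list(frame.columns))    # ['Element', 'Atomic Number']
print(list(renamed.columns))  # ['Element', 'Number']

# Assigning the result back to the same name keeps the change
frame = frame.rename(columns={'Atomic Number': 'Number'})
print(list(frame.columns))    # ['Element', 'Number']
```

Assigning the result back is often preferred over inplace=True because it lets calls be chained, but either way persists the change.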
So, that was all for this Python Data Cleansing tutorial. We hope you liked our explanation.

8. Conclusion

Hence, in this Python Data Cleansing tutorial, we learned how data is cleansed in the Python programming language. For this purpose, we used two libraries- pandas and numpy. Data scientists are often said to spend as much as 80% of their time cleaning and manipulating data, which makes this an essential skill for data science. Tell us what you think in the comments below.