Python Data Cleansing by Pandas & Numpy | Python Data Operations
1. Python Data Cleansing – Objective
In our last Python tutorial, we studied Aggregation and Data Wrangling with Python. Today’s Python Data Cleansing tutorial gives a brief introduction to data cleansing operations and how to carry them out in Python. For this purpose, we will use two libraries: pandas and numpy. We will also discuss different ways to cleanse missing data.
So, let’s start the Python Data Cleansing.
2. Python Data Cleansing – Prerequisites
As mentioned earlier, we will need two libraries for Python Data Cleansing – Python pandas and Python numpy.
a. Pandas
Python pandas is an excellent software library for manipulating data and analyzing it. It will let us manipulate numerical tables and time series using data structures and operations.
You can install it using pip-
C:\Users\lifei>pip install pandas
b. Numpy
Python numpy is another library we will use here. It lets us handle arrays and matrices, especially multidimensional ones. It also provides several high-level mathematical functions to help us operate on these.
Use the following command in the command prompt to install Python numpy on your machine-
C:\Users\lifei>pip install numpy
3. Python Data Cleansing Operations on Data using NumPy
Using Python NumPy, let’s create an array (an n-dimensional array).
>>> import numpy as np
>>> np.array(['a','b','c','d','e'],ndmin=2)
array([['a', 'b', 'c', 'd', 'e']], dtype='<U1')
>>> np.array([['a','b'],['c','d','e']],dtype=object)
array([list(['a', 'b']), list(['c', 'd', 'e'])], dtype=object)
>>> np.array(['a','b','c','d','e'],ndmin=1)
array(['a', 'b', 'c', 'd', 'e'], dtype='<U1')
>>> np.array([1,2,7,9,8],dtype=complex)
array([1.+0.j, 2.+0.j, 7.+0.j, 9.+0.j, 8.+0.j])
While dtype tells NumPy which data type to use, ndmin lets us define the minimum number of dimensions.
The following attributes give us information about the array-
>>> a=np.array(['a','b',2,'3.0'])
>>> a
array(['a', 'b', '2', '3.0'], dtype='<U3')
>>> type(a)
<class 'numpy.ndarray'>
>>> a.ndim
1
>>> a.shape
(4,)
>>> a.size
4
>>> a.dtype
dtype('<U3')
We can also perform operations like:
>>> b=np.array([[1,2,3],[4,5,6]])
>>> b
array([[1, 2, 3],
       [4, 5, 6]])
>>> b.flatten()
array([1, 2, 3, 4, 5, 6])
>>> b.reshape(3,2)
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> b[:2,::2]
array([[1, 3],
       [4, 6]])
>>> b-4
array([[-3, -2, -1],
       [ 0,  1,  2]])
>>> b.sum()
21
>>> b-2*b
array([[-1, -2, -3],
       [-4, -5, -6]])
>>> np.sort(np.array([[3,2,1],[5,2,4]]))
array([[1, 2, 3],
       [2, 4, 5]])
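Since our topic is cleansing, it may help to see how NumPy itself can flag and fill missing numbers. The following is only a minimal sketch, assuming missing entries are encoded as np.nan; the array name readings and its values are made up for illustration.

```python
import numpy as np

# Hypothetical readings; np.nan marks the missing entries.
readings = np.array([1.5, np.nan, 3.0, np.nan, 4.5])

# Boolean mask that is True wherever a value is missing.
missing = np.isnan(readings)

# Keep only the values that are present.
present = readings[~missing]

# Mean that skips NaN instead of propagating it.
mean_present = np.nanmean(readings)

# Replace each NaN with that mean.
filled = np.where(missing, mean_present, readings)

print(missing.sum())   # number of missing entries
print(filled)
```

Filling with the mean is just one possible choice here; a median or a fixed constant would follow the same pattern.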
4. Python Data Cleansing Operations on Data Using pandas
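Pandas has traditionally offered three types to hold data- DataFrame, Panel, and Series. Panel was deprecated in pandas 0.20 and removed in 0.25, so current releases provide only DataFrame and Series.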
a. DataFrame
Pandas DataFrame is a data structure that holds data in two dimensions- as rows and columns. We have the following syntax-
pandas.DataFrame(data, index, columns, dtype, copy)
Now let’s try an example-
>>> import pandas as pd
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
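To see how the constructor arguments from the syntax line fit together, here is a small sketch built on the same element data. The variable name labels and the column order passed to columns are our own choices for illustration, not part of the original example.

```python
import pandas as pd

data = {'Element': ['Silver', 'Gold', 'Platinum', 'Copper'],
        'Atomic Number': [47, 79, 78, 29]}
labels = ['element 1', 'element 2', 'element 3', 'element 4']

# columns= selects which dictionary keys to use and in what order.
frame = pd.DataFrame(data, index=labels, columns=['Atomic Number', 'Element'])

print(frame.shape)     # (rows, columns)
print(frame.dtypes)    # one dtype per column
print(frame.loc['element 2', 'Element'])   # label-based lookup
```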
b. Panel
Pandas Panel holds data in three dimensions. Etymologically, the term “panel data” is one source of the name pandas. A panel has the following syntax:
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
>>> data={'Red':pd.DataFrame(np.random.randn(4,2)), 'Blue':pd.DataFrame(np.random.randn(4,3))}
>>> pd.Panel(data)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Blue to Red
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2
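On pandas versions that no longer ship Panel, the call above will not run. One common substitute, shown here as a sketch rather than as the method this tutorial relies on, is to stack the frames into a single DataFrame whose outer index level plays the role of the items axis. The level names item and row are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable

data = {'Red': pd.DataFrame(rng.standard_normal((4, 2))),
        'Blue': pd.DataFrame(rng.standard_normal((4, 3)))}

# Stack the frames; the dictionary keys become the outer index level.
stacked = pd.concat(data, names=['item', 'row'])

print(stacked.shape)
print(stacked.loc['Red'])   # recover one item as an ordinary DataFrame
```

Because Red has only two columns, its third column comes back as NaN after stacking, much as the Panel padded the shorter frame to three minor-axis entries.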
c. Series
Pandas Series holds data in one dimension, in a labeled format. The index is the set of axis labels we use.
It has the following syntax-
pandas.Series(data, index, dtype, copy)
Let’s take an example.
>>> data=np.array([1,2,3,3,4])
>>> pd.Series(data)
0    1
1    2
2    3
3    3
4    4
dtype: int32
Let’s take another example.
>>> pd.Series(np.array(['a','c','b']))
0    a
1    c
2    b
dtype: object
Using these data structures, we can manipulate data in many ways-
>>> frame.iloc[0:2,:]
>>> frame.describe()
>>> frame.rank()
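The three calls above are listed without their results, so here is a short sketch of what each one returns, using the element frame from earlier. The variable names first_two, summary and ranks are ours.

```python
import pandas as pd

data = {'Element': ['Silver', 'Gold', 'Platinum', 'Copper'],
        'Atomic Number': [47, 79, 78, 29]}
frame = pd.DataFrame(data, index=['element 1', 'element 2',
                                  'element 3', 'element 4'])

# iloc selects by integer position: first two rows, every column.
first_two = frame.iloc[0:2, :]

# describe() summarises the numeric columns (count, mean, std, quartiles).
summary = frame.describe()

# rank() orders the values within a column; 1.0 is the smallest.
ranks = frame['Atomic Number'].rank()

print(first_two)
print(summary)
print(ranks)
```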
This is all for now; we will learn about the libraries pandas and numpy in their own tutorials.
5. Python Data Cleansing
When some part of our data is missing, for whatever reason, the accuracy of our predictions suffers. In our article on data wrangling and aggregation, we discussed missing data and how to drop it. Let’s see how we can deal with this issue.
Consider a real-world situation like the comment section of our website. The name and email are mandatory, but the input for ‘website’ can be left empty, since some users do not run a website. In ways like this and others, we may end up with missing data in some places. How should we handle it? Let’s find out.
Pandas depicts a missing value as NaN, which is short for Not a Number. For instance, reindexing a frame with the reindex() method fills in NaN for every label that was not in the original index.
>>> frame=pd.DataFrame(np.random.randn(4,3),index=[1,2,4,7],columns=['A','B','C'])
>>> frame.reindex([1,2,3,4,5,6,7])
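As a sketch of what reindex() does here, the snippet below checks which rows end up as NaN. The seeded generator and the names expanded and zero_filled are assumptions added for illustration; fill_value is an optional reindex() parameter for choosing a different placeholder.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded so the sketch is repeatable
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     index=[1, 2, 4, 7], columns=['A', 'B', 'C'])

# Labels 3, 5 and 6 are not in the original index, so their rows are
# created and filled with NaN.
expanded = frame.reindex([1, 2, 3, 4, 5, 6, 7])

# fill_value supplies a different placeholder for the new rows.
zero_filled = frame.reindex([1, 2, 3, 4, 5, 6, 7], fill_value=0.0)

print(expanded)
```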
a. Finding which columns have missing values
In the tutorial on wrangling, we saw how to find out which columns have missing values-
>>> frame=frame.reindex([1,2,3,4,5,6,7])
>>> frame['B'].isnull()
1    False
2    False
3     True
4    False
5     True
6     True
7    False
Name: B, dtype: bool
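The check above looks at a single column. To survey every column at once, something like the following sketch may be useful; the small frame and the names has_missing, columns_with_nan and missing_counts are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [4.0, 5.0, 6.0],
                      'C': [np.nan, np.nan, 9.0]})

# True for each column that contains at least one NaN.
has_missing = frame.isnull().any()

# Names of the affected columns.
columns_with_nan = frame.columns[has_missing].tolist()

# Number of missing values per column.
missing_counts = frame.isnull().sum()

print(columns_with_nan)
print(missing_counts)
```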
6. Ways to Cleanse Missing Data in Python
To cleanse missing data in Python, you can drop the missing values, replace them, fill each NaN with a scalar value, or fill forward or backward.
a. Dropping Missing Values
You can exclude missing values from your dataset using the dropna() method.
>>> frame.dropna()
This defaults to axis=0, which drops every row that contains at least one NaN value.
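dropna() takes a few parameters that control how aggressive the drop is. The sketch below uses a made-up frame to contrast them; the variable names are ours.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan],
                      'B': [4.0, 5.0, 6.0, np.nan],
                      'C': [7.0, 8.0, 9.0, np.nan]})

rows_any = frame.dropna()            # drop rows with at least one NaN
rows_all = frame.dropna(how='all')   # drop rows only if every value is NaN
min_two = frame.dropna(thresh=2)     # keep rows with at least 2 non-NaN values

# axis=1 works on columns instead: drop each column that still has a NaN.
cols_any = rows_all.dropna(axis=1)

print(rows_any)
print(cols_any)
```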
b. Replacing Missing Values
To replace each NaN we have in the dataset, we can use the replace() method.
>>> from numpy import nan
>>> frame.replace({nan:0.00})
In the same way, we can replace any other value, such as a placeholder that recurs throughout the dataset.
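As an illustration of replacing something other than NaN, suppose a dataset uses -999 as a stand-in for missing readings. That sentinel and the frame below are hypothetical; the sketch maps the sentinel to NaN so the other cleansing methods can pick it up.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which -999.0 stands in for a missing reading.
frame = pd.DataFrame({'A': [1.0, -999.0, 3.0],
                      'B': [-999.0, 5.0, 6.0]})

# Map the sentinel to NaN so pandas treats it as missing.
cleaned = frame.replace({-999.0: np.nan})

print(cleaned.isnull().sum())
```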
c. Replacing with a Scalar Value
We can use the fillna() method for this.
>>> frame.fillna(7)
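fillna() also accepts a dictionary or a Series, so each column can get its own fill value. Here is a brief sketch with a made-up frame; the fill values themselves are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [np.nan, 4.0, 8.0]})

same_value = frame.fillna(7)                   # one scalar everywhere
per_column = frame.fillna({'A': 0, 'B': -1})   # a different scalar per column
with_mean = frame.fillna(frame.mean())         # each column's own mean

print(with_mean)
```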
d. Filling Forward or Backward
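If we supply a method parameter to the fillna() method, we can fill forward or backward as we need. To fill forward, use the methods pad or ffill, and to fill backward, use bfill or backfill. Recent pandas releases deprecate the method parameter in favour of calling frame.ffill() and frame.bfill() directly.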
>>> frame.fillna(method='pad')
>>> frame.fillna(method='backfill')
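The dedicated ffill() and bfill() methods do the same filling without the method parameter. The short series below is made up to show which gaps each direction can and cannot close.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = series.ffill()    # carry the last seen value forward
backward = series.bfill()   # pull the next seen value backward

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill leaves the final NaN in place, because there is no later value to copy from.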
7. Python Data Cleansing – Other Operations
While cleaning data, we may also need to find out more about it and manipulate it. Below, we make use of some of these operations.
>>> data={'Element':['Silver','Gold','Platinum','Copper'],'Atomic Number':[47,79,78,29]}
>>> frame=pd.DataFrame(data,index=['element 1','element 2','element 3','element 4'])
>>> frame
>>> frame.head()
>>> frame.tail(3)
a. Renaming Columns
To rename a column, you can use the rename() method.
>>> frame.rename(columns={'Atomic Number':'Number','Element':'Name'},inplace=True)
>>> frame
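rename() also accepts a function, which can be handy when many column labels need the same clean-up. The following sketch assumes labels with stray spaces and mixed case; the frame and the lambda are illustrative only.

```python
import pandas as pd

# Hypothetical frame whose labels carry stray spaces and capitals.
frame = pd.DataFrame({' Atomic Number ': [47, 79],
                      'Element': ['Silver', 'Gold']})

# The function is applied to every column label.
tidy = frame.rename(columns=lambda name: name.strip().lower().replace(' ', '_'))

print(list(tidy.columns))
```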
b. Making Changes Stay
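Also, most of the methods we called throughout this tutorial “Python Data Cleansing” return a new object and leave the original frame unchanged. To keep a change, either assign the result back to a variable or, where the method supports it, set the inplace=True parameter as we did with rename().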
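The difference is easy to miss, so here is a small sketch of a discarded result versus one that is kept. The frame and the names still_missing and now_missing are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

frame.fillna(0)                  # returns a new frame; the result is discarded
still_missing = frame['A'].isnull().sum()

frame = frame.fillna(0)          # assigning the result back keeps the change
now_missing = frame['A'].isnull().sum()

print(still_missing, now_missing)
```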
So, this was all about Python Data Cleansing Tutorial. Hope you like our explanation.
8. Conclusion
Hence, in this Python Data Cleansing tutorial, we learned how data is cleansed in the Python programming language. For this purpose, we used two libraries- pandas and numpy. Data scientists are often said to spend around 80% of their time cleaning and manipulating data, which makes this an essential skill to learn alongside data science. Tell us what you think in the comments below.