Train and Test Set in Python Machine Learning – How to Split

1. Objective

In our last session, we discussed Data Preprocessing, Analysis & Visualization in Python ML. Now, in this tutorial, we will learn how to split a CSV file into Train and Test Data in Python Machine Learning. Moreover, we will learn prerequisites and process for Splitting a dataset into Train data and Test set in Python ML.

So, let’s begin How to Train & Test Set in Python Machine Learning.

Train and Test Set in Python Machine Learning - How to Split

Train and Test Set in Python Machine Learning – How to Split

2. Training and Test Data in Python Machine Learning

As we work with datasets, a machine learning algorithm works in two stages. We usually split the data around 20%-80% between testing and training stages. Under supervised learning, we split a dataset into a training data and test data in Python ML.

Train and Test Set in Python Machine Learning

Train and Test Set in Python Machine Learning

a. Prerequisites for Train and Test Data
We will need the following Python libraries for this tutorial- pandas and sklearn.
We can install these with pip-

pip install pandas
pip install sklearn

We use pandas to import the dataset and sklearn to perform the splitting. You can import these packages as-

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris

Do you Know about Python Data File Formats – How to Read CSV, JSON, XLS 

3. How to Split Train and Test Set in Python Machine Learning?

Following are the process of Train and Test set in Python ML. So, let’s take a dataset first.

How to Split Train and Test Set in Python Machine Learning

How to Split Train and Test Set in Python Machine Learning

a. Loading the Dataset

Let’s load the forestfires dataset using pandas.

>>> data=pd.read_csv('forestfires.csv')
>>> data.head()
Train and Test Set in Python Machine Learning

Train and Test Set in Python Machine Learning

b. Splitting

Let’s split this data into labels and features. Now, what’s that? Using features, we predict labels. I mean using features (the data we use to predict labels), we predict labels (the data we want to predict).

>>> y=data.temp
>>> x=data.drop('temp',axis=1)

Temp is a label to predict temperatures in y; we use the drop() function to take all other data in x. Then, we split the data.

>>> x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
>>> x_train.head()
Train and Test Set in Python Machine Learning

Train and Test Set in Python Machine Learning

>>> x_train.shape

(413, 12)
Do you Know How to Work with Relational Database with Python

>>> x_test.head()
Train and Test Set in Python Machine Learning

Train and Test Set in Python Machine Learning

>>> x_test.shape

(104, 12)
The line test_size=0.2 suggests that the test data should be 20% of the dataset and the rest should be train data. With the outputs of the shape() functions, you can see that we have 104 rows in the test data and 413 in the training data.

c. Another Example

Let’s take another example. We’ll use the IRIS dataset this time.

>>> iris=load_iris()
>>> x,y=iris.data,iris.target
>>> x_train,x_test,y_train,y_test=train_test_split(x,y,
train_size=0.5,
test_size=0.5,
random_state=123)
>>> y_test

array([1, 2, 2, 1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 0, 0, 2,
0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 1, 0, 2, 2,
2, 2, 2, 1, 0, 0, 2, 0, 0, 1, 1, 1, 1, 2, 1, 2, 0, 2, 1, 0, 0, 2,
1, 2, 2, 0, 1, 1, 2, 0, 2])

>>> y_train

array([1, 1, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 2, 0, 0, 1, 0,
0, 1, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 0, 0,
1, 2, 2, 2, 2, 0, 1, 0, 1, 1, 0, 1, 2, 1, 2, 2, 0, 1, 0, 2, 2, 1,
1, 2, 2, 1, 0, 1, 1, 2, 2])
Let’s explore Python Machine Learning Environment Setup 

Python Interview Questions

4. Plotting of Train and Test Set in Python

We fit our model on the train data to make predictions on it. Let’s import the linear_model from sklearn, apply linear regression to the dataset, and plot the results.

>>> from sklearn.linear_model import LinearRegression as lm
>>> model=lm.fit(x_train,y_train)
>>> predictions=lm.predict(x_test)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(y_test,predictions)

<matplotlib.collections.PathCollection object at 0x0651CA30>

>>> plt.xlabel('True values')

Text(0.5,0,’True values’)

>>> plt.ylabel('Predictions')

Text(0,0.5,’Predictions’)
Read about Python NumPy – NumPy ndarray & NumPy Array

>>> plt.show()
Train and Test Set in Python Machine Learning

Train and Test Set in Python Machine Learning

0.9396299518034936
So, this was all about Train and Test Set in Python Machine Learning. Hope you like our explanation.

5. Conclusion

Today, we learned how to split a CSV or a dataset into two subsets- the training set and the test set in Python Machine Learning. We usually let the test set be 20% of the entire data set and the rest 80% will be the training set. Furthermore, if you have a query, feel to ask in the comment box.
Related Topic- Python Geographic Maps & Graph Data
For reference

9 Responses

  1. Jeff Yang says:

    In item 4, it’s missing

    from sklearn.linear_model import LinearRegression

    • DataFlair Team says:

      Hello Jeff,
      Thanks for connecting us with Train & Test set in Python Machine Learning. We have made the necessary changes. Hope, you are enjoying our other Python tutorials.
      Keep learning and keep sharing
      DataFlair

  2. amilcar dsilva says:

    >>> model=lm.fit(x_train,y_train)
    >>> predictions=lm.predict(x_test)

    What is “lm” use for?

    • DataFlair Team says:

      Hey Amilcar

      Thanks for the query. In this Python Train and Test, article lm stands for Linear Model. You’ll need to import it from sklearn:

      >>> from sklearn import linear_model as lm

      Hope, it will help you!

      Regards
      DataFlair

  3. carlos2 says:

    in spider need
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()

    • DataFlair Team says:

      Hi Carlos,
      That’s right, we have made the changes to the code. Now, you can learn the train test set in Python ML easily.

  4. simran kaur jolly says:

    ‘module’ object has no attribute ‘fit’

    on running lm.fit i am getting following error.

    • DataFlair Team says:

      Hello Simran,
      Thanks for connecting us through this query. We have made the necessary changes. Now, you can enjoy your learning.

  5. rathankar rao says:

    is it possible to set the test and training set with the same pattern
    Eg: if training test has weight ranging from 50kg to 70kg and that too with a certain frequency distribution, is it possible to have a similar distribution in the test set too

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.