Train and Test Set in Python Machine Learning – How to Split

1. Objective

In our last session, we discussed Data Preprocessing, Analysis & Visualization in Python ML. In this tutorial, we will learn how to split a CSV file into train and test data in Python Machine Learning. We will also cover the prerequisites and the process for splitting a dataset into a training set and a test set in Python ML.

So, let’s begin with how to split a train and test set in Python Machine Learning.

2. Training and Test Data in Python Machine Learning

When we work with datasets, a machine learning algorithm works in two stages: training and testing. We usually split the data about 80%-20% between the training and testing stages, so that the model is evaluated on rows it did not see during training. Under supervised learning, we split a dataset into training data and test data in Python ML.
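
To make the 80%-20% idea concrete, here is a minimal sketch of a manual split using NumPy. The helper name manual_split and its parameters are hypothetical, introduced only for illustration; the train_test_split function used later in this tutorial handles the shuffling and slicing for us.

```python
import numpy as np

def manual_split(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled (train_indices, test_indices) for n_samples rows."""
    rng = np.random.default_rng(seed)
    # Shuffle the row indices so the test rows are chosen at random.
    indices = rng.permutation(n_samples)
    # Round the test size up to a whole number of rows.
    n_test = int(np.ceil(test_fraction * n_samples))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = manual_split(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 8 2
```

Every row lands in exactly one of the two sets, which is the property that makes the test score an honest check.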

a. Prerequisites for Train and Test Data
We will need the following Python libraries for this tutorial: pandas and scikit-learn (imported as sklearn). The plotting example in section 4 also uses matplotlib.
We can install these with pip:

pip install pandas
pip install scikit-learn
pip install matplotlib

We use pandas to import the dataset and sklearn to perform the splitting. You can import these packages as follows:

>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.datasets import load_iris


3. How to Split Train and Test Set in Python Machine Learning?

The following is the process for creating a train and test set in Python ML. So, let’s take a dataset first.

a. Loading the Dataset

Let’s load the forestfires dataset using pandas.

>>> data=pd.read_csv('forestfires.csv')
>>> data.head()

b. Splitting

Let’s split this data into labels and features. Now, what’s that? Features are the columns we use to make a prediction, and labels are the values we want to predict.

>>> y=data.temp
>>> x=data.drop('temp',axis=1)

Here, temp is the label, so y holds the temperatures we want to predict; we use the drop() function to put all the other columns in x. Then, we split the data.

>>> x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
>>> x_train.head()

>>> x_train.shape

(413, 12)

>>> x_test.head()

>>> x_test.shape

(104, 12)

The argument test_size=0.2 means that the test data should be 20% of the dataset and the rest should be train data. From the shape attributes, you can see that we have 104 rows in the test data and 413 in the training data.
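
The row counts above follow from how test_size is rounded: 20% of 517 rows is 103.4, which is rounded up to 104 test rows, leaving 413 for training. A small sketch, assuming a synthetic stand-in for forestfires.csv (517 rows and 13 numeric columns; every column name other than temp is made up), reproduces the same shapes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real file: 517 rows, 13 numeric columns.
# Only the 'temp' column name comes from the tutorial; the rest are placeholders.
rng = np.random.default_rng(0)
columns = ["temp"] + [f"feature_{i}" for i in range(12)]
data = pd.DataFrame(rng.random((517, 13)), columns=columns)

y = data["temp"]               # label: the column we want to predict
x = data.drop("temp", axis=1)  # features: every other column

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

print(x_train.shape, x_test.shape)  # (413, 12) (104, 12)
```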

c. Another Example

Let’s take another example. We’ll use the IRIS dataset this time.

>>> iris=load_iris()
>>> x,y=iris.data,iris.target
>>> x_train,x_test,y_train,y_test=train_test_split(x,y,
train_size=0.5,
test_size=0.5,
random_state=123)
>>> y_test

array([1, 2, 2, 1, 0, 2, 1, 0, 0, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 0, 0, 2,
0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 1, 0, 2, 2,
2, 2, 2, 1, 0, 0, 2, 0, 0, 1, 1, 1, 1, 2, 1, 2, 0, 2, 1, 0, 0, 2,
1, 2, 2, 0, 1, 1, 2, 0, 2])

>>> y_train

array([1, 1, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 1, 0, 1, 2, 0, 2, 0, 0, 1, 0,
0, 1, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 0, 0,
1, 2, 2, 2, 2, 0, 1, 0, 1, 1, 0, 1, 2, 1, 2, 2, 0, 1, 0, 2, 2, 1,
1, 2, 2, 1, 0, 1, 1, 2, 2])
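
The random_state argument fixes the shuffle, so the same call returns the same split every time. The sketch below checks this. It also shows the optional stratify argument of train_test_split, which the example above does not use; it keeps the class proportions the same in the training and test sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target

# Two calls with the same random_state produce identical splits.
first = train_test_split(x, y, test_size=0.5, random_state=123)
second = train_test_split(x, y, test_size=0.5, random_state=123)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True

# stratify=y keeps the three Iris classes balanced in both halves.
_, _, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.5, random_state=123, stratify=y)
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print(train_counts, test_counts)  # [25 25 25] [25 25 25]
```

Without stratify, the class counts in each half can drift, as the uneven y_test and y_train arrays above show.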

4. Plotting of Train and Test Set in Python

We fit our model on the train data and then use it to make predictions on the test data. Continuing with the Iris split from the previous example, let’s import LinearRegression from sklearn.linear_model, apply linear regression to the dataset, and plot the results.

>>> from sklearn.linear_model import LinearRegression as lm
>>> model=lm().fit(x_train,y_train)
>>> predictions=model.predict(x_test)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(y_test,predictions)

<matplotlib.collections.PathCollection object at 0x0651CA30>

>>> plt.xlabel('True values')

Text(0.5,0,'True values')

>>> plt.ylabel('Predictions')

Text(0,0.5,'Predictions')

>>> plt.show()

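The value below is most likely the R² score of the model on the test data, as returned by model.score(x_test,y_test):

0.9396299518034936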
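
Putting the steps of this section together, here is a self-contained sketch on the Iris split. It assumes the score is computed with model.score(), which returns R² for a regression model; it skips the plot and cross-checks the score against r2_score from sklearn.metrics.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.5, test_size=0.5, random_state=123)

# Fit on the training rows only.
model = LinearRegression().fit(x_train, y_train)

# Predict from the test features; y_test is kept aside for comparison.
predictions = model.predict(x_test)

# score() predicts on x_test internally and compares against y_test.
score = model.score(x_test, y_test)
print(score)
print(r2_score(y_test, predictions))  # same value as score
```

Note that linear regression treats the Iris class labels 0, 1, and 2 as plain numbers, so this demonstrates the workflow rather than a proper classifier.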
So, this was all about Train and Test Set in Python Machine Learning. We hope you liked our explanation.

5. Conclusion

Today, we learned how to split a CSV or a dataset into two subsets, the training set and the test set, in Python Machine Learning. We usually let the test set be 20% of the entire dataset, and the remaining 80% will be the training set. Furthermore, if you have a query, feel free to ask in the comment box.

18 Responses

  1. Jeff Yang says:

    In item 4, it’s missing

    from sklearn.linear_model import LinearRegression

    • DataFlair Team says:

      Hello Jeff,
      Thanks for connecting us with Train & Test set in Python Machine Learning. We have made the necessary changes. Hope, you are enjoying our other Python tutorials.
      Keep learning and keep sharing
      DataFlair

  2. amilcar dsilva says:

    >>> model=lm.fit(x_train,y_train)
    >>> predictions=lm.predict(x_test)

    What is “lm” use for?

    • DataFlair Team says:

      Hey Amilcar

      Thanks for the query. In this Python Train and Test article, lm stands for Linear Model. You’ll need to import it from sklearn:

      >>> from sklearn import linear_model as lm

      Hope, it will help you!

      Regards
      DataFlair

  3. carlos2 says:

    in spider need
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()

    • DataFlair Team says:

      Hi Carlos,
      That’s right, we have made the changes to the code. Now, you can learn the train test set in Python ML easily.

  4. simran kaur jolly says:

    ‘module’ object has no attribute ‘fit’

    on running lm.fit i am getting following error.

    • DataFlair Team says:

      Hello Simran,
      Thanks for connecting us through this query. We have made the necessary changes. Now, you can enjoy your learning.

  5. rathankar rao says:

    is it possible to set the test and training set with the same pattern
    Eg: if training test has weight ranging from 50kg to 70kg and that too with a certain frequency distribution, is it possible to have a similar distribution in the test set too

  6. Kun Thaung says:

    I just found the error in you post.
    i learn from this post.

    model=lm.fit(x_train,y_train)
    there is an error in this model.
    we have to use lm().fit(x_train,y_train)

    >>> model=lm.fit(x_train,y_train)
    it is error to use lm in this predict here
    >>> predictions=lm.predict(x_test)
    we should write the code
    predictions=model.predict(x_test)

    i had fixed like this to get our output correctly
    Thank you for this post

    • DataFlair Team says:

      Hi Kun Thaung

      Thank you for pointing it out! Careful readers like you help make our content accurate and flawless for many others to follow. We have made the necessary corrections in the text.

  7. Tushar Kadam says:

    I have been given a task to predict the missing ratings. I have done that using the cosine similarity and some functions used in collaborative recommendations.
    So, now I have two datasets.
    1. The training set which was already 80% of the original data.
    2. The test data set which is 20% and the non-zero ratings are available.
    Now, I want to calculate the RMSE between the available ratings in test set and the predicted ratings in training dataset. Please guide me how should I proceed.

    The testdata set and train data set are nothing but the data of user*item matrix. Where indexes of the rows represent the users and indexes of the column represent the items.

  8. YUVAKUMAR R says:

    hi
    am getting the error “ValueError: could not convert string to float: ‘sep'” against the line “model = lm().fit(x_train, y_train)”. Can you pls help . I have imported all required packages, and am using pycharm ide.

    Thanks

    • DataFlair Team says:

      Hello Yuvakumar R,
      Maybe you have issues with your dataset- like missing values. Try downloading the forestfires dataset from Kaggle and run the code again, it should work. Or maybe you’re missing a step?

  9. Sudhanshu varun says:

    Hello
    Can you please tell me how i can use this sklearn for training python with another language i have the dataset need i am not able to understand how do i split it into test and train dataset.

  10. Francine says:

    thank you for your post, it helps more. but i have a question, why we predict on x_test i think we can predict on y_test? is it the same? please help me .

    thanks

    • DataFlair Team says:

      Hi Francine

      Thanks for commenting. x_test is the test data set and y_test is the set of labels to the data in x_test. We cannot predict on y_test- only on x_test.
