Learn Python Stemming and Lemmatization – Python NLTK

1. Python Stemming and Lemmatization

In this Python stemming tutorial, we will discuss stemming and lemmatization in Python, two basic text-normalization techniques when working with data science. We will look at Python NLTK, work through Python stemming examples, and compare stemming with lemmatization.

So, let’s begin Python Stemming and Lemmatization.

2. Prerequisites for Python Stemming and Lemmatization

For our purpose, we will use the following library:

a. Python NLTK

NLTK is an acronym for Natural Language Toolkit. It is a suite of libraries that lets us perform Natural Language Processing (NLP) on English text with Python, using both symbolic and statistical approaches. It also provides sample data and supports graphical demonstrations.

You can install it using pip:
C:\Users\lifei>pip install nltk
Collecting nltk
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
    100% |████████████████████████████████| 1.4MB 669kB/s
Requirement already satisfied: six in c:\users\lifei\appdata\local\programs\python\python36\lib\site-packages (from nltk) (1.11.0)
Installing collected packages: nltk
  Running setup.py install for nltk ... done
Successfully installed nltk-3.3
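
To confirm that the package is visible to the interpreter you plan to use, a quick check along the following lines can help. This is only a sketch: the helper name is_installed is our own and is not part of NLTK; it relies on importlib from the standard library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Expect True once `pip install nltk` has completed for this interpreter.
    print("nltk installed:", is_installed("nltk"))
```

If this prints False even though pip reported success, pip and python probably belong to different installations.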

3. What is Python Stemming?

Stemming is the act of taking a word and reducing it to a stem. A stem is like a root for a word; for writing, the stem is write. But a stem doesn't always have to be a word: study, studies, and studying all stem to studi, which isn't actually a word.

Because related word forms collapse to the same stem, they can be treated almost like synonyms; this lets us normalize sentences and makes searching for words easier and faster. The stemming algorithms available in Python are mostly rule-based and work by stripping suffixes. The most common is the Porter stemmer, which has been around since 1979.
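
To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny sketch. It is not the Porter algorithm, which has several ordered steps and extra conditions on the stem; the rule table and the names SUFFIX_RULES, MIN_STEM_LENGTH, and toy_stem are our own illustrative assumptions.

```python
# Ordered (suffix, replacement) rules; longer suffixes come first so that
# "ies" is tried before "s". Illustrative only, far simpler than Porter.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

# Refuse to strip a suffix if the remaining stem would be too short.
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Strip the first matching suffix that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


for w in ["studies", "games", "playing", "gaming"]:
    print(f"{w}: {toy_stem(w)}")
# studies: studi
# games: game
# playing: play
# gaming: gam
```

Note that this toy version turns gaming into gam, whereas the Porter stemmer restores the final e and returns game. Handling cases like that is what the additional Porter rules are for.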

a. Python Stemming Individual Words

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> words=['write','writer','writing','writers']
>>> ps=PorterStemmer()
>>> for word in words:
...     print(f"{word}: {ps.stem(word)}")
...

Output:
write: write
writer: writer
writing: write
writers: writer
Now let’s try some more words.

>>> ps.stem('written')
'written'
>>> ps.stem('wrote')
'wrote'
>>> ps.stem('writable')
'writabl'
>>> ps.stem('writes')
'write'

b. Another Example of Python Stemming

Let’s try more words.

>>> ps.stem('game')
'game'
>>> ps.stem('gaming')
'game'
>>> ps.stem('gamed')
'game'
>>> ps.stem('games')
'game'

c. Python Stemming an Entire Sentence

>>> from nltk.tokenize import word_tokenize
>>> nltk.download('punkt')  # tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
>>> sentence='I am enjoying writing this tutorial; I love to write and I have written 266 words so far. I wrote more than you did; I am a writer.'
>>> words=word_tokenize(sentence)
>>> for word in words:
...     print(f"{word}: {ps.stem(word)}")
...

I: I
am: am
enjoying: enjoy
writing: write
this: thi
tutorial: tutori
;: ;
I: I
love: love
to: to
write: write
and: and
I: I
have: have
written: written
266: 266
words: word
so: so
far: far
.: .
I: I
wrote: wrote
more: more
than: than
you: you
did: did
;: ;
I: I
am: am
a: a
writer: writer
.: .
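
The output above shows why stemming helps with search: writing and write share one stem, while written and wrote keep their own. A minimal sketch of grouping words by stem might look like this. The function name group_by_stem and the sample sentence are our own, and a simple regular expression stands in for word_tokenize so that no Punkt download is needed.

```python
import re
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(text: str) -> dict:
    """Map each Porter stem to the set of distinct words that reduce to it."""
    stemmer = PorterStemmer()
    groups = defaultdict(set)
    # Lowercase first, then pull out alphabetic runs as rough word tokens.
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        groups[stemmer.stem(word)].add(word)
    return dict(groups)


sentence = "I love to write; I am writing now, and she writes too."
groups = group_by_stem(sentence)
print(sorted(groups["write"]))  # ['write', 'writes', 'writing']
```

A search for any of these three forms can then be answered by looking up the single key write. Irregular forms such as wrote are not caught this way, which is one motivation for lemmatization.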

4. What is Python Lemmatization?

Lemmatization lets us group together the inflected forms of a word. It maps each form onto a single dictionary form, called the lemma, so that dogs maps to dog and geese maps to goose.
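
Conceptually, a lemmatizer is a lookup into a vocabulary rather than a set of string-chopping rules. The toy sketch below illustrates that idea with a small hand-written table; LEMMA_TABLE, toy_lemmatize, and group_by_lemma are hypothetical names of our own, and NLTK's WordNetLemmatizer consults the WordNet database instead of a table like this.

```python
# A tiny hand-written lemma table, for illustration only.
LEMMA_TABLE = {
    "geese": "goose",
    "feet": "foot",
    "children": "child",
    "cacti": "cactus",
    "dogs": "dog",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)


def group_by_lemma(words):
    """Group the given words under their lemma, keeping input order."""
    groups = {}
    for word in words:
        groups.setdefault(toy_lemmatize(word), []).append(word)
    return groups


print(group_by_lemma(["goose", "geese", "foot", "feet", "dogs"]))
# {'goose': ['goose', 'geese'], 'foot': ['foot', 'feet'], 'dog': ['dogs']}
```

Because every value in the table is a real word, the result of a lookup is always a real word too, unlike a stem.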

a. Python Stemming vs Lemmatization

But how is lemmatization different from stemming? While stemming can create strings that are not actual words, lemmatization will only ever result in words that exist. Lemmas are actual words.

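>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer=WordNetLemmatizer()  # needs the 'wordnet' data: nltk.download('wordnet')
>>> ps.stem('identify')
'identifi'
>>> lemmatizer.lemmatize('identify')
'identify'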

b. Python Lemmatization Examples

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer=WordNetLemmatizer()
>>> nltk.download('wordnet')
>>> lemmatizer.lemmatize('dogs')
'dog'
>>> lemmatizer.lemmatize('geese')
'goose'
>>> lemmatizer.lemmatize('cacti')
'cactus'
>>> lemmatizer.lemmatize('erasers')
'eraser'
>>> lemmatizer.lemmatize('children')
'child'
>>> lemmatizer.lemmatize('feet')
'foot'

c. Using the pos Parameter

>>> lemmatizer.lemmatize('better',pos='a')
'good'

Here, pos is the part-of-speech parameter, which defaults to noun ('n'). Without it, the lemmatizer treats the word as a noun and looks for its noun form.

>>> lemmatizer.lemmatize('redder','a')
'red'

Since lemmatization depends on whether a word is a noun, a verb, an adjective, or an adverb, the lemmatizer needs to know the word's part of speech, which in turn depends on the word's context.
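
In practice, the part of speech usually comes from a tagger rather than being typed by hand. NLTK's pos_tag returns Penn Treebank tags such as 'JJR' or 'VBD', while lemmatize() expects one of the WordNet letters 'n', 'v', 'a', or 'r'. The sketch below shows one common way to bridge the two; the function name penn_to_wordnet_pos is our own, and the tagged pairs are hand-written stand-ins for tagger output.

```python
def penn_to_wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'JJR', 'VBD') to a WordNet pos letter."""
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun; this is also lemmatize()'s default


# Hypothetical, hand-written (word, tag) pairs standing in for tagger output.
tagged = [("better", "JJR"), ("dogs", "NNS"), ("wrote", "VBD")]
for word, tag in tagged:
    print(word, penn_to_wordnet_pos(tag))
# better a
# dogs n
# wrote v
```

The returned letter can then be passed along as lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag)). Falling back to 'n' for every other tag mirrors the default, which is a simplification rather than a rule.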

So, this was all about stemming and lemmatization in Python with NLTK. We hope you liked our explanation.

5. Conclusion

In this Python tutorial, we studied stemming and lemmatization. We also covered NLTK, worked through examples of stemming and lemmatization in Python, and looked at the difference between the two. Tell us what you think about this tutorial in the comments box.

4 Responses

  1. amilcar dsilva says:

    What is the use of the word ‘punkt’ in the code snippet below?
    Thanking you in advance for your explanation.

    >>> from nltk.tokenize import word_tokenize
    >>> nltk.download('punkt')
    >>> sentence='I am enjoying writing this tutorial; I love to write and I have written 266 words so far. I wrote more than you did; I am a writer.'
    >>> words=word_tokenize(sentence)
    >>> for word in words:
    ...     print(f"{word}: {ps.stem(word)}")

    • DataFlair Team says:

      Hello, Amilcar

      Thanks for the appreciation and the comment on Python Stemming. Here, Punkt is a sentence tokenizer that takes text and divides it into a list of sentences. It uses an unsupervised algorithm, so its model has to be trained on a large collection of plaintext before it can be used.

      The purpose of that training is to build a model of sentence-starter words, abbreviations, and collocations. For English, NLTK ships with a pre-trained Punkt tokenizer, and word_tokenize uses it to split the text into sentences before splitting each sentence into words. In German, 'punkt' means 'point'.
      Hope you find it useful!
      Regards,
      DataFlair

  2. nagranjeet says:

    Thanks for the good content on this.
    What is the use of 'wordnet' here?

    • DataFlair says:

      WordNet is a lexical database of English words that contains nouns, verbs, adverbs, and adjectives. After installing NLTK, you have to download the WordNet package to use it.
