Improve your Resume Match score with job description using Python and NLTK

Created by Sriharsha Velicheti in Articles 4 Feb 2024

Today, we're going to build a simple Streamlit application using Python and NLTK that takes the text of a job description and your resume. When you click the Match button, magic happens: it reveals what percentage of your resume matches the job description, tells you how good that match is, and, if necessary, suggests some keywords to add. Isn't this awesome?


If you're excited and think it sounds cool, let's get started building our project with Python, Streamlit, NLTK, and scikit-learn.




Let’s Begin…

Before jumping straight into the code, let me give you a high-level overview of our project. You can break the entire project into three parts, as follows:

Match Percentage calculation

Keyword Extraction from JD

Streamlit for UI

Let's tackle them one by one. First, we will import all the necessary libraries to make our job easier:



import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


  1. Streamlit (streamlit): Streamlit is a Python library used to create web applications for data science and machine learning projects. It simplifies web app development, enabling data professionals to present their work interactively without extensive front-end knowledge.
  2. Scikit-learn (sklearn): Scikit-learn is a comprehensive machine learning library in Python. It offers various tools for data analysis, model building, and evaluation. Here we use sklearn for the following:

CountVectorizer (from sklearn.feature_extraction.text):

Converts text data into numerical vectors, tallying word occurrences, and enabling analysis in machine learning models.

Cosine Similarity (from sklearn.metrics.pairwise):

Measures the similarity between two numerical vectors, often used to compare the resemblance of text documents.

  3. NLTK (nltk): NLTK, or the Natural Language Toolkit, is a potent Python package for tasks related to natural language processing (NLP). It offers features for part-of-speech tagging, tokenization, stemming, and more. It is essential for text processing and analysis, facilitating activities like text mining, sentiment analysis, and language interpretation. Here we use NLTK for the following:

NLTK Stopwords: NLTK's stopwords module provides lists of frequently used words that are removed from text during preprocessing. Eliminating these stopwords helps the analysis concentrate on the words that actually carry meaning.


Word tokenization (NLTK Tokenize): The word tokenization function in NLTK divides text into discrete words. It's essential for breaking up text into manageable chunks, making text analysis easier by enabling the alteration and scrutiny of specific words or phrases.
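Before moving on, here is a tiny sketch of how CountVectorizer and cosine_similarity work together on two made-up strings; this is exactly the pattern our match function will use later. (Printing the vocabulary uses get_feature_names_out, which needs scikit-learn 1.0 or newer.)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two toy documents, invented purely for illustration
docs = ["data analysis with python", "python for data science"]

vec = CountVectorizer()
vectors = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())
# ['analysis' 'data' 'for' 'python' 'science' 'with']
print(vectors)
# [[1 1 0 1 0 1]
#  [0 1 1 1 1 0]]
print(cosine_similarity(vectors)[0, 1])
# ~0.5, since the two strings share "data" and "python"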


That's it for the introduction of the modules we use in this project. Now let's set up the environment, fix common environment issues you may hit with these libraries, and download the NLTK data packages:

nltk.download('punkt')
nltk.download('stopwords')

os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'



  1. nltk.download('punkt'): Downloads NLTK's Punkt tokenizer models, essential for tokenizing sentences into words or tokens.
  2. nltk.download('stopwords'): Fetches NLTK's collection of stopwords, common words often excluded during text analysis to focus on meaningful content.
  3. os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python': Tells the protocol buffers library (which Streamlit uses to serialize structured data) to fall back to its pure-Python implementation, a common workaround for protobuf-related environment errors.
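As a small optional refinement (my own addition, not required by the app), you can avoid re-downloading the NLTK data on every run by checking whether it is already installed:

import nltk

# Download the NLTK data only if it isn't already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')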

If you are confused by terms like tokenizing and stopwords, I will explain them briefly in the next few paragraphs. If you are already comfortable with them, you can skip ahead.

Tokenizing: Tokenizing refers to the process of breaking down a piece of text into smaller units, usually words or sentences, called tokens. It helps in organizing and analyzing textual data at a more granular level.

Example: Consider the sentence: “The quick brown fox jumps over the lazy dog.” Tokenizing this sentence would result in individual words being extracted as tokens:[“The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”]

Each word in the sentence becomes a separate token, making it easier to analyze or process.

Stopwords: Stopwords are common words in a language that often don’t carry significant meaning in a specific context and are typically filtered out during text analysis to focus on more meaningful words.

Example: In English, stopwords may include words like “the,” “is,” “at,” “and,” etc. Consider the sentence: “The art of simplicity is a puzzle of complexity.” When stopwords are removed, the sentence focuses on essential words:

Original: “The art of simplicity is a puzzle of complexity.”

After removing stopwords: “art simplicity puzzle complexity.”
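Here is a short sketch that reproduces both examples with NLTK (it assumes the punkt and stopwords packages downloaded earlier):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "The art of simplicity is a puzzle of complexity."

# Tokenize (lowercased) -- note the final period becomes its own token
tokens = word_tokenize(sentence.lower())
# ['the', 'art', 'of', 'simplicity', 'is', 'a', 'puzzle', 'of', 'complexity', '.']

# Drop stopwords and non-alphabetic tokens
stop_words = set(stopwords.words('english'))
print([t for t in tokens if t not in stop_words and t.isalpha()])
# ['art', 'simplicity', 'puzzle', 'complexity']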

Let's go back to our main workflow.

Creating the Match Function:

Here we are going to write the function that calculates the match percentage between the two texts:

def calculate_match_percentage(text1, text2):
    vectorizer = CountVectorizer().fit_transform([text1, text2])
    vectors = vectorizer.toarray()
    cosine_sim = cosine_similarity(vectors)
    match_percentage = cosine_sim[0, 1] * 100  # Convert to percentage
    return match_percentage

  1. Input:
  • Takes two texts, text1 and text2, for comparison, where text1 is the resume text and text2 is the job description (JD) text.

2. Vectorization:

  • Utilizes CountVectorizer() to transform the texts into numerical vectors.
  • Each word becomes a dimension, and the count of occurrences represents the values in the vector.
  • For instance, if "apple" appears three times in text1 and once in text2, the texts are represented as vectors in a multi-dimensional space: [3, 0, ...] and [1, 0, ...].

3. Cosine Similarity:

  • Calculates the similarity between these vectors using cosine_similarity.
  • It measures the cosine of the angle between these vectors in a multi-dimensional space.
  • The formula for cosine similarity between two vectors A and B is:

cosine_sim = (A dot B) / (||A|| * ||B||)

  • Where A dot B is the dot product of vectors A and B, and ||A|| and ||B|| are their magnitudes (lengths).

4. Match Percentage:

  • Extracts the similarity score from cosine_similarity.
  • This score ranges from 0 (no match) to 1 (perfect match).

5. Conversion to Percentage:

  • Multiplies the similarity score by 100 to get a match percentage.
  • This percentage scale is easier to interpret: 0% (no match) to 100% (identical texts).
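To see the function in action, here is a quick sketch with two made-up snippets (the exact score depends entirely on the texts you feed in):

# Assumes calculate_match_percentage from above is defined
resume_snippet = "Python developer with experience in machine learning and data analysis"
jd_snippet = "Looking for a Python developer experienced in data analysis"

print(f"{calculate_match_percentage(resume_snippet, jd_snippet):.2f}%")
# Roughly 56% for these two snippets

Notice that raw word counts treat "experience" and "experienced" as different words; stemming or lemmatizing both texts first would be one possible improvement.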

Using this match function we can calculate the match percentage between the resume text and the job description text. Our next step is to extract the keywords from both texts.

Keyword Extraction

Here our aim is to filter out the stopwords and distill the key terms from the text:

def extract_key_terms(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text.lower())
    filtered_words = [w for w in word_tokens if w not in stop_words and w.isalpha()]
    return set(filtered_words)

  1. Stopwords Removal:
  • Creates a set of common English stopwords to filter out non-essential words like “the,” “is,” etc.

2. Tokenization and Lowercasing:

  • Breaks down the text into individual words (tokens) using NLTK’s word_tokenize.
  • Converts the tokens into lowercase to ensure uniformity.

3. Filtering Non-Stopwords:

  • Removes stopwords and non-alphabetic characters from the tokenized text.
  • Returns a set containing only the meaningful, filtered words from the original text.
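A quick sketch of the function on a made-up job description line (sets are unordered, so your output order may differ):

# Assumes extract_key_terms from above is defined
jd_line = "We are looking for a Data Analyst skilled in SQL and Python."
print(extract_key_terms(jd_line))
# {'looking', 'data', 'analyst', 'skilled', 'sql', 'python'}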

That wraps up the logic part of the code. Now let's create a simple and neat UI using Streamlit.

Creating UI using Streamlit

Here we are using Streamlit to create the frontend of our project, which is much easier than building it with HTML or CSS.

# Streamlit UI
st.title('Resume_JD Scorer')

# Text areas for user input
resume_text = st.text_area("Paste Your Resume Here")
jd_text = st.text_area("Paste Job Description Here")

if st.button('Match'):
    if resume_text and jd_text:
        # Calculate the match percentage
        match_percentage = calculate_match_percentage(resume_text, jd_text)
        st.write(f"Match Percentage: {match_percentage:.2f}%")

        # Extract key terms from the JD and check them against the resume
        jd_terms = extract_key_terms(jd_text)
        resume_terms = extract_key_terms(resume_text)
        missing_terms = jd_terms - resume_terms

        if match_percentage >= 70:
            st.success("Good Chances of getting your Resume Shortlisted.")
        elif 40 <= match_percentage < 70:
            st.warning("Good match but can be improved.")
            if missing_terms:
                st.info(f"Consider adding these key terms from the job description to your resume:\n {', '.join(missing_terms)}")
        elif match_percentage < 40:
            st.error("Poor match.")
            if missing_terms:
                st.info(f"Your resume is missing these key terms from the job description:\n {', '.join(missing_terms)}")
    else:
        st.warning("Please enter both Resume and Job Description.")

The code above is essentially self-explanatory; to learn more about Streamlit's widgets and attributes, see the Streamlit documentation.

Using resume_text and jd_text as its variables, it labels the input areas where users paste their resume and job description using Streamlit's text_area widget.


A "Match" button instantly displays match percentages and initiates comparison calculations. The interface provides users with customized feedback based on this percentage, advising them on how well their resume matches the job description. It's a smooth, user-friendly application that provides instantaneous insights and practical recommendations for optimizing resume content.

To run this on your local machine, type the following command into your terminal:

streamlit run file_name.py

As soon as you hit enter, you will see something like picture 1.0 and the app will be up and running. You can push this file to GitHub and then deploy it with Streamlit Cloud to make it publicly available; make sure to add a requirements.txt file when deploying.
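For reference, a minimal requirements.txt for this app might look like the following (these are the standard PyPI package names; pin versions as needed):

streamlit
scikit-learn
nltk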

All the code at once, in summary:

import streamlit as st
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'python'

# Function to calculate the match percentage between two texts
def calculate_match_percentage(text1, text2):
    vectorizer = CountVectorizer().fit_transform([text1, text2])
    vectors = vectorizer.toarray()
    cosine_sim = cosine_similarity(vectors)
    match_percentage = cosine_sim[0, 1] * 100  # Convert to percentage
    return match_percentage

# Function to extract the meaningful, non-stopword terms from a text
def extract_key_terms(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text.lower())
    filtered_words = [w for w in word_tokens if w not in stop_words and w.isalpha()]
    return set(filtered_words)

# Streamlit UI
st.title('Resume_JD Scorer')

# Text areas for user input
resume_text = st.text_area("Paste Your Resume Here")
jd_text = st.text_area("Paste Job Description Here")

if st.button('Match'):
    if resume_text and jd_text:
        # Calculate the match percentage
        match_percentage = calculate_match_percentage(resume_text, jd_text)
        st.write(f"Match Percentage: {match_percentage:.2f}%")

        # Extract key terms from the JD and check them against the resume
        jd_terms = extract_key_terms(jd_text)
        resume_terms = extract_key_terms(resume_text)
        missing_terms = jd_terms - resume_terms

        if match_percentage >= 70:
            st.success("Good Chances of getting your Resume Shortlisted.")
        elif 40 <= match_percentage < 70:
            st.warning("Good match but can be improved.")
            if missing_terms:
                st.info(f"Consider adding these key terms from the job description to your resume:\n {', '.join(missing_terms)}")
        elif match_percentage < 40:
            st.error("Poor match.")
            if missing_terms:
                st.info(f"Your resume is missing these key terms from the job description:\n {', '.join(missing_terms)}")
    else:
        st.warning("Please enter both Resume and Job Description.")

I hope you enjoyed building this along with me. If you found it useful, consider following me on Medium and giving the blog a clap.

Do follow me for more such content on working with Python and its libraries.
Thank you!! Happy learning!
