  • Courses
    • Data Science Coding Expert
      • Foundations Of Machine Learning (Free)
      • Python Programming (Free)
      • Numpy For Data Science (Free)
      • Pandas For Data Science (Free)
      • Linux Command Line (Free)
      • SQL for Data Science – I (Free)
      • SQL for Data Science – II (Free)
      • SQL for Data Science – III (Free)
      • SQL for Data Science – Window Functions (Free)
      • Machine Learning Expert
      • Linear Algebra for ML
      • Statistics for Data Science
      • Data Pre-Processing and EDA
      • Linear Regression and Regularisation
      • Classification: Logistic Regression
      • Supervised ML Algorithms
      • Imbalanced Classification
      • Ensemble Learning
      • Time Series Forecasting Expert
      • Introduction to Time Series Analysis
      • Time Series Analysis – I (Beginners)
      • Time Series Analysis – II (Intermediate)
      • Time Series Forecasting Part 1 – Statistical Models
      • Time Series Forecasting Part 2 – ARIMA modeling and Tests
      • Time Series Forecasting Part 3 – Vector Auto Regression
      • Time Series Analysis – III: Singular Spectrum Analysis
      • Feature Engineering for Time Series Projects – Part 1
      • Feature Engineering for Time Series Projects – Part 2
    • Deployment Expert
      • ML Deployment in AWS EC2
      • Deploy ML Models in AWS Lambda
      • Deploy ML Models in AWS Sagemaker
      • PySpark for Data Science – I: Fundamentals
      • PySpark for Data Science – II: Statistics for Big Data
      • PySpark for Data Science – III: Data Cleaning and Analysis
      • PySpark for Data Science – IV: Machine Learning
      • PySpark for Data Science – V: ML Pipelines
      • Deep Learning Expert
      • Foundations Of Deep Learning in Python
      • Foundations Of Deep Learning in Python 2
      • Applied Deep Learning with PyTorch
      • Detecting Defects in Steel Sheets with Computer-Vision
      • Project Text Generation using Language Models with LSTM
      • Project Classifying Sentiment of Reviews using BERT NLP
    • Industry Projects Expert
      • Estimating Customer Lifetime Value for Business
      • Microsoft Malware Detection Project
      • Credit Card Fraud Detection
      • Restaurant Visitor Forecasting
      • Optimizing Marketing Budget Spend with Market Mix Modelling
      • Predict Rating given Amazon Product Reviews using NLP
      • Uplift modeling: Estimating incremental impact of Marketing Campaigns
      • Uplift Modeling Part 2: Modeling-Strategies
      • Survival Analysis: Predicting Time to Event in real world applications
      • Survival Analysis Part 2: Predicting Time to Event for Lung Cancer Patients
      • Attribution Models in Marketing
      • Dynamic pricing using Multi Armed Bandit (Reinforcement Learning)
      • Reinforcement learning for Online Ad Serving with Multi Armed Bandits
      • MLFlow in Action: Hands on guide to ML experiments
    • Supplementary Courses
      • Base R Programming
      • Dplyr for Data Wrangling
      • Wrangling Data with DataTable
      • GGPlot2 Visualization for Data Analysis
      • Statistical Foundations for ML in R
      • Statistical Modeling with Linear Logistics Regression
      • Caret package in R
      • Spacy for NLP
      • View All Courses
  • Blog
    • Resources-old
      • Data Science Project Template
      • Time Series Project Template
      • Numpy Cheatsheets
      • Data Science Projects Bluebook
      • All Resources
    • Practice Exercises
      • 101 NumPy Exercises for Data Analysis (Python)
      • 101 Pandas Exercises for Data Analysis
      • 101 PySpark Exercises for Data Analysis
      • 101 Python datatable Exercises (pydatatable)
      • 101 NLP Exercises (using modern libraries)
      • 101 R data.table Exercises
    • Python
      • Setup Python environment for ML
      • How to speed up Python using Cython
      • Python to Cython in Jupyter
      • How to deal with Big Data in Python for ML Projects (100+ GB)?
      • Decorators in Python – How to enhance functions without changing the code?
      • Generators in Python – How to lazily return values only when needed and save memory?
      • Iterators in Python – What are Iterators and Iterables?
      • Python Module – What are modules and packages in python?
      • Object Oriented Programming (OOPS) in Python
      • Conda virtual environment
      • List Comprehensions in Python – My Simplified Guide
      • Parallel Processing in Python – A Practical Guide with Examples
      • Python @Property Explained – How to Use and When? (Full Examples)
      • pdb – How to use Python debugger
      • Python Regular Expressions Tutorial and Examples: A Simplified Guide
      • Python Logging – Simplest Guide with Full Code and Examples
      • datetime in Python – Simplified Guide with Clear Examples
      • Requests in Python Tutorial – How to send HTTP requests in Python?
      • Python JSON – Guide
      • Python Collections – An Introductory Guide
      • cProfile – How to profile your python code
      • Python Yield – What does the yield keyword do?
      • Lambda Function in Python – How and When to use?
      • What does Python Global Interpreter Lock – (GIL) do?
    • Time Series
      • Granger Causality Test
      • Augmented Dickey Fuller Test (ADF Test) – Must Read Guide
      • KPSS Test for Stationarity
      • ARIMA Model – Complete Guide to Time Series Forecasting in Python
      • Time Series Analysis in Python – A Comprehensive Guide with Examples
      • Vector Autoregression (VAR) – Comprehensive Guide with Examples in Python
    • Statistics
      • Partial Correlation
      • Chi-Square test – How to test statistical significance?
      • Gentle Introduction to Markov Chain
      • What is P-Value? – Understanding the meaning, math and methods
      • How to implement common statistical significance tests and find the p value?
      • Mahalanobis Distance – Understanding the math with examples (python)
      • T Test (Students T Test) – Understanding the math and how it works
      • Confidence Interval – Fully Explained
      • Understanding Standard Error – A practical guide with examples
      • One Sample T Test – Clearly Explained with Examples | ML+
    • Deep Learning
      • TensorFlow vs PyTorch – A Detailed Comparison
      • How to use tf.function to speed up Python code in Tensorflow
      • How to implement Linear Regression in TensorFlow
    • NLP
      • Complete Guide to Natural Language Processing (NLP) – with Practical Examples
      • Text Summarization Approaches for NLP – Practical Guide with Generative Examples
      • 101 NLP Exercises (using modern libraries)
      • Gensim Tutorial – A Complete Beginners Guide
      • LDA in Python – How to grid search best topic models?
      • Topic Modeling with Gensim (Python)
      • Lemmatization Approaches with Examples in Python
      • Topic modeling visualization – How to present the results of LDA models?
      • Cosine Similarity – Understanding the math and how it works (with python codes)
      • spaCy Tutorial – Complete Writeup
      • Training Custom NER models in SpaCy to auto-detect named entities [Complete Guide]
      • Building chatbot with Rasa and spaCy
      • SpaCy Text Classification – How to Train Text Classification Model in spaCy (Solved Example)?
    • Plots
      • Matplotlib Plotting Tutorial – Complete overview of Matplotlib library
      • Matplotlib Histogram – How to Visualize Distributions in Python
      • Bar Plot in Python – How to compare Groups visually
      • Python Boxplot – How to create and interpret boxplots (also find outliers and summarize distributions)
      • Waterfall Plot in Python
      • Top 50 matplotlib Visualizations – The Master Plots (with full python code)
      • Matplotlib Tutorial – A Complete Guide to Python Plot w/ Examples
      • Matplotlib Pyplot – How to import matplotlib in Python and create different plots
      • Python Scatter Plot – How to visualize relationship between two numeric features
      • Matplotlib Line Plot – How to create a line plot to visualize the trend?
      • Matplotlib Subplots – How to create multiple plots in same figure in Python?
    • Machine Learning
      • Main Pitfalls in Machine Learning Projects
      • Deploy ML model in AWS Ec2 – Complete no-step-missed guide
      • Feature selection using FRUFS and VevestaX
      • Simulated Annealing Algorithm Explained from Scratch (Python)
      • Bias Variance Tradeoff – Clearly Explained
      • Complete Introduction to Linear Regression in R
      • Caret Package – A Practical Guide to Machine Learning in R
      • Logistic Regression – A Complete Tutorial With Examples in R
      • Principal Component Analysis (PCA) – Better Explained
      • K-Means Clustering Algorithm from Scratch
      • How Naive Bayes Algorithm Works? (with example and full code)
      • Feature Selection – Ten Effective Techniques with Examples
      • Evaluation Metrics for Classification Models – How to measure performance of machine learning models?
      • Brier Score – How to measure accuracy of probabilistic predictions
      • Portfolio Optimization with Python using Efficient Frontier with Practical Examples
      • Gradient Boosting – A Concise Introduction from Scratch
    • Deployment
      • Population Stability Index (PSI)
      • Deploy ML model in AWS Ec2 – Complete no-step-missed guide
    • Julia
      • Julia – Programming Language
      • Linear Regression in Julia
      • Logistic Regression in Julia – Practical Guide with Examples
      • For-Loop in Julia
      • While-loop in Julia
      • Function in Julia
      • DataFrames in Julia
    • Data Wrangling
      • 101 NumPy Exercises for Data Analysis (Python)
      • 101 Pandas Exercises for Data Analysis
      • SQL Tutorial – A Simple and Intuitive Guide to the Structured Query Language
      • Dask – How to handle large dataframes in python using parallel computing
      • Modin – How to speedup pandas by changing one line of code
      • Python Numpy – Introduction to ndarray [Part 1]
      • data.table in R – The Complete Beginners Guide
      • 101 Python datatable Exercises (pydatatable)
      • 101 R data.table Exercises
      • 101 NLP Exercises (using modern libraries)
    • Recent
      • How to deal with Big Data in Python for ML Projects (100+ GB)?
      • Granger Causality Test
      • Main Pitfalls in Machine Learning Projects
      • Population Stability Index (PSI)
      • Deploy ML model in AWS Ec2 – Complete no-step-missed guide
      • Feature selection using FRUFS and VevestaX
      • Object Oriented Programming (OOPS) in Python
      • Simulated Annealing Algorithm Explained from Scratch (Python)
      • Partial Correlation
      • Chi-Square test – How to test statistical significance for categorical data?
      • Conda virtual environment
  • Pricing
  • Testimonials
  • Product
    • Complete Data Science Course (CDS)
      • Data Science Specializations >
        • DS Programming Specialization
        • Machine Learning Specialization
        • Deployment Specialization
        • Forecasting Specialization
        • DS Projects Specialization
        • Deep Learning Specialization
        • Supplementary Courses
    • Projects
    • Store🛒
  • Getting Started
    • #1. How to formulate machine learning problem
    • #2. Setup Python environment for ML
    • #3. Exploratory Data Analysis (EDA)
    • #4. How to reduce the memory size of Pandas Data frame
    • #5. Missing Data Imputation Approaches
    • #6. Interpolation in Python
    • #7. MICE imputation
    • #8. How to detect outliers using IQR and Boxplots?
    • #9. How to detect outliers with z-score
  • Beginners Corner
    • How to formulate machine learning problem
    • Setup Python environment for ML
    • What is a Data Scientist?
    • The story of how Data Scientists came into existence
    • Task Checklist for Almost Any Machine Learning Project
    • Data Science Roadmap (2023)
    • Why learn the math behind Machine Learning and AI?
    • Mistakes programmers make when starting machine learning
    • Machine Learning Use Cases
    • How to deal with Big Data in Python for ML Projects (100+ GB)?
    • Main Pitfalls in Machine Learning Projects
  • Courses
    • 1. Foundations of Machine Learning
    • 2. Python Programming
    • 3. NumPy for Data Science
    • 4. Pandas for Data Science
    • 5. Linux Command
    • 6. SQL for Data Science – Level 1
    • 7. SQL for Data Science – Level 2
    • 8. SQL for Data Science – Level 3
    • 9. SQL for Data Science – Window Functions
    • 10. Data Pre-processing and EDA
    • 11. Linear regression and regularisation
    • 12. Classification: Logistic Regression
    • 13. Imbalanced Classification
    • 14. Supervised ML Algorithms
    • 15. Ensemble Learning
    • 16. ML Deployment in AWS EC2
    • 17. Deploy in AWS Lambda
    • 18. Deploy in AWS Sagemaker
    • 19. PySpark for Data Science – I: Fundamentals
    • 20. PySpark for Data Science – II: Statistics for Big Data
    • 21. Introduction to Time Series Analysis
    • 22. Time Series Analysis – I (Beginners)
    • 23. Time Series Analysis – II (Intermediate)
    • 24. Time Series Forecasting Part 1: Statistical Models
    • 25. Time Series Forecasting Part 2: ARIMA modeling and Tests
    • 26. Time Series Forecasting Part 3: Vector Auto Regression
    • 27. Time Series Analysis – III: Singular Spectrum Analysis
    • 28. Feature Engineering for Time Series Project: I
    • 29. Feature Engineering for Time Series Projects: II
    • 31. Estimating customer lifetime value for business
    • 32. Microsoft malware detection project
    • 33. Credit card fraud detection
    • 34. Restaurant Visitor Forecasting
    • 35. Optimizing Marketing Budget Spend with Marketing Mix Modeling
    • 36. Predict Rating given Amazon Product review using NLP
    • 37. Foundations of Deep Learning in Python
    • 38. Foundations of Deep Learning: Part 2
    • 39. Applied Deep Learning with PyTorch
    • 40. Detecting defects in Steel sheet with Computer vision
    • 41. Project Text Generation using Language models with LSTM
    • 42. Project Classifying Sentiment of reviews using BERT NLP
    • 43. Spacy for NLP
    • 44. Base R Programming
    • 45. Dplyr for Data Wrangling
    • 46. Wrangling Data with Data Table
    • 47. GGPlot2 Visualization for Data Analysis
    • 48. Statistical foundation for ML in R
    • 49. Regression Model in R
    • 50. Caret Package in R
  • Python
    • Introduction to Python
      • Setup Python environment for ML
      • Decorators in Python
      • Generators in Python
      • Iterators in Python
      • Python Module
      • Object Oriented Programming (OOPS) in Python
      • List Comprehension
      • Requests in Python
      • Python Collections
      • Python Logging
    • Plots
      • Matplotlib Tutorial
      • Matplotlib Histogram
      • Bar Plot in Python
      • Python Boxplot
      • Waterfall Plot in Python
      • Top 50 matplotlib Visualizations
      • Matplotlib Pyplot
      • Python Scatter Plot
      • Matplotlib Subplots
    • Data Wrangling
      • 101 NumPy Exercises for Data Analysis (Python)
      • 101 Pandas Exercises for Data Analysis
      • Dask
      • Modin
      • Numpy Tutorial
      • data.table in R
      • 101 Python datatable Exercises (pydatatable)
      • 101 R data.table Exercises
    • Advanced Python
      • Conda create environment and everything you need to know to manage conda virtual environment
      • Python @Property Explained
      • pdb – How to use Python debugger
      • Python JSON – Guide
      • cProfile – How to profile your python code
      • Python Yield
      • Lambda Function in Python
      • What does Python Global Interpreter Lock
      • Install opencv python
      • Install pip mac
      • Scrapy vs. Beautiful Soup
      • Add Python to PATH
    • PySpark
      • Introduction to Pyspark
      • Power of Pyspark
      • Install PySpark on Windows
      • Install PySpark on MAC
      • Install PySpark on Linux
      • What is Sparksession
      • Read and Write files using PySpark
      • Pyspark Show
      • Run SQL Queries with PySpark
      • PySpark Pandas API
      • Select columns in PySpark dataframe
      • PySpark withColumn()
      • Pyspark Drop Columns
      • PySpark Rename Columns
      • PySpark Filter vs Where
      • PySpark orderBy() and sort()
      • PySpark GroupBy()
      • PySpark Pivot
      • PySpark Joins
      • PySpark Union
      • PySpark Connect to MySQL
      • PySpark Connect to PostgreSQL
      • PySpark Connect to SQL Server
      • PySpark Connect to Redshift
      • PySpark Connect to Snowflake
      • PySpark Linear Regression
      • PySpark Logistic Regression
      • PySpark Decision Tree
      • PySpark Ridge Regression
      • PySpark Lasso Regression
      • PySpark Random Forest
      • PySpark Gradient Boosting model
      • PySpark Mllib K-Means Clustering
      • PySpark Statistics Mean
      • PySpark Statistics Median
      • PySpark Statistics Mode
      • PySpark Statistics Standard Deviation
      • PySpark Statistics Variance
      • PySpark Statistics Deciles and Quartiles
      • PySpark Correlation
      • PySpark Chi-Square Test
      • PySpark Variable type Identification
      • PySpark Outlier Detection and Treatment
      • PySpark Missing Data Imputation
      • PySpark Variance Inflation Factor (VIF)
      • PySpark StringIndexer
      • PySpark OneHot Encoding
      • PySpark Exercises – 101 PySpark Exercises for Data Analysis
      • Others
        • Deployment
          • Population Stability Index (PSI)
          • Deploy ML model in AWS Ec2
        • Julia
          • Julia – Programming Language
          • Linear Regression in Julia
          • Logistic Regression in Julia
          • For-Loop in Julia
          • While-loop in Julia
          • Function in Julia
          • DataFrames in Julia
        • Linux
          • ls command in Linux – Mastering the “ls” command in Linux
          • mkdir command in Linux – A comprehensive guide for mkdir command
          • cd command in linux – Mastering the ‘cd’ command in Linux
          • cat command in Linux – Mastering the ‘cat’ command in Linux
          • Linux Commands List with Examples
  • Machine Learning
    • Deep Learning
      • TensorFlow vs PyTorch
      • How to use tf.function to speed up Python code in Tensorflow
      • How to implement Linear Regression in TensorFlow
    • NLP
      • Complete Guide to Natural Language Processing (NLP)
      • Text Summarization Approaches for NLP
      • 101 NLP Exercises (using modern libraries)
      • Gensim Tutorial
      • LDA in Python
      • Topic Modeling with Gensim (Python)
      • Lemmatization Approaches with Examples in Python
      • Topic modeling visualization
      • Cosine Similarity
      • spaCy Tutorial
      • Training Custom NER models in SpaCy to auto-detect named entities
      • Building chatbot with Rasa and spaCy
      • SpaCy Text Classification
    • Algorithms
      • K-Means Clustering Algorithm from Scratch
      • Simulated Annealing Algorithm Explained from Scratch
      • How Naive Bayes Algorithm Works?
      • Feature selection using FRUFS and VevestaX
      • Principal Component Analysis
      • Gradient Boosting
      • Feature Selection – Ten Effective Techniques with Examples
    • Projects
      • Evaluation Metrics for Classification Models
      • Deploy ML model in AWS Ec2
      • Portfolio Optimization with Python using Efficient Frontier
      • Bias Variance Tradeoff
    • Specific Topics
      • Logistic Regression
      • Complete Introduction to Linear Regression in R
      • Caret Package
      • Brier Score
  • Time Series
    • Granger Causality Test
    • Augmented Dickey Fuller Test (ADF Test)
    • KPSS Test for Stationarity
    • ARIMA Model
    • Time Series Analysis in Python
    • Vector Autoregression (VAR)
  • Prob and Stats
    • Probability
      • Introduction to Probability
      • Odds and Odds Ratios
      • Independent and Dependent Events
      • Mutually Exclusive Events
      • Joint Probability
      • Conditional Probability
      • Bayes’ Theorem
      • Expected Value
      • Probability frequency distribution
      • Discrete Frequency Distributions
      • Continuous Frequency Distributions
    • Partial Correlation
    • Chi-Square Test – Theory & Math
    • Gentle Introduction to Markov Chain
    • What is P-Value?
    • How to implement common statistical significance tests and find the p value?
    • Mahalanobis Distance
    • T Test (Students T Test)
    • Confidence Interval in Statistics
    • Standard Error in Statistics
    • One Sample T Test
    • Descriptive and inferential statistics
    • Types of data in statistics
    • Measures of central tendency
    • Quantiles and Percentiles
    • Measures of dispersion
    • Skewness and kurtosis
    • Central Limit Theorem
    • Law of large numbers
    • Standard Error
    • Sampling and sampling distributions
    • Correlation
  • SQL
    • SQL Tutorial – The Introduction
    • SQL Subquery (advanced)
    • SQL Window Functions (advanced)
    • SQL Window Functions Exercises – Set 1
    • SQL Window Functions Exercises – Set 2
    • Intro to SQL
    • SQL Select
    • SQL Select Distinct
    • SQL Where
    • SQL Order by
    • SQL Insert Into
    • SQL AND, OR, and NOT
    • SQL Null Values
    • SQL Update
    • SQL DELETE
    • SQL SELECT TOP
    • SQL MIN and MAX Functions
    • SQL Count(), Avg(), Sum()
    • SQL LIKE
    • SQL Wildcards
    • SQL IN
    • SQL BETWEEN
    • SQL Aliases
    • SQL Joins
    • SQL Inner Join
    • SQL Left Join
    • SQL Right Join
    • SQL Full Join
    • SQL Self Join
    • SQL UNION
    • SQL GROUP BY
    • SQL HAVING
    • SQL EXISTS
    • SQL ANY, ALL Operators
    • How to transpose columns to rows in SQL?
    • How to select only rows with max value on a column?
    • SQL Select Into
    • SQL Insert Into Select
    • SQL Case
    • SQL Null Functions
    • SQL Comments
    • SQL Operators
    • SQL Create Table
    • SQL Drop Table
    • SQL Primary Key
    • SQL Foreign Key
    • Sort multiple columns in SQL and in different directions?
    • Count the number of work days between two dates?
    • Compute maximum of multiple columns, aka row-wise max?
    • GROUP BY clause on multiple columns in SQL?
  • Linear Algebra
    • 01. Introduction to Linear Algebra
    • 02. Types of Tensors
    • 03. Scalars
    • 04. Vectors
    • 05. Vectors Linear Algebra
    • 06. Matrix Types
    • 07. Matrix Operations
    • 08. Orthogonal and Orthonormal Matrix
    • 09. Eigenvectors and Eigenvalues
    • 10. Affine Transformation
    • 11. Singular Value Decomposition (SVD)
    • 12. System of Equations
    • 13. Linear Regression Algorithm
    • 14. Principal Component Analysis
    • 27. Time Series Analysis – III: Singular Spectrum Analysis
    • 28. Feature Engineering for Time Series Project: I
    • 29. Feature Engineering for Time Series Projects: II
    • 31. Estimating customer lifetime value for business
    • 32. Microsoft malware detection project
    • 33. Credit card fraud detection
    • 34. Restaurant Visitor Forecasting
    • 35. Optimizing Marketing Budget Spend with Marketing Mix Modeling
    • 36. Predict Rating given Amazon Product review using NLP
    • 37. Foundations of Deep Learning in Python
    • 38. Foundations of Deep Learning: Part 2
    • 39. Applied Deep Learning with PyTorch
    • 40. Detecting defects in Steel sheet with Computer vision
    • 41. Project Text Generation using Language models with LSTM
    • 42. Project Classifying Sentiment of reviews using BERT NLP
    • 43. Spacy for NLP
    • 44. Base R Programming
    • 45. Dplyr for Data Wrangling
    • 46. Wrangling Data with Data Table
    • 47. GGPlot2 Visualization for Data Analysis
    • 48. Statistical foundation for ML in R
    • 49. Regression Model in R
    • 50. Caret Package in R
  • Python
    • Introduction to Python
      • Setup Python environment for ML
      • Decorators in Python
      • Generators in Python
      • Iterators in Python
      • Python Module
      • Object Oriented Programming (OOPS) in Python
      • List Comprehension
      • Requests in Python
      • Python Collections
      • Python Logging
    • Plots
      • Matplotlib Tutorial
      • Matplotlib Histogram
      • Bar Plot in Python
      • Python Boxplot
      • Waterfall Plot in Python
      • Top 50 matplotlib Visualizations
      • Matplotlib Tutorial
      • Matplotlib Pyplot
      • Python Scatter Plot
      • Matplotlib Subplots
    • Data Wrangling
      • 101 NumPy Exercises for Data Analysis (Python)
      • 101 Pandas Exercises for Data Analysis
      • 101 Pandas Exercises for Data Analysis
      • Dask
      • Modin
      • Numpy Tutorial
      • data.table in R
      • 101 Python datatable Exercises (pydatatable)
      • 101 R data.table Exercises
    • Advanced Python
      • Conda create environment and everything you need to know to manage conda virtual environment
      • Python @Property Explained
      • pdb – How to use Python debugger
      • Python JSON – Guide
      • cProfile – How to profile your python code
      • Python Yield
      • Lambda Function in Python
      • What does Python Global Interpreter Lock
      • Install opencv python
      • Install pip mac
      • Scrapy vs. Beautiful Soup
      • Add Python to PATH
    • PySpark
      • Introduction to Pyspark
      • Power of Pyspark
      • Install PySpark on Windows
      • Install PySpark on MAC
      • Install PySpark on Linux
      • What is Sparksession
      • Read and Write files using PySpark
      • Pyspark Show
      • Run SQL Queries with PySpark
      • PySpark Pandas API
      • Select columns in PySpark dataframe
      • PySpark withColumn()
      • Pyspark Drop Columns
      • PySpark Rename Columns
      • PySpark Filter vs Where
      • PySpark orderBy() and sort()
      • PySpark GroupBy()
      • PySpark Pivot
      • PySpark Joins
      • PySpark Union
      • PySpark Connect to MySQL
      • PySpark Connect to PostgreSQL
      • PySpark Connect to SQL Serve
      • PySpark Connect to Redshift
      • PySpark Connect to Snowflake
      • PySpark Linear Regression
      • PySpark Logistic Regression
      • PySpark Decision Tree
      • PySpark Ridge Regression
      • PySpark Lasso Regression
      • PySpark Random Forest
      • PySpark Gradient Boosting model
      • PySpark Mllib K-Means Clustering
      • PySpark Statistics Mean
      • PySpark Statistics Median
      • PySpark Statistics Mode
      • PySpark Statistics Standard Deviation
      • PySpark Statistics Variance
      • PySpark Statistics Deciles and Quartiles
      • PySpark Correlation
      • PySpark Chi-Square Test
      • PySpark Variable type Identification
      • PySpark Outlier Detection and Treatment
      • PySpark Missing Data Imputation
      • PySpark Variance Inflation Factor (VIF)
      • PySpark StringIndexer
      • PySpark OneHot Encoding
      • PySpark Exercises – 101 PySpark Exercises for Data Analysis
      • Others
        • Deployment
          • Population Stability Index (PSI)
          • Deploy ML model in AWS Ec2
        • Julia
          • Julia – Programming Language
          • Linear Regression in Julia
          • Logistic Regression in Julia
          • For-Loop in Julia
          • While-loop in Julia
          • Function in Julia
          • DataFrames in Julia
        • Linux
          • ls command in Linux – Mastering the “ls” command in Linux
          • mkdir command in Linux – A comprehensive guide for mkdir command
          • cd command in linux – Mastering the ‘cd’ command in Linux
          • cat command in Linux – Mastering the ‘cat’ command in Linux
          • Linux Commands List with Examples
  • Machine Learning
    • Deep Learning
      • TensorFlow vs PyTorch
      • How to use tf.function to speed up Python code in Tensorflow
      • How to implement Linear Regression in TensorFlow
    • NLP
      • Complete Guide to Natural Language Processing (NLP)
      • Text Summarization Approaches for NLP
      • 101 NLP Exercises (using modern libraries)
      • Gensim Tutorial
      • LDA in Python
      • Topic Modeling with Gensim (Python)
      • Lemmatization Approaches with Examples in Python
      • Topic modeling visualization
      • Cosine Similarity
      • spaCy Tutorial
      • Training Custom NER models in SpaCy to auto-detect named entities
      • Building chatbot with Rasa and spaCy
      • SpaCy Text Classification
    • Algorithms
      • K-Means Clustering Algorithm from Scratch
      • Simulated Annealing Algorithm Explained from Scratch
      • How Naive Bayes Algorithm Works?
      • Feature selection using FRUFS and VevestaX
      • Principal Component Analysis
      • Gradient Boosting
      • Feature Selection – Ten Effective Techniques with Examples
    • Projects
      • Evaluation Metrics for Classification Models
      • Deploy ML model in AWS Ec2
      • Portfolio Optimization with Python using Efficient Frontier
      • Bias Variance Tradeoff
    • Specific Topics
      • Logistic Regression
      • Complete Introduction to Linear Regression in R
      • Caret Package
      • Brier Score
  • Time Series
    • Granger Causality Test
    • Augmented Dickey Fuller Test (ADF Test)
    • KPSS Test for Stationarity
    • ARIMA Model
    • Time Series Analysis in Python
    • Vector Autoregression (VAR)
  • Prob and Stats
    • Probability
      • Introduction to Probability
      • Odds and Odds Ratios
      • Independent and Dependent Events
      • Mutually Exclusive Events
      • Joint Probability
      • Conditional Probability
      • Bayes’ Theorem
      • Expected Value
      • Probability frequency distribution
      • Discrete Frequency Distributions
      • Continuous Frequency Distributions
    • Partial Correlation
    • Chi-Square Test – Theory & Math
    • Gentle Introduction to Markov Chain
    • What is P-Value?
    • How to implement common statistical significance tests and find the p value?
    • Mahalanobis Distance
    • T Test (Students T Test)
    • Confidence Interval in Statistics
    • Standard Error in Statistics
    • One Sample T Test
    • Descriptive and inferential statistics
    • Types of data in statistics
    • Measures of central tendency
    • Quantiles and Percentiles
    • Measures of dispersion
    • Skewness and kurtosis
    • Central Limit Theroem
    • Law of large numbers
    • Standard Error
    • Sampling and sampling distributions
    • Correlation
  • SQL
    • SQL Tutorial – The Introduction
    • SQL Subquery (advanced)
    • SQL Window Functions (advanced)
    • SQL Window Functions Exercises – Set 1
    • SQL Window Functions Exercises – Set 2
    • Intro to SQL
    • SQL Select
    • SQL Select Distinct
    • SQL Where
    • SQL Order by
    • SQL Insert Into
    • SQL AND, OR, and NOT
    • SQL Null Values
    • SQL Update
    • SQL DELETE
    • SQL SELECT TOP
    • SQL MIN and MAX Functions
    • SQL Count(), Avg(), Sum()
    • SQL LIKE
    • SQL Wildcards
    • SQL IN
    • SQL BETWEEN
    • SQL Aliases
    • SQL Joins
    • SQL Inner Join
    • SQL Left Join
    • SQL Right Join
    • SQL Full Join
    • SQL Self Join
    • SQL UNION
    • SQL GROUP BY
    • SQL HAVING
    • SQL EXISTS
    • SQL ANY, ALL Operators
    • How to transpose columns to rows in SQL?
    • How to select only rows with max value on a column?
    • SQL Select Into
    • SQL Insert Into Select
    • SQL Case
    • SQL Null Functions
    • SQL Comments
    • SQL Operators
    • SQL Create Table
    • SQL Drop Table
    • SQL Primary Key
    • SQL Foreign Key
    • Sort multiple columns in SQL and in different directions?
    • Count the number of work days between two dates?
    • Compute maximum of multiple columns, aks row wise max?
    • GROUP BY clause on multiple columns in SQL?
  • Linear Algebra
    • 01. Introduction to Linear Algebra
    • 02. Types of Tensors
    • 03. Scalars
    • 04. Vectors
    • 05. Vectors Linear Algebra
    • 06. Matrix Types
    • 07. Matrix Operations
    • 08. Orthogonal and Ortrhonormal Matrix
    • 09. Eigenvectors and Eigenvalues
    • 10. Affine Transformation
    • 11. Singular Value Decomposition (SVD)
    • 12. System of Equations
    • 13. Linear Regression Algorithm
    • 14. Principal Component Analysis

Python Regular Expressions Tutorial and Examples: A Simplified Guide

  • January 20, 2018
  • Selva Prabhakaran

Regular expressions, also called regex, are a syntax, or rather a small language, for searching, extracting and manipulating specific string patterns in a larger text. They are widely used in projects that involve text validation, NLP and text mining.

Regular Expressions in Python: A Simplified Tutorial. Photo by Sarah Crutchfield.

Contents

  1. Introduction to regular expressions
  2. What is a regex pattern and how to compile one?
  3. How to split a string separated by a regex?
  4. Finding pattern matches using findall, search and match
    4.1. What does re.findall() do?
    4.2. re.search() vs re.match()
  5. How to substitute one text with another using regex?
  6. Regex groups
  7. What is greedy matching in regex?
  8. Most common regular expression syntax and patterns
  9. Regular Expressions Examples
    9.1. Any character except for a new line
    9.2. A period
    9.3. Any digit
    9.4. Anything but a digit
    9.5. Any character, including digits
    9.6. Anything but a character
    9.7. Collection of characters
    9.8. Match something upto ‘n’ times
    9.9. Match 1 or more occurrences
    9.10. Match any number of occurrences (0 or more times)
    9.11. Match exactly zero or one occurrence
    9.12. Match word boundaries
  10. Practice Exercises
  11. Conclusion

1. Introduction to regular expressions

Regular expressions, also called regex, are implemented in pretty much every programming language. In Python, they are available through the standard module re.

Regex is widely used in natural language processing, web applications that require validating string input (like email addresses) and most data science projects that involve text mining. This post is structured into 2 parts.

Before getting to the regular expressions syntax, it’s better for you to first understand how the re module works.

So, you will first get introduced to the 5 main features of the re module and then see how to create commonly used regular expressions in Python.

You will see how to construct pretty much any string pattern you will likely need when working on text mining related projects.

2. What is a regex pattern and how to compile one?

A regex pattern is a special language used to represent generic text, numbers or symbols so it can be used to extract texts that conform to that pattern.

A basic example is '\s+'. Here the '\s' matches any whitespace character.

Adding a '+' at the end makes the pattern match one or more whitespace characters.

So, this pattern will match tab '\t' characters as well. A larger list of regex patterns comes at the end of this post. But before getting to that, let’s see how to compile and play with regular expressions.

import re
regex = re.compile(r'\s+')

The above code imports the re module and compiles a regular expression pattern that matches one or more whitespace characters. The r prefix makes the pattern a raw string, so Python passes the backslash through to the regex engine unchanged.
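As a quick check that '\s+' covers more than plain spaces, here is a minimal sketch. The sample string is made up for illustration.

```python
import re

regex = re.compile(r'\s+')

# made-up sample mixing spaces, a tab and newlines
sample = 'a \t b\n\nc'
matches = regex.findall(sample)
print(matches)
#> [' \t ', '\n\n']
```

Each run of consecutive whitespace, whatever its mix of characters, comes back as a single match.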

3. How to split a string separated by a regex?

Let’s consider the following piece of text.

text = """101 COM    Computers
205 MAT   Mathematics
189 ENG   English"""

I have three course items in the following format: “[Course Number] [Course Code] [Course Name]”.

The spacing between the words is not equal.

I want to split these three course items into individual units of numbers and words.

How to do that?

This can be split in two ways:

1. By using the re.split method.
2. By calling the split method of the regex object.

# split the text around 1 or more whitespace characters
re.split(r'\s+', text)
# or
regex.split(text)
#> ['101', 'COM', 'Computers', '205', 'MAT', 'Mathematics', '189', 'ENG', 'English']

So both these methods work. But which one to use in practice?

If you intend to use a particular pattern multiple times, then you are better off compiling a regular expression rather than using re.split over and over again.
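Here is a small sketch of reusing one compiled pattern across several strings. The course lines, including the third one, are hypothetical, and the optional maxsplit argument is used so that a course name containing a space stays in one piece.

```python
import re

splitter = re.compile(r'\s+')  # compiled once, reused for every line

# hypothetical course lines; the last name contains a space
lines = ['101 COM    Computers', '205 MAT   Mathematics', '310 CSE   Computer Science']

# maxsplit=2 stops after the first two separators, keeping the name intact
rows = [splitter.split(line, maxsplit=2) for line in lines]
print(rows)
#> [['101', 'COM', 'Computers'], ['205', 'MAT', 'Mathematics'], ['310', 'CSE', 'Computer Science']]
```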

4. Finding pattern matches using findall, search and match

Let’s suppose you want to extract all the course numbers, that is, the numbers 101, 205 and 189 alone from the above text. How to do that?

4.1 What does re.findall() do?

# find all numbers within the text
print(text)
regex_num = re.compile(r'\d+')
regex_num.findall(text)
#> 101 COM    Computers
#> 205 MAT   Mathematics
#> 189 ENG   English
#> ['101', '205', '189']

In the above code, the special sequence '\d' is a regular expression that matches any digit.

I will be covering more such patterns later in this tutorial.

Adding a '+' symbol to it requires at least 1 digit to be present for a match.

Similar to '+', there is a '*' symbol which matches 0 or more digits.

It practically makes the presence of a digit optional in order to make a match. More on this later. Finally, the findall method extracts all occurrences of 1 or more digits from the text and returns them in a list.
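To see the difference between '+' and '*' side by side, here is a minimal sketch on a made-up string, where each pattern looks for the literal 'ID' followed by digits.

```python
import re

ids = 'ID ID7 ID42'  # made-up sample

plus = re.findall(r'ID\d+', ids)  # at least one digit must follow 'ID'
star = re.findall(r'ID\d*', ids)  # digits after 'ID' are optional
print(plus)
#> ['ID7', 'ID42']
print(star)
#> ['ID', 'ID7', 'ID42']
```

The bare 'ID' is picked up only by the '*' version, because '*' is satisfied with zero digits.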

4.2 re.search() vs re.match()

As the name suggests, regex.search() searches for the pattern in a given text.

But unlike findall which returns the matched portions of the text as a list, regex.search() returns a particular match object that contains the starting and ending positions of the first occurrence of the pattern. Likewise, regex.match() also returns a match object.

But the difference is, it requires the pattern to be present at the beginning of the text itself.

# define the text
text2 = """COM Computers 205 MAT Mathematics 189"""
# compile the regex and search the pattern
regex_num = re.compile(r'\d+')
s = regex_num.search(text2)
print('Starting Position: ', s.start())
print('Ending Position: ', s.end())
print(text2[s.start():s.end()])

#> Starting Position:  14
#> Ending Position:  17
#> 205

Alternately, you can get the same output using the group() method of the match object.

print(s.group())
#> 205
m = regex_num.match(text2)
print(m)

#> None
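Since match() returns None when the pattern is not at the start of the text, calling .group() on the result directly would raise an AttributeError. The sketch below, on two made-up strings, checks for None first and compares it with search().

```python
import re

regex_num = re.compile(r'\d+')

results = []
for s in ['101 COM', 'COM 101']:  # made-up samples
    m = regex_num.match(s)        # only looks at the start of the string
    found = regex_num.search(s)   # scans the whole string
    results.append((m.group() if m else None, found.span()))

print(results)
#> [('101', (0, 3)), (None, (4, 7))]
```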

 

5. How to substitute one text with another using regex?

To replace text, use regex.sub().

Let’s consider the following modified version of the courses text. Here I have added an extra tab after each course code.

# define the text
text = """101   COM \t  Computers
205   MAT \t  Mathematics
189   ENG \t  English"""
print(text)

#> 101   COM      Computers
#> 205   MAT      Mathematics
#> 189   ENG      English

From the above text, I want to even out all the extra spaces and put all the words in one single line. To do this, you just have to use regex.sub to replace the '\s+' pattern with a single space ‘ ‘.

# replace one or more spaces with single space
regex = re.compile(r'\s+')
print(regex.sub(' ', text))
# or: re.sub(r'\s+', ' ', text)
#> 101 COM Computers 205 MAT Mathematics 189 ENG English

Suppose you only want to get rid of the extra spaces but want to keep the course entries in the new line itself.

To achieve that you should use a regex that effectively excludes new line characters but includes all other whitespaces.

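This can be done using a negative lookahead (?!\n). Placed before '\s', it checks whether the upcoming character is a newline and, if so, keeps it out of the match. Grouping the lookahead together with '\s' before the '+' applies the check to every whitespace character, not just the first one.

# get rid of all extra spaces except newline
regex = re.compile(r'(?:(?!\n)\s)+')
print(regex.sub(' ', text))

#> 101 COM Computers
#> 205 MAT Mathematics
#> 189 ENG English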
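An alternative way to express "any whitespace except a newline", without a lookahead, is the negated character class [^\S\n]. This is a sketch on a made-up two-line sample, not the only way to do it.

```python
import re

# made-up sample with uneven spaces and tabs, one course per line
sample = "101   COM \t  Computers\n205   MAT \t  Mathematics"

# [^\S\n] reads as: not a non-whitespace character and not a newline
cleaned = re.sub(r'[^\S\n]+', ' ', sample)
print(cleaned)
#> 101 COM Computers
#> 205 MAT Mathematics
```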

 

6. Regex groups

Regular expression groups are a very useful feature that lets you extract the desired parts of a match as individual items. Suppose I want to extract the course number, code and the name as separate items. Without groups, I will have to write something like this.

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""  


# 1. extract all course numbers
re.findall('[0-9]+', text)
# 2. extract all course codes
re.findall('[A-Z]{3}', text)
# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

#> ['101', '205', '189']
#> ['COM', 'MAT', 'ENG']
#> ['Computers', 'Mathematics', 'English']

Well, let’s see what just happened.

I wrote 3 separate regular expressions, one each for matching the course number, code and the name.

For the course number, the pattern [0-9]+ matches any digit from 0 to 9.

Adding a + symbol at the end makes it look for 1 or more occurrences of the digits 0-9. If you know the course number will certainly have exactly 3 digits, the pattern could have been [0-9]{3} instead.

For the course code, you can guess that '[A-Z]{3}' will match exactly 3 consecutive capital letters A-Z.

For the course name, '[A-Za-z]{4,}' will look for upper and lower case letters, assuming all course names have at least 4 characters.

Can you guess what would be the pattern if the maximum limit of characters in course name is say, 20?

Now I had to write 3 separate lines to get the individual items.

But there is a better way: Regex Groups.

Since all the entries have the same pattern, you can construct a unified pattern for the entire course entry and put the portions you want to extract inside a pair of brackets ().

# define the course text pattern groups and extract
course_pattern = r'([0-9]+)\s*([A-Z]{3})\s*([A-Za-z]{4,})'
re.findall(course_pattern, text)
#> [('101', 'COM', 'Computers'), ('205', 'MAT', 'Mathematics'), ('189', 'ENG', 'English')]

Notice the patterns for the course num: [0-9]+, code: [A-Z]{3} and name: [A-Za-z]{4,} are all placed inside parentheses () in order to form the groups.
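Groups can also be given names with the (?P<name>...) syntax, which makes the extracted pieces easier to refer to. The sketch below is one possible variant of the course pattern; the group names number, code and name are chosen here purely for illustration.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# the group names (number, code, name) are chosen here for illustration
course_pattern = re.compile(
    r'(?P<number>[0-9]+)\s+(?P<code>[A-Z]{3})\s+(?P<name>[A-Za-z]{4,})'
)

# finditer yields match objects; groupdict maps group names to matched text
courses = [m.groupdict() for m in course_pattern.finditer(text)]
print(courses[0])
#> {'number': '101', 'code': 'COM', 'name': 'Computers'}
```

Compared with the tuples returned by findall, the dictionaries do not depend on remembering the order of the groups.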

7. What is greedy matching in regex?

The default behavior of regular expressions is to be greedy. That means a pattern matches as much text as it can while still conforming to the pattern, even when a shorter match would have been sufficient.

Let’s see an example of a piece of HTML, where I want to retrieve the HTML tag.

text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']

Instead of matching till the first occurrence of ‘>’, which I was hoping would happen at the end of the first body tag itself, it extracted the whole string.

This is the default greedy or ‘take it all’ behavior of regex. Lazy matching, on the other hand, ‘takes as little as possible’. You can make a quantifier lazy by adding a `?` right after it, as in `.*?`.

re.findall('<.*?>', text)
#> ['<body>', '</body>']

If you want only the first match to be retrieved, use the search method instead.

re.search('<.*?>', text).group()
#> '<body>'
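Another common way to keep a tag pattern from running past the closing ‘>’ is a negated character class instead of a lazy quantifier. This is a minimal sketch on a made-up HTML snippet.

```python
import re

html = '<p>Hello <b>world</b></p>'  # made-up snippet

# [^>]* can never run past a '>', so greediness is harmless here
tags = re.findall(r'<[^>]*>', html)
print(tags)
#> ['<p>', '<b>', '</b>', '</p>']
```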

 

8. Most common regular expression syntax and patterns

Now that you understand how to use the re module, let’s see some commonly used wildcard patterns.

Basic Syntax
.             One character except new line
\.            A period. \ escapes a special character.
\d            One digit
\D            One non-digit
\w            One word character including digits
\W            One non-word character
\s            One whitespace
\S            One non-whitespace
\b            Word boundary
\n            Newline
\t            Tab

Modifiers
$             End of string
^             Start of string
ab|cd         Matches ab or cd
[ab-d]        One character of: a, b, c, d
[^ab-d]       One character except: a, b, c, d
()            Items within parentheses are retrieved
(a(bc))       Items within the sub-parentheses are retrieved

Repetitions
[ab]{2}       Exactly 2 continuous occurrences of a or b
[ab]{2,5}     2 to 5 continuous occurrences of a or b
[ab]{2,}      2 or more continuous occurrences of a or b
+             One or more
*             Zero or more
?             0 or 1
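A few of the entries above can be tried out directly. The following sketch runs them on short made-up strings; the variable names are only for illustration.

```python
import re

alternation = re.findall(r'ab|cd', 'abcdab')    # ab or cd
char_class = re.findall(r'[ab-d]', 'abcdef')    # one of a, b, c, d
negated = re.findall(r'[^ab-d]', 'abcdef')      # anything except a, b, c, d
repeated = re.findall(r'[ab]{2,}', 'abba cab')  # 2 or more of a or b in a row
nested = re.findall(r'(a(bc))', 'abc')          # outer and inner group

print(alternation)
#> ['ab', 'cd', 'ab']
print(char_class)
#> ['a', 'b', 'c', 'd']
print(negated)
#> ['e', 'f']
print(repeated)
#> ['abba', 'ab']
print(nested)
#> [('abc', 'bc')]
```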

9. Regular Expressions Examples

9.1. Any character except for a new line

text = 'machinelearningplus.com'
print(re.findall(r'.', text))  # .   Any character except for a new line
print(re.findall(r'...', text))
#> ['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', '.', 'c', 'o', 'm']
#> ['mac', 'hin', 'ele', 'arn', 'ing', 'plu', 's.c']
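To illustrate the "except for a new line" part, the sketch below runs the same '.' pattern on a made-up two-line string, first with default settings and then with the re.DOTALL flag, which lets '.' match newlines too.

```python
import re

two_lines = 'ab\ncd'  # made-up sample

default = re.findall(r'.', two_lines)
dotall = re.findall(r'.', two_lines, flags=re.DOTALL)
print(default)
#> ['a', 'b', 'c', 'd']
print(dotall)
#> ['a', 'b', '\n', 'c', 'd']
```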

9.2. A period

text = 'machinelearningplus.com'
print(re.findall(r'\.', text))  # matches a period
print(re.findall(r'[^\.]', text)) # matches anything but a period
#> ['.']
#> ['m', 'a', 'c', 'h', 'i', 'n', 'e', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'l', 'u', 's', 'c', 'o', 'm']

9.3. Any digit

text = '01, Jan 2015'
print(re.findall(r'\d+', text))  # \d  Any digit. The + mandates at least 1 digit.
#> ['01', '2015']

9.4. Anything but a digit

text = '01, Jan 2015'
print(re.findall(r'\D+', text))  # \D  Anything but a digit
#> [', Jan ']

9.5. Any character, including digits

text = '01, Jan 2015'
print(re.findall(r'\w+', text))  # \w  Any word character (letters, digits, underscore)
#> ['01', 'Jan', '2015']

9.6. Anything but a character

text = '01, Jan 2015'
print(re.findall(r'\W+', text))  # \W  Anything but a word character
#> [', ', ' ']

9.7. Collection of characters

text = '01, Jan 2015'
print(re.findall('[a-zA-Z]+', text))  # [] Matches any character inside 
#> ['Jan']

9.8. Match something upto ‘n’ times

text = '01, Jan 2015'
print(re.findall(r'\d{4}', text))  # {n} Matches exactly n repetitions
print(re.findall(r'\d{2,4}', text))  # {m,n} Matches m to n repetitions
#> ['2015']
#> ['01', '2015']

9.9. Match 1 or more occurrences

print(re.findall(r'Co+l', 'So Cooool'))  # Match for 1 or more occurrences 
#> ['Cooool']

9.10. Match any number of occurrences (0 or more times)

print(re.findall(r'Pi*lani', 'Pilani'))
#> ['Pilani']

9.11. Match exactly zero or one occurrence

print(re.findall(r'colou?r', 'color'))
#> ['color']

 

9.12. Match word boundaries

Word boundaries \b are commonly used to detect and match the beginning or end of a word. That is, one side is a word character and the other side is a non-word character (such as whitespace or punctuation) or the start or end of the string.

For example, the regex \btoy will match the ‘toy’ in ‘toy cat’ and not in ‘tolstoy’. In order to match the ‘toy’ in ‘tolstoy’, you should use toy\b.

Can you come up with a regex that will match only the first ‘toy’ in ‘play toy broke toys’? (hint: \b on both sides)

Likewise, \B will match any non-boundary. For example, \Btoy\B will match ‘toy’ surrounded by word characters on both sides, as in ‘antoynet’.

re.findall(r'\btoy\b', 'play toy broke toys')  # match toy with boundary on both sides
#> ['toy']
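The ‘toy cat’ and ‘tolstoy’ cases can be checked with a short sketch. It records the start position of every match so you can see which ‘toy’ was found; the sample strings come from the examples in the text.

```python
import re

s = 'toy cat tolstoy'

# start positions of each match, to tell the two occurrences of 'toy' apart
starts = [m.start() for m in re.finditer(r'\btoy', s)]  # boundary on the left
ends = [m.start() for m in re.finditer(r'toy\b', s)]    # boundary on the right
inner = re.findall(r'\Btoy\B', 'antoynet')               # no boundary on either side

print(starts)
#> [0]
print(ends)
#> [0, 12]
print(inner)
#> ['toy']
```

Position 12 is the ‘toy’ at the end of ‘tolstoy’, which only the toy\b version reaches.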

 

10. Practice Exercises

Let’s get some practice.

It’s time to open up your python console.

1. Extract the user id, domain name and suffix from the following email addresses.

emails = """zuck26@facebook.com
page33@google.com
jeff42@amazon.com"""

desired_output = [('zuck26', 'facebook', 'com'),
 ('page33', 'google', 'com'),
 ('jeff42', 'amazon', 'com')]
 
# Solution
pattern = r'(\w+)@([A-Z0-9]+)\.([A-Z]{2,4})'
re.findall(pattern, emails, flags=re.IGNORECASE)
#> [('zuck26', 'facebook', 'com'),
#>  ('page33', 'google', 'com'),
#>  ('jeff42', 'amazon', 'com')]

Use groups with (). There are more sophisticated patterns for matching the email domain and suffix. This is just one version of the answer.

2. Retrieve all the words starting with ‘b’ or ‘B’ from the following text.

text = """Betty bought a bit of butter, But the butter was so bitter, So she bought some better butter, To make the bitter butter better."""
# Solution
import re
re.findall(r'\bB\w+', text, flags=re.IGNORECASE)
#> ['Betty', 'bought', 'bit', 'butter', 'But', 'butter', 'bitter', 'bought', 'better', 'butter', 'bitter', 'butter', 'better']

‘\b’ mandates that the left of ‘B’ is a word boundary, effectively requiring the word to start with ‘B’. Setting the ‘flags’ argument to ‘re.IGNORECASE’ makes the pattern case insensitive.

3. Split the following irregular sentence into words

sentence = """A, very   very; irregular_sentence"""
desired_output = "A very very irregular sentence"
# Solution
import re
" ".join(re.split(r'[;,\s_]+', sentence))
#> 'A very very irregular sentence'

Add more delimiters into the pattern as needed.

4. Clean up the following tweet so that it contains only the user’s message. That is, remove all URLs, hashtags, mentions, punctuations, RTs and CCs.

tweet = '''Good advice! RT @TheNextWeb: What I would do differently if I was learning to code today http://t.co/lbwej0pxOd cc: @garybernhardt #rstats'''

desired_output = 'Good advice What I would do differently if I was learning to code today'
# Solution
import re

def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove RT and cc as whole words
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet)  # remove punctuations
    tweet = re.sub(r'\s+', ' ', tweet)  # collapse extra whitespace
    return tweet.strip()  # drop leading and trailing whitespace

print(clean_tweet(tweet))
#> Good advice What I would do differently if I was learning to code today

5. Extract all the text portions between the tags from the following HTML page: https://raw.githubusercontent.com/selva86/datasets/master/sample.html

Code to retrieve the HTML page:

import requests
r = requests.get("https://raw.githubusercontent.com/selva86/datasets/master/sample.html")
r.text  # html text is contained here 

desired_output = ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']
# Solution
import re
re.findall('<.*?>(.*)</.*?>', r.text)
#> ['Your Title Here', 'Link Name', 'This is a Header', 'This is a Medium Header', 'This is a new paragraph! ', 'This is a another paragraph!', 'This is a new sentence without a paragraph break, in bold italics.']

 

11. Conclusion

I hope you enjoyed reading this.

The purpose of this post was to introduce you to regular expressions in a simplified way that you can remember, and to give you something you can use as a future reference.

