Posted on Sep 21, 2024

Text Preprocessing for NLP

#75daysofllm #llm #ai

Day 2: Text Preprocessing for NLP

As part of my #75DaysOfLLM journey, we’re diving into Text Preprocessing. Text preprocessing transforms raw text into clean, structured data for machines to analyze. In this post, we’ll explore the steps involved in preprocessing text, from cleaning to tokenization, stop word removal, and more.

Text Cleaning

Text often contains unwanted elements like HTML tags, punctuation, numbers, and special characters that don’t add value. Cleaning the text involves removing these elements to reduce noise and focus on meaningful content.

Sample Code:

import re

# Sample text
text = "Hello! This is <b>sample</b> text with numbers (1234) and punctuation!!"

# Removing HTML tags and special characters
cleaned_text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)  # Remove punctuation/numbers
print(cleaned_text)
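One thing to watch with the regex approach above: deleting a tag outright can glue neighbouring words together, and removed characters leave stray spaces behind. Here is a minimal sketch of a slightly more defensive cleaner using only the standard library. The function name `clean_text` is my own choice for illustration, and the "keep only ASCII letters" rule is an assumption that will not suit every dataset.

```python
import html
import re

def clean_text(raw: str) -> str:
    """Strip tags, entities, digits and punctuation, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)       # replace tags with a space so words don't fuse
    text = html.unescape(text)                # turn entities like "&amp;" back into characters
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean_text("Hello! This is <b>sample</b> text with numbers (1234) and punctuation!!"))
# Hello This is sample text with numbers and punctuation
```

Replacing with a space instead of an empty string is the main design choice here; the final whitespace collapse cleans up the extra gaps it creates.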

Tokenization

Tokenization is the process of breaking down text into smaller units, usually words or sentences. This step allows us to work with individual words (tokens) rather than a continuous stream of text. Each token serves as a unit of meaning that the NLP model can understand.

In tokenization, there are two common approaches:

Word Tokenization: Splits the text into individual words.
Sentence Tokenization: Splits the text into sentences, which is useful for certain applications like text summarization.

Sample Code:

from nltk.tokenize import word_tokenize

# Tokenization (one-time setup: nltk.download('punkt'); newer NLTK releases use 'punkt_tab')
tokens = word_tokenize(cleaned_text)
print(tokens)
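To make the difference between word and sentence tokenization concrete, here is a rough regex-based sketch that needs no downloads. It is a simplification and not how NLTK tokenizes: the helper names `split_sentences` and `split_words` are made up for this post, and the sentence rule will wrongly split on abbreviations like "Dr. Smith".

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., ! or ? when followed by whitespace (naive: breaks on abbreviations)
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence: str) -> list[str]:
    # A token is a word (optionally with an internal apostrophe) or one punctuation mark
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

doc = "Tokenization splits text. It isn't hard! Is it?"
for sentence in split_sentences(doc):
    print(sentence, "->", split_words(sentence))
```

Splitting into sentences first and words second keeps the sentence boundaries around, which is what tasks like summarization need.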

Stop Word Removal

Stop words are common words (like "the", "is", "and") that appear frequently but don’t contribute much meaning. Removing them shrinks the data and lets later steps focus on the more informative words.

Sample Code:

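from nltk.corpus import stopwords

# Removing stop words (one-time setup: nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
# The list is lowercase, so compare lowercased tokens to catch "This" as well as "this"
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)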
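If you want to see the mechanics without NLTK, stop word removal is just a set-membership filter. The sketch below uses a tiny hand-picked stop list (`STOP_WORDS` is an assumption for illustration; NLTK's English list is much longer) and compares in lowercase so capitalized words are caught too.

```python
# Tiny illustrative stop list, NOT the NLTK list
STOP_WORDS = {"a", "an", "and", "is", "in", "of", "the", "this", "to", "with"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Compare in lowercase so "This" and "this" are treated alike
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["This", "is", "sample", "text", "with", "numbers", "and", "punctuation"]
kept = remove_stop_words(tokens)
print(kept)  # ['sample', 'text', 'numbers', 'punctuation']
print(f"{len(tokens)} tokens -> {len(kept)} tokens")
```

Whether a word counts as noise depends on the task. NLTK's English list includes "not", for example, which you would probably want to keep for sentiment analysis.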

Stemming and Lemmatization

Both stemming and lemmatization reduce words to a base form, so that different forms of the same word (e.g., "running" and "ran") can be treated as one. They differ in how they get there:

Stemming: Stemming involves chopping off word endings to get to the root form, without worrying about whether the resulting word is valid. It’s fast but less accurate.
Lemmatization: Lemmatization reduces words to their base form, but it uses vocabulary and morphological analysis to return valid words. This makes it more accurate than stemming.

Common Algorithms for Stemming and Lemmatization:

Porter Stemmer: One of the most popular stemming algorithms.
Lancaster Stemmer: A more aggressive stemmer that may truncate words more drastically.
WordNet Lemmatizer: Part of the NLTK library, it uses a dictionary to find the correct lemma of a word.

Sample Code:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

# Lemmatization (one-time setup: nltk.download('wordnet'))
# lemmatize() treats each token as a noun unless a pos argument is given, e.g. pos='v'
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)
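To show why stemming can produce non-words while lemmatization does not, here is a deliberately crude toy version of each. Everything in it is hypothetical: `SUFFIXES` is a made-up rule set (not the Porter algorithm) and `LEMMAS` is a four-entry stand-in for a real dictionary such as WordNet.

```python
# Hypothetical toy suffix rules, NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical mini-dictionary standing in for a real vocabulary such as WordNet
LEMMAS = {"running": "run", "ran": "run", "studies": "study", "better": "good"}

def toy_stem(word: str) -> str:
    # Chop the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    # Look the word up; fall back to the word itself when it is unknown
    return LEMMAS.get(word, word)

for w in ["running", "ran", "studies", "better"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer returns "runn" and "studi" and cannot connect "ran" to "run" at all, while the lookup returns real words but only for entries it knows about. That is the speed versus accuracy trade-off in miniature.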

Expanding Contractions

Contractions are shortened versions of word groups, such as “can’t” for “cannot” or “I’m” for “I am.” While contractions are natural in everyday language, it’s often useful to expand them during text preprocessing to maintain consistency.

Sample Code:

import contractions

# Expanding contractions
expanded_text = contractions.fix("I can't do this right now.")
print(expanded_text)
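Under the hood, contraction expansion is mostly a lookup table plus a regex. The sketch below is my own minimal version and not how the `contractions` package is implemented; the `CONTRACTIONS` map is a hypothetical four-entry sample, and replacements do not preserve the original capitalization.

```python
import re

# Hypothetical mini-map; real libraries ship far more entries
CONTRACTIONS = {"can't": "cannot", "i'm": "I am", "it's": "it is", "don't": "do not"}

# One alternation of all keys, matched case-insensitively on word boundaries
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", re.IGNORECASE
)

def expand_contractions(text: str) -> str:
    # Normalize the curly apostrophe so both apostrophe styles hit the same key
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure it's fine, but I can't do this right now."))
# I am sure it is fine, but I cannot do this right now.
```

Order matters in a pipeline: expand contractions before stripping punctuation, because the cleaning regex from earlier removes the apostrophe and would turn "can't" into "cant".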

Spell Check

Spelling errors can hurt NLP models because each misspelling looks like a separate, rare word. Automatic spell check tools can detect and correct common misspellings.

Sample Code:

from textblob import TextBlob

# Spell check
text_with_typos = "Ths is an exmple of txt with speling errors."
corrected_text = str(TextBlob(text_with_typos).correct())
print(corrected_text)
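For intuition about what a spell checker does, here is a rough standard-library sketch that replaces each unknown word with its closest match from a known vocabulary, using `difflib` string similarity. This is a different and much simpler technique than TextBlob's `correct()`, and `VOCAB` is a hypothetical nine-word list; a real checker needs a large word list and word frequencies.

```python
import difflib

# Hypothetical tiny vocabulary; a real checker uses a large word list
VOCAB = ["this", "is", "an", "example", "of", "text", "with", "spelling", "errors"]

def correct_word(word: str, cutoff: float = 0.6) -> str:
    # Known words pass through unchanged
    if word.lower() in VOCAB:
        return word
    # Otherwise take the most similar vocabulary entry, if any clears the cutoff
    matches = difflib.get_close_matches(word.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word

sentence = "Ths is an exmple of txt with speling errors"
print(" ".join(correct_word(w) for w in sentence.split()))
```

The `cutoff` controls how aggressive the correction is: raise it to leave more unknown words alone, lower it to correct more at the risk of changing words that were fine, such as names.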

Conclusion

Preprocessing is a vital first step in any NLP pipeline. Cleaning, tokenizing, removing stop words, and handling tasks like stemming and spell checking ensure that your text data is ready for analysis. Stay tuned for the next part of my #75DaysOfLLM challenge as we dive deeper into NLP and language models!