DEV Community

Naresh Nishad

Text Preprocessing for NLP

Day 2: Text Preprocessing for NLP

As part of my #75DaysOfLLM journey, we’re diving into Text Preprocessing. Text preprocessing transforms raw text into clean, structured data that machines can analyze. In this post, we’ll walk through the main steps: cleaning, tokenization, stop word removal, stemming and lemmatization, expanding contractions, and spell checking.

Text Cleaning

Text often contains unwanted elements like HTML tags, punctuation, numbers, and special characters that add little value for many tasks. Cleaning the text means removing these elements to reduce noise and keep the meaningful content.

Sample Code:

    import re

    # Sample text
    text = "Hello! This is <b>sample</b> text with numbers (1234) and punctuation!!"

    # Removing HTML tags and special characters
    cleaned_text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)  # Remove punctuation/numbers
    print(cleaned_text)
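The two substitutions above leave extra spaces where the numbers and punctuation used to be. Here is a minimal sketch of a helper that also collapses whitespace and lowercases the result. The name `clean_text` and the lowercasing step are assumptions added for illustration, not part of the original snippet:

```python
import re


def clean_text(raw: str) -> str:
    """Strip HTML-like tags and non-letter characters, then normalize whitespace."""
    no_tags = re.sub(r"<.*?>", "", raw)                 # drop anything that looks like a tag
    letters_only = re.sub(r"[^a-zA-Z\s]", "", no_tags)  # drop digits, punctuation, symbols
    collapsed = re.sub(r"\s+", " ", letters_only)       # collapse runs of whitespace
    return collapsed.strip().lower()


print(clean_text("Hello! This is <b>sample</b> text with numbers (1234) and punctuation!!"))
# hello this is sample text with numbers and punctuation
```

The tag-stripping regex is a heuristic. For real-world HTML, a proper parser is usually the safer choice.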

Tokenization

Tokenization is the process of breaking down text into smaller units, usually words or sentences. This step allows us to work with individual words (tokens) rather than a continuous stream of text. Each token serves as a unit of meaning that the NLP model can understand.

In tokenization, there are two common approaches:

  • Word Tokenization: Splits the text into individual words.
  • Sentence Tokenization: Splits the text into sentences, which is useful for certain applications like text summarization.

Sample Code:

    from nltk.tokenize import word_tokenize

    # Tokenization (needs NLTK's Punkt tokenizer data, installed via nltk.download)
    tokens = word_tokenize(cleaned_text)
    print(tokens)
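The sample code covers word tokenization only. As a rough illustration of the sentence-level approach, here is a naive regex-based splitter that uses only the standard library. The helper names `split_sentences` and `split_words` are hypothetical. The splitting rule (break after `.`, `!`, or `?` followed by whitespace) is a simplifying assumption that will mis-split abbreviations such as "Dr."; NLTK's `sent_tokenize` is the more robust option in practice.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naively split text into sentences after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence: str) -> list[str]:
    """Naively split a sentence into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "Tokenization comes first. It splits text into units! Is that all?"
for sentence in split_sentences(sample):
    print(split_words(sentence))
```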

Stop Word Removal

Stop words are common words (like "the", "is", "and") that appear frequently but don’t contribute much meaning. Removing them reduces the amount of data to process and keeps the focus on more informative words.

Sample Code:

    from nltk.corpus import stopwords

    # Removing stop words (the NLTK list is lowercase, so compare lowercased tokens)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    print(filtered_tokens)
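One caveat is worth sketching: NLTK's English stop word list is lowercase and includes negations such as "not", so filtering can change what a sentence means. The example below uses a tiny hand-written stop list (an assumption for illustration, not NLTK's actual list), matches case-insensitively, and accepts an optional set of words to keep. The names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# Tiny illustrative stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "and", "this", "a", "not", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop stop words case-insensitively, except those listed in `keep`."""
    effective = {w.lower() for w in stop_words} - {w.lower() for w in keep}
    return [tok for tok in tokens if tok.lower() not in effective]


tokens = ["This", "movie", "is", "not", "good"]
print(remove_stop_words(tokens))                # ['movie', 'good'] (negation lost)
print(remove_stop_words(tokens, keep={"not"}))  # ['movie', 'not', 'good']
```

For tasks like sentiment analysis, keeping negations is often the safer default.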

Stemming and Lemmatization

Both stemming and lemmatization reduce words to a base form, so that different forms of the same word (e.g., "running" and "ran") can be treated as one. The two techniques trade speed against accuracy:

  • Stemming: Stemming involves chopping off word endings to get to the root form, without worrying about whether the resulting word is valid. It’s fast but less accurate.
  • Lemmatization: Lemmatization reduces words to their base form, but it uses vocabulary and morphological analysis to return valid words. This makes it more accurate than stemming.

Common Algorithms for Stemming and Lemmatization:

  • Porter Stemmer: One of the most popular stemming algorithms.
  • Lancaster Stemmer: A more aggressive stemmer that may truncate words more drastically.
  • WordNet Lemmatizer: Part of the NLTK library, it uses a dictionary to find the correct lemma of a word.

Sample Code:

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    print("Stemmed Tokens:", stemmed_tokens)

    # Lemmatization (lemmatize() treats each token as a noun unless a pos argument is given)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    print("Lemmatized Tokens:", lemmatized_tokens)
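To make the contrast concrete without any NLTK data, here is a toy sketch: a crude suffix stripper stands in for a stemmer and a small lookup table stands in for a lemmatizer. The functions `toy_stem` and `toy_lemmatize`, the suffix list, and the lemma table are all hypothetical simplifications. This is not the Porter algorithm and not WordNet.

```python
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order

# Hand-written lemma table; a real lemmatizer consults a full dictionary.
LEMMAS = {"running": "run", "ran": "run", "studies": "study", "better": "good"}


def toy_stem(word: str) -> str:
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for word in ["running", "ran", "studies", "better"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The stems "runn" and "stud" are not valid words, while the lookup returns real words but only for forms it already knows. That is the speed-versus-accuracy trade-off in miniature.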

Expanding Contractions

Contractions are shortened versions of word groups, such as “can’t” for “cannot” or “I’m” for “I am.” While contractions are natural in everyday language, it’s often useful to expand them during text preprocessing to maintain consistency.

Sample Code:

    import contractions

    # Expanding contractions
    expanded_text = contractions.fix("I can't do this right now.")
    print(expanded_text)
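Contraction expansion can be approximated with a dictionary lookup. Here is a minimal standard-library sketch that assumes a small hand-written mapping. `CONTRACTIONS` and `expand_contractions` are hypothetical names, and this is not a description of how the `contractions` package works internally.

```python
import re

# Small illustrative mapping; a real list would have many more entries.
CONTRACTIONS = {
    "can't": "cannot",
    "i'm": "I am",
    "it's": "it is",
    "don't": "do not",
}

# Match any known contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions, normalizing curly apostrophes first."""
    normalized = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], normalized)


print(expand_contractions("I can't do this right now."))
# I cannot do this right now.
```

A plain lookup cannot resolve ambiguous forms ("it's" may mean "it is" or "it has"), and this sketch does not preserve the capitalization of the replaced word.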

Spell Check

Spelling errors can hurt NLP models because a misspelled word is treated as a different, often unknown, token. Automatic spell check tools can detect and correct common misspellings.

Sample Code:

    from textblob import TextBlob

    # Spell check
    text_with_typos = "Ths is an exmple of txt with speling errors."
    corrected_text = str(TextBlob(text_with_typos).correct())
    print(corrected_text)
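Conceptually, a spell checker looks for known words that closely resemble an unknown token. Here is a rough standard-library sketch that uses `difflib.get_close_matches` against a tiny vocabulary. The vocabulary, the `correct_word` helper, and the 0.7 similarity cutoff are assumptions for illustration. TextBlob's own corrector works differently and should not be expected to give identical results.

```python
import difflib

# Tiny illustrative vocabulary; a real checker uses a large word list.
VOCABULARY = ["this", "is", "an", "example", "of", "text", "with", "spelling", "errors"]


def correct_word(word: str, cutoff: float = 0.7) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word.lower() in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word


typos = ["ths", "exmple", "txt", "speling", "nlp"]
print([correct_word(word) for word in typos])
# ['this', 'example', 'text', 'spelling', 'nlp']
```

Tokens with no close match, such as "nlp" here, are left untouched. This matters for domain terms and names that a general vocabulary does not contain.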

Conclusion

Preprocessing is a vital first step in any NLP pipeline. Cleaning, tokenizing, removing stop words, and handling tasks like stemming and spell checking help ensure that your text data is ready for analysis. Stay tuned for the next part of my #75DaysOfLLM challenge as we dive deeper into NLP and language models!
