Natural Language Processing

From Wikiversity

Introduction

Natural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

The field is divided into two major categories:

speech recognition and generation, and
text interpretation and generation.

The technologies needed for the two are very different: "speech" is normally addressed by electronic or telecommunications engineers, while "text" is more often addressed by computer scientists. Additionally, speech recognition and speech generation are normally architected as a layer over a text recognition/generation engine.

To follow a course on NLP, you must know about formal languages and grammars, and be able to program fluently. For the practicals, you must learn a logic programming language; Prolog is recommended. A rule-based programming language can serve as well, although it is perhaps more difficult to master.

Document Decomposition

If you look at the text above:

the document consists of sections,
each section consists of paragraphs,
each paragraph consists of sentences,
each sentence consists of words, and
words consist of characters/symbols.
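
The lower levels of this hierarchy can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: paragraphs are taken to be separated by blank lines, sentences to end with ".", "!" or "?" followed by whitespace, and words to be runs of word characters. The function name `decompose` and the sample text are made up for this example, and the section level is left out because plain text carries no section markers.

```python
import re

def decompose(text):
    """Split plain text into paragraphs -> sentences -> words (nested lists)."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        # Assumption: a word is a maximal run of word characters.
        result.append([re.findall(r"\w+", s) for s in sentences])
    return result

sample = "NLP is fun. It is hard!\n\nA second paragraph."
tree = decompose(sample)
print(tree)
# [[['NLP', 'is', 'fun'], ['It', 'is', 'hard']], [['A', 'second', 'paragraph']]]

# The last level: each word is itself a sequence of characters.
print(list(tree[0][0][0]))  # ['N', 'L', 'P']
```

Real text breaks these rules quickly (abbreviations such as "e.g." end with a period without ending a sentence), which is one reason sentence and word segmentation are treated as NLP problems in their own right.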

Furthermore, the text contains more information we could identify, e.g. a header or links. In general, a document can contain other document elements such as images, audio, video, or interactive elements like forms or geometric constructions that can be manipulated by the reader.

Abstract Syntax Tree

A tree can represent the decomposition of text into substructures. For example, the decomposition of a text into sections forms the top level of an Abstract Syntax Tree (AST).
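
One possible way to hold such a tree in a program is a small node type with a kind, a label and a list of children. The sketch below is an assumption made for illustration, not a prescribed representation: the names `Node` and `render` are hypothetical, and the section labels are simply the headings of this page.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                      # e.g. "document", "section", "paragraph"
    label: str = ""                # e.g. the header text of a section
    children: List["Node"] = field(default_factory=list)

def render(node, depth=0):
    """Return the tree as a list of indented lines, one line per node."""
    text = f"{node.kind}: {node.label}" if node.label else node.kind
    lines = ["  " * depth + text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Top level of the tree: the document and its sections.
document = Node("document", "Natural Language Processing", [
    Node("section", "Introduction"),
    Node("section", "Document Decomposition"),
    Node("section", "Abstract Syntax Tree"),
    Node("section", "Learning Activities"),
])

print("\n".join(render(document)))
```

Deeper levels (paragraphs, sentences, words) would be added as children of the section nodes in the same way.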

Learning Activities

(Abstract Syntax Tree) Draw the top level of the Abstract Syntax Tree (AST) that decomposes this text into sections.
(Language Section Decomposition) Look into specific languages like HTML, Markdown, LaTeX, ... and identify the syntax elements that are available to mark headers in a text.
(Regular Expression) Write a regular expression, e.g. in JavaScript, that extracts headers (e.g. "Introduction") and the level of the header (i.e. section, subsection, ...).
- In HTML, the levels are defined with the h1, h2, ... tags, e.g. for a section header "Introduction" (level 1) and a subsection "My Subsection" (level 2):

  <h1>Introduction</h1>
  Text
  <h2>My Subsection</h2>
  Text of subsection
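
As a hedged starting point for the regular-expression activity, the following Python sketch extracts the level and the title from h1 to h6 tags. The activity suggests JavaScript; Python is used here only for illustration. The names `HEADER_RE` and `html_headers` are hypothetical, and a regular expression is not a full HTML parser: the pattern assumes that each header opens and closes with the same tag and that headers are not nested.

```python
import re

# Group 1 captures the level digit, group 2 the header text.
# \1 forces the closing tag to carry the same digit as the opening tag.
HEADER_RE = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)

def html_headers(html):
    """Return a list of (level, title) pairs in document order."""
    return [(int(level), title.strip()) for level, title in HEADER_RE.findall(html)]

html = "<h1>Introduction</h1> Text <h2>My Subsection </h2> Text of subsection"
print(html_headers(html))  # [(1, 'Introduction'), (2, 'My Subsection')]
```

Any markup inside a header (for example a link) is returned as part of the title and would need a further cleaning step.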

- In LaTeX, the levels are defined with the \section{...}, \subsection{...}, ... commands, e.g. a section header "Introduction" (level 1) and a subsection "My Subsection" (level 2) are defined with the following commands:

  \section{Introduction}
  Text
  \subsection{My Subsection}
  Text of subsection
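
The same idea can be sketched for LaTeX. Here the level is not a digit in the markup, so the command name is mapped to a number. The names `LEVELS`, `SECTION_RE` and `latex_headers` are hypothetical, and the pattern assumes that header titles contain no nested braces; starred variants such as \section*{...} are accepted as well.

```python
import re

# Map each sectioning command to its level.
LEVELS = {"section": 1, "subsection": 2, "subsubsection": 3}

# Group 1 captures the command name, group 2 the title between the braces.
SECTION_RE = re.compile(r"\\((?:sub){0,2}section)\*?\{([^{}]*)\}")

def latex_headers(source):
    """Return a list of (level, title) pairs in document order."""
    return [(LEVELS[cmd], title.strip()) for cmd, title in SECTION_RE.findall(source)]

latex = r"\section{Introduction} Text \subsection{My Subsection} Text of subsection"
print(latex_headers(latex))  # [(1, 'Introduction'), (2, 'My Subsection')]
```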
