This repository was archived by the owner on Feb 12, 2024. It is now read-only.

Nette Tokenizer [DISCONTINUED]

Introduction

Tokenizer is a tool that uses regular expressions to split a given string into tokens. What the hell is that good for, you might ask? Well, you can create your own languages!

Documentation can be found on the website (doc.nette.org/tokenizer). If you like it, please make a donation now. Thank you!

Installation:

    composer require nette/tokenizer

It requires PHP version 7.1 and supports PHP up to 8.1.

Support Me

Do you like Nette Tokenizer? Are you looking forward to the new features?

https://nette.org/donate

Thank you!

Usage

Let's create a simple tokenizer that splits a string into numbers, whitespace, and letters.

    $tokenizer = new Nette\Tokenizer\Tokenizer([
        T_DNUMBER => '\d+',
        T_WHITESPACE => '\s+',
        T_STRING => '\w+',
    ]);

Hint: In case you are wondering where the T_ constants come from, they are internal types that PHP uses for parsing code. They cover most of the common token names we usually need. Keep in mind that their values are not guaranteed, so don't use numbers for comparison.
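To illustrate the hint, here is a small sketch showing that the T_ constants are plain integers and that token_name() translates them back to names. It assumes PHP's tokenizer extension is enabled (it usually is by default); the variable names are only illustrative.

```php
<?php
// T_* constants are integers provided by PHP's tokenizer extension.
$type = T_STRING;

// Compare against the constant, never against a literal number,
// because the numeric value is not guaranteed to stay the same.
$isString = ($type === T_STRING);

// token_name() translates the integer back to a readable name.
$name = token_name($type);

var_dump(is_int($type)); // bool(true)
var_dump($isString);     // bool(true)
echo $name, "\n";        // T_STRING
```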

Now when we give it a string, it will return a stream (Nette\Tokenizer\Stream) of tokens (Nette\Tokenizer\Token).

    $stream = $tokenizer->tokenize("say\n123");

The resulting array of tokens, $stream->tokens, would look like this.

    [
        new Token('say', T_STRING, 0),
        new Token("\n", T_WHITESPACE, 3),
        new Token('123', T_DNUMBER, 4),
    ]

You can also access the individual properties of a token:

    $firstToken = $stream->tokens[0];
    $firstToken->value; // say
    $firstToken->type; // value of T_STRING
    $firstToken->offset; // position in string: 0

Simple, isn't it?
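To get a feel for what such a tokenizer does, here is a minimal sketch of regex-based tokenization in plain PHP. It is not the library's implementation, only an illustration under the assumption that each pattern is tried at the current offset in the order given. The function name sketchTokenize() and the [value, type, offset] array layout are hypothetical.

```php
<?php
/**
 * Hypothetical helper: splits $input into [value, type, offset] triples.
 * Patterns are tried in order at the current offset; the first match wins.
 */
function sketchTokenize(array $patterns, string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach ($patterns as $type => $pattern) {
            // The A modifier anchors the match at $offset.
            if (preg_match('~' . $pattern . '~A', $input, $m, 0, $offset) && $m[0] !== '') {
                $tokens[] = [$m[0], $type, $offset];
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new InvalidArgumentException("Unexpected character at offset $offset.");
        }
    }

    return $tokens;
}

$tokens = sketchTokenize([
    T_DNUMBER => '\d+',
    T_WHITESPACE => '\s+',
    T_STRING => '\w+',
], "say\n123");

// Print each token as: TYPE "value" at offset
foreach ($tokens as [$value, $type, $offset]) {
    printf("%s %s at %d\n", token_name($type), json_encode($value), $offset);
}
```

Note that the order of the patterns matters in this sketch: \w+ also matches digits, so \d+ has to come first for '123' to be recognized as a number.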

Processing the tokens

Now we know how to create tokens from a string. Let's process them effectively using Nette\Tokenizer\Stream. It has a lot of really awesome methods if you need to traverse tokens!

Let's try to parse a simple annotation from PHPDoc and create an object from it. What regular expressions do we need for tokens? All the annotations start with @, followed by a name, whitespace, and its value.

@ for the annotation start
\s+ for whitespaces
\w+ for strings

(Never use capturing subpatterns in Tokenizer's regular expressions like '(ab)+c', use only non-capturing ones like '(?:ab)+c'.)
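The warning about capturing subpatterns makes sense if the patterns are combined into one alternation in which each pattern gets its own capturing group, and the group that matched identifies the token type. Whether the library does exactly this is an assumption here; the sketch below, with a hypothetical matchedPattern() helper, only shows how an extra capturing group would shift the numbering.

```php
<?php
// Assumption: the patterns are wrapped as (p1)|(p2)|... and the index of
// the last group reported by preg_match() identifies the matching pattern.
function matchedPattern(array $patterns, string $input): int
{
    $regex = '~(' . implode(')|(', $patterns) . ')~A';
    preg_match($regex, $input, $m);
    // preg_match() drops trailing unmatched groups, so the highest
    // index present belongs to the last group that took part in the match.
    return count($m) - 1;
}

// Non-capturing subpattern: 'abc' is correctly attributed to pattern 1.
$good = matchedPattern(['(?:ab)+c', 'x'], 'abc');

// Capturing subpattern: the inner (ab) becomes group 2, so the same
// input now looks as if pattern 2 had matched.
$bad = matchedPattern(['(ab)+c', 'x'], 'abc');

var_dump($good); // int(1)
var_dump($bad);  // int(2)
```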

This should work on simple annotations, right? Now let's look at the input string that we will try to parse.

    $input = '
        @author David Grudl
        @package Nette
    ';

Let's create a Parser class that will accept the string and return an array of pairs [name, value]. It will be very naive and simple.

    use Nette\Tokenizer\Tokenizer;
    use Nette\Tokenizer\Stream;

    class Parser
    {
        const T_AT = 1;
        const T_WHITESPACE = 2;
        const T_STRING = 3;

        /** @var Tokenizer */
        private $tokenizer;

        /** @var Stream */
        private $stream;

        public function __construct()
        {
            $this->tokenizer = new Tokenizer([
                self::T_AT => '@',
                self::T_WHITESPACE => '\s+',
                self::T_STRING => '\w+',
            ]);
        }

        public function parse(string $input): array
        {
            $this->stream = $this->tokenizer->tokenize($input);

            $result = [];
            while ($this->stream->nextToken()) {
                if ($this->stream->isCurrent(self::T_AT)) {
                    $result[] = $this->parseAnnotation();
                }
            }

            return $result;
        }

        private function parseAnnotation(): array
        {
            $name = $this->stream->joinUntil(self::T_WHITESPACE);
            $this->stream->nextUntil(self::T_STRING);
            $content = $this->stream->joinUntil(self::T_AT);

            return [$name, trim($content)];
        }
    }

    $parser = new Parser;
    $annotations = $parser->parse($input);

So what does the parse() method do? It iterates over the tokens and searches for @, the symbol annotations start with. Calling nextToken() moves the cursor to the next token. The isCurrent() method checks whether the current token at the cursor is of the given type. Then, if @ is found, parse() calls parseAnnotation(), which expects the annotation to be in a very specific format.

First, using the joinUntil() method, the stream keeps moving the cursor and appending token values to a buffer until it finds a token of the required type, then stops and returns the buffer. Because there is only one T_STRING token at that position, the annotation's name (for example 'author'), that value ends up in the $name variable.

The nextUntil() method is similar to joinUntil(), but it has no buffer. It only moves the cursor until it finds the token. So this call simply skips all the whitespace after the annotation name.

And then there is another joinUntil(), which searches for the next @. For the first annotation, this call returns the content together with any whitespace that precedes the next @, for example "David Grudl\n ".

And there we go, we've parsed one whole annotation! The $content probably ends with whitespace, so we have to trim it. Now we can return this specific annotation as the pair [$name, $content].

Try copying the code and running it. If you dump the $annotations variable, the output should look similar to this.

    array (2)
       0 => array (2)
       |  0 => 'author'
       |  1 => 'David Grudl'
       1 => array (2)
       |  0 => 'package'
       |  1 => 'Nette'

Stream methods

The stream can return the current token using the currentToken() method, or only its value using currentValue().

nextToken() moves the cursor and returns the token. If you give it no arguments, it simply returns the next token.

nextValue() is just like nextToken(), but it only returns the token value.

Most of the methods also accept multiple arguments so you can search for multiple types at once.

    // iterate until a string or a whitespace is found, then return the following token
    $token = $stream->nextToken(T_STRING, T_WHITESPACE);

    // give me next token
    $token = $stream->nextToken();

You can also search by the token value.

    // move the cursor until you find token containing only '@', then stop and return it
    $token = $stream->nextToken('@');

nextUntil() moves the cursor and returns an array of all the tokens it sees until it finds the desired token, but it stops before that token. It can accept multiple arguments.

joinUntil() is similar to nextUntil(), but it concatenates the values of all the tokens it passes through and returns a string.

joinAll() simply concatenates all the remaining token values and returns the resulting string. It moves the cursor to the end of the token stream.

nextAll() is just like joinAll(), but it returns an array of the tokens.

isCurrent() checks if the current token or the current token's value is equal to one of the given arguments.

    // is the current token '@' or type of T_AT?
    $stream->isCurrent(T_AT, '@');

isNext() is just like isCurrent(), but it checks the next token.

isPrev() is just like isCurrent(), but it checks the previous token.

And the last method, reset(), resets the cursor so you can iterate over the token stream again.
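The cursor semantics described in this section can be sketched with a tiny stand-in class. SketchStream is hypothetical and heavily simplified (single type argument, tokens as [value, type] pairs); it mimics the behaviour described above and is not the library's Stream class.

```php
<?php
// Hypothetical, simplified cursor over [value, type] pairs.
final class SketchStream
{
    /** @var array */
    private $tokens;

    /** @var int */
    private $position = -1;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    /** Moves the cursor one step and returns the token there, or null at the end. */
    public function nextToken(): ?array
    {
        if (!isset($this->tokens[$this->position + 1])) {
            return null;
        }
        return $this->tokens[++$this->position];
    }

    /** Collects tokens up to, but not including, the first one of type $type. */
    public function nextUntil(int $type): array
    {
        $seen = [];
        while (isset($this->tokens[$this->position + 1])
            && $this->tokens[$this->position + 1][1] !== $type
        ) {
            $seen[] = $this->tokens[++$this->position];
        }
        return $seen;
    }

    /** Like nextUntil(), but returns the concatenated token values. */
    public function joinUntil(int $type): string
    {
        return implode('', array_column($this->nextUntil($type), 0));
    }

    /** Puts the cursor back before the first token. */
    public function reset(): void
    {
        $this->position = -1;
    }
}

// Types: 1 = '@', 2 = whitespace, 3 = string (same numbering as the Parser example).
$stream = new SketchStream([
    ['@', 1], ['author', 3], [' ', 2], ['David', 3], [' ', 2], ['Grudl', 3],
]);

$stream->nextToken();             // cursor on '@'
$name = $stream->joinUntil(2);    // 'author'
$stream->nextUntil(3);            // skips the whitespace
$content = $stream->joinUntil(1); // 'David Grudl'
```

Because nextUntil() stops before the wanted token, a following nextToken() call would land exactly on it, which is what the parse() loop in the Parser example relies on.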

Latest release: version 3.1.0 (Sep 13, 2019).