shlex --- Simple lexical analysis

Source code: Lib/shlex.py


The shlex class makes it easy to write lexical analyzers for simple syntaxes resembling that of the Unix shell. This will often be useful for writing minilanguages (for example, in run control files for Python applications) or for parsing quoted strings.

The shlex module defines the following functions:

shlex.split(s, comments=False, posix=True)

Split the string s using shell-like syntax. If comments is False (the default), the parsing of comments in the given string will be disabled (setting the commenters attribute of the shlex instance to the empty string). This function operates in POSIX mode by default, but uses non-POSIX mode if the posix argument is false.

Changed in version 3.12: Passing None for the s argument now raises an exception, rather than reading sys.stdin.
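As a rough illustration of how the comments and posix arguments affect split(), the sketch below runs one made-up command line through three settings. The sample line and variable names are assumptions chosen for the example.

```python
import shlex

# Hypothetical command line with a quoted argument and a trailing comment.
line = 'cp "my file.txt" backup/ # keep a copy'

# Defaults: POSIX mode, comment parsing disabled, so '#' is an ordinary token.
default_tokens = shlex.split(line)

# comments=True drops everything from '#' to the end of the line.
no_comment_tokens = shlex.split(line, comments=True)

# posix=False keeps the quote characters as part of the token.
compat_tokens = shlex.split(line, comments=True, posix=False)

print(default_tokens)     # ['cp', 'my file.txt', 'backup/', '#', 'keep', 'a', 'copy']
print(no_comment_tokens)  # ['cp', 'my file.txt', 'backup/']
print(compat_tokens)      # ['cp', '"my file.txt"', 'backup/']
```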

shlex.join(split_command)

Concatenate the tokens of the list split_command and return a string. This function is the inverse of split().

>>> from shlex import join
>>> print(join(['echo', '-n', 'Multiple words']))
echo -n 'Multiple words'

The returned value is shell-escaped to protect against injection vulnerabilities (see quote()).

Added in version 3.8.
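The inverse relationship between join() and split() can be checked with a small round trip. The argument list below is invented for illustration and deliberately includes an apostrophe, a shell variable reference, and an empty string.

```python
import shlex

# Hypothetical argument list containing characters a shell would treat specially.
args = ['grep', '-r', "it's here", '$HOME/my docs', '']

# join() quotes each argument so that it survives as a single shell token.
command = shlex.join(args)

# split() (POSIX mode by default) recovers the original list.
round_trip = shlex.split(command)

print(command)
print(round_trip == args)  # True
```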

shlex.quote(s)

Return a shell-escaped version of the string s. The returned value is a string that can safely be used as one token in a shell command line, for cases where you cannot use a list.

Warning

The shlex module is only designed for Unix shells.

The quote() function is not guaranteed to be correct on non-POSIX compliant shells or shells from other operating systems such as Windows. Executing commands quoted by this module on such shells can open up the possibility of a command injection vulnerability.

Consider using functions that pass command arguments with lists such as subprocess.run() with shell=False.

This idiom would be unsafe:

>>> filename = 'somefile; rm -rf ~'
>>> command = 'ls -l {}'.format(filename)
>>> print(command)  # executed by a shell: boom!
ls -l somefile; rm -rf ~

quote() lets you plug the security hole:

>>> from shlex import quote
>>> command = 'ls -l {}'.format(quote(filename))
>>> print(command)
ls -l 'somefile; rm -rf ~'
>>> remote_command = 'ssh home {}'.format(quote(command))
>>> print(remote_command)
ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

The quoting is compatible with UNIX shells and with split():

>>> from shlex import split
>>> remote_command = split(remote_command)
>>> remote_command
['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
>>> command = split(remote_command[-1])
>>> command
['ls', '-l', 'somefile; rm -rf ~']

Added in version 3.3.
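The warning above recommends passing arguments as a list instead of quoting them. A minimal sketch of that approach follows. The child process is a stand-in (a second Python interpreter that echoes its first argument) chosen only so the example runs anywhere; a real program would name its own command.

```python
import subprocess
import sys

filename = 'somefile; rm -rf ~'

# Stand-in child command (an assumption for this sketch): echo argv[1] back.
child = [sys.executable, '-c', 'import sys; print(sys.argv[1])', filename]

# With shell=False no shell parses the arguments, so no quoting is needed
# and the ';' in the filename is never interpreted as a command separator.
result = subprocess.run(child, shell=False, capture_output=True,
                        text=True, check=True)
echoed = result.stdout.rstrip('\n')

print(echoed == filename)  # True
```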

The shlex module defines the following class:

class shlex.shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

A shlex instance or subclass instance is a lexical analyzer object. The initialization argument, if present, specifies where to read characters from. It must be a file-/stream-like object with read() and readline() methods, or a string. If no argument is given, input will be taken from sys.stdin. The second optional argument is a filename string, which sets the initial value of the infile attribute. If the instream argument is omitted or equal to sys.stdin, this second argument defaults to "stdin".

The posix argument defines the operational mode: when posix is not true (default), the shlex instance will operate in compatibility mode. When operating in POSIX mode, shlex will try to be as close as possible to the POSIX shell parsing rules.

The punctuation_chars argument provides a way to make the behaviour even closer to how real shells parse. This can take a number of values: the default value, False, preserves the behaviour seen under Python 3.5 and earlier. If set to True, then parsing of the characters ();<>|& is changed: any run of these characters (considered punctuation characters) is returned as a single token. If set to a non-empty string of characters, those characters will be used as the punctuation characters. Any characters in the wordchars attribute that appear in punctuation_chars will be removed from wordchars. See Improved Compatibility with Shells for more information. punctuation_chars can be set only upon shlex instance creation and can't be modified later.

Changed in version 3.6: The punctuation_chars parameter was added.
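To make the constructor arguments concrete, the sketch below tokenizes one made-up input string in compatibility mode, in POSIX mode, and in POSIX mode with punctuation_chars=True. The input text is an assumption chosen to show the differences.

```python
import shlex

text = 'echo "hello world" > out.txt'

# Compatibility (non-POSIX) mode: quotes are kept, '.' is not a word character.
compat = list(shlex.shlex(text))

# POSIX mode: quotes are stripped from the token.
posix = list(shlex.shlex(text, posix=True))

# punctuation_chars=True also adds ~-./*?= to wordchars, so 'out.txt' stays whole.
punct = list(shlex.shlex(text, posix=True, punctuation_chars=True))

print(compat)  # ['echo', '"hello world"', '>', 'out', '.', 'txt']
print(posix)   # ['echo', 'hello world', '>', 'out', '.', 'txt']
print(punct)   # ['echo', 'hello world', '>', 'out.txt']
```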

See also

Module configparser

Parser for configuration files similar to the Windows .ini files.

shlex Objects

A shlex instance has the following methods:

shlex.get_token()

Return a token. If tokens have been stacked using push_token(), pop a token off the stack. Otherwise, read one from the input stream. If reading encounters an immediate end-of-file, eof is returned (the empty string ('') in non-POSIX mode, and None in POSIX mode).
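One way to drive get_token() by hand is to loop until the lexer's own eof value comes back. The helper name read_all below is hypothetical; comparing against lexer.eof keeps the loop correct in both modes, including for quoted empty strings in POSIX mode.

```python
import shlex

def read_all(lexer):
    """Collect tokens until the lexer reports its end-of-file marker."""
    tokens = []
    while True:
        tok = lexer.get_token()
        if tok == lexer.eof:  # '' in non-POSIX mode, None in POSIX mode
            break
        tokens.append(tok)
    return tokens

posix_tokens = read_all(shlex.shlex("a '' b", posix=True))
compat_tokens = read_all(shlex.shlex("a b"))

print(posix_tokens)   # ['a', '', 'b']
print(compat_tokens)  # ['a', 'b']
```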

shlex.push_token(str)

Push the argument onto the token stack.
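A common use of push_token() is a one-token lookahead: read a token, inspect it, and put it back. The input string here is made up for the sketch.

```python
import shlex

lexer = shlex.shlex('set x 42', posix=True)

# Peek at the first token, then return it to the token stack.
first = lexer.get_token()
lexer.push_token(first)

# The pushed token is handed out again before any new input is read.
again = lexer.get_token()
rest = list(lexer)

print(first, again, rest)  # set set ['x', '42']
```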

shlex.read_token()

Read a raw token. Ignore the pushback stack, and do not interpret source requests. (This is not ordinarily a useful entry point, and is documented here only for the sake of completeness.)

shlex.sourcehook(filename)

When shlex detects a source request (see source below) this method is given the following token as argument, and expected to return a tuple consisting of a filename and an open file-like object.

Normally, this method first strips any quotes off the argument. If the result is an absolute pathname, or there was no previous source request in effect, or the previous source was a stream (such as sys.stdin), the result is left alone. Otherwise, if the result is a relative pathname, the directory part of the name of the file immediately before it on the source inclusion stack is prepended (this behavior is like the way the C preprocessor handles #include "file.h").

The result of the manipulations is treated as a filename, and returned as the first component of the tuple, with open() called on it to yield the second component. (Note: this is the reverse of the order of arguments in instance initialization!)

This hook is exposed so that you can use it to implement directory search paths, addition of file extensions, and other namespace hacks. There is no corresponding 'close' hook, but a shlex instance will call the close() method of the sourced input stream when it returns EOF.

For more explicit control of source stacking, use the push_source() and pop_source() methods.
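The sketch below overrides sourcehook() in a subclass so that inclusion requests are resolved from an in-memory table instead of the file system. The class name MemoryShlex, the FAKE_FILES table, and the file names are all hypothetical; a real hook would typically search directories and call open().

```python
import io
import shlex

# Hypothetical in-memory "files" standing in for files on disk.
FAKE_FILES = {
    'common.rc': 'color on\n',
}

class MemoryShlex(shlex.shlex):
    def sourcehook(self, newfile):
        # Strip quotes (already gone in POSIX mode) and look the name up.
        name = newfile.strip('"')
        # Return (filename, open stream), in that order.
        return name, io.StringIO(FAKE_FILES[name])

lexer = MemoryShlex('source "common.rc"\nverbose off\n',
                    infile='main.rc', posix=True)
lexer.source = 'source'  # enable inclusion requests with this keyword

tokens = list(lexer)
print(tokens)  # ['color', 'on', 'verbose', 'off']
```

The included text is read to its end, the stream is closed by the lexer, and tokenizing resumes in the original input.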

shlex.push_source(newstream, newfile=None)

Push an input source stream onto the input stack. If the filename argument is specified it will later be available for use in error messages. This is the same method used internally by the sourcehook() method.

shlex.pop_source()

Pop the last-pushed input source from the input stack. This is the same method used internally when the lexer reaches EOF on a stacked input stream.
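As an illustration of explicit source stacking, the sketch below pushes a second stream in the middle of tokenizing. The stream contents and the name 'inner.rc' are made up. The lexer pops the pushed source by itself when that stream is exhausted, so pop_source() is not called directly here.

```python
import io
import shlex

lexer = shlex.shlex('outer1 outer2', posix=True)
first = lexer.get_token()

# Stack a new input source; its name becomes the current infile.
lexer.push_source(io.StringIO('inner1 inner2'), 'inner.rc')
name_while_pushed = lexer.infile

# The pushed stream is drained first, then the original input resumes.
rest = list(lexer)

print(first)              # outer1
print(name_while_pushed)  # inner.rc
print(rest)               # ['inner1', 'inner2', 'outer2']
```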

shlex.error_leader(infile=None, lineno=None)

This method generates an error message leader in the format of a Unix C compiler error label; the format is '"%s", line %d: ', where the %s is replaced with the name of the current source file and the %d with the current input line number (the optional arguments can be used to override these).

This convenience is provided to encourage shlex users to generate error messages in the standard, parseable format understood by Emacs and other Unix tools.
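A short sketch of error_leader(), first with the lexer's own file name and line number and then with both overridden. The file names and line number are invented for the example.

```python
import shlex

lexer = shlex.shlex('mode = fast', infile='demo.rc', posix=True)

# Before any input is read, the current line number is 1.
leader = lexer.error_leader()

# Both components can be overridden explicitly.
override = lexer.error_leader('other.rc', 12)

print(leader + 'unexpected token')  # "demo.rc", line 1: unexpected token
print(override)                     # "other.rc", line 12: 
```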

Instances of shlex subclasses have some public instance variables which either control lexical analysis or can be used for debugging:

shlex.commenters

The string of characters that are recognized as comment beginners. All characters from the comment beginner to end of line are ignored. Includes just '#' by default.
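For instance, a minilanguage that uses ';' for comments could reassign commenters as sketched below. The input strings are assumptions made for the example.

```python
import shlex

text = 'name value ; trailing note'

# Default commenters ('#'): ';' is returned as an ordinary token.
default_tokens = list(shlex.shlex(text, posix=True))

# With ';' as the comment character, the rest of the line is ignored.
lexer = shlex.shlex(text, posix=True)
lexer.commenters = ';'
semicolon_tokens = list(lexer)

# '#' no longer starts a comment once it is removed from commenters.
lexer = shlex.shlex('a # b', posix=True)
lexer.commenters = ';'
hash_tokens = list(lexer)

print(default_tokens)    # ['name', 'value', ';', 'trailing', 'note']
print(semicolon_tokens)  # ['name', 'value']
print(hash_tokens)       # ['a', '#', 'b']
```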

shlex.wordchars

The string of characters that will accumulate into multi-character tokens. By default, includes all ASCII alphanumerics and underscore. In POSIX mode, the accented characters in the Latin-1 set are also included. If punctuation_chars is not empty, the characters ~-./*?=, which can appear in filename specifications and command line parameters, will also be included in this attribute, and any characters which appear in punctuation_chars will be removed from wordchars if they are present there. If whitespace_split is set to True, this will have no effect.
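The effect of wordchars is easiest to see by comparing the default set with an extended one. The sketch below adds '-', '=' and '.' by hand; the input string is made up.

```python
import shlex

text = '--color=auto file-1.txt'

# Default wordchars: every non-word character becomes its own token.
default_tokens = list(shlex.shlex(text))

# Extending wordchars lets option-like and filename-like tokens stay whole.
lexer = shlex.shlex(text)
lexer.wordchars += '-=.'
extended_tokens = list(lexer)

print(default_tokens)
# ['-', '-', 'color', '=', 'auto', 'file', '-', '1', '.', 'txt']
print(extended_tokens)
# ['--color=auto', 'file-1.txt']
```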

shlex.whitespace

Characters that will be considered whitespace and skipped. Whitespace bounds tokens. By default, includes space, tab, linefeed and carriage-return.
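Because whitespace bounds tokens, adding a character to it turns that character into a separator. The sketch below treats commas as whitespace for a made-up list of words.

```python
import shlex

text = 'red, green,blue'

# Default whitespace: the comma is returned as a token of its own.
default_tokens = list(shlex.shlex(text, posix=True))

# Treat ',' as whitespace so that it only separates tokens.
lexer = shlex.shlex(text, posix=True)
lexer.whitespace += ','
comma_tokens = list(lexer)

print(default_tokens)  # ['red', ',', 'green', ',', 'blue']
print(comma_tokens)    # ['red', 'green', 'blue']
```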

shlex.escape

Characters that will be considered as escape. This will be only used in POSIX mode, and includes just '\' by default.

shlex.quotes

Characters that will be considered string quotes. The token accumulates until the same quote is encountered again (thus, different quote types protect each other as in the shell). By default, includes ASCII single and double quotes.

shlex.escapedquotes

Characters in quotes that will interpret escape characters defined in escape. This is only used in POSIX mode, and includes just '"' by default.

shlex.whitespace_split

If True, tokens will only be split on whitespace. This is useful, for example, for parsing command lines with shlex, getting tokens in a similar way to shell arguments. When used in combination with punctuation_chars, tokens will be split on whitespace in addition to those characters.

Changed in version 3.8: The punctuation_chars attribute was made compatible with the whitespace_split attribute.
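The interaction between whitespace_split and punctuation_chars can be sketched on a command line whose pipe symbol is not surrounded by spaces. The command line is an assumption for the example.

```python
import shlex

text = 'ls -l|wc -l'

# whitespace_split alone: '|' is glued to the neighbouring words.
lexer = shlex.shlex(text, posix=True)
lexer.whitespace_split = True
whitespace_only = list(lexer)

# With punctuation_chars as well, '|' is split out as its own token.
lexer = shlex.shlex(text, posix=True, punctuation_chars=True)
lexer.whitespace_split = True
with_punctuation = list(lexer)

print(whitespace_only)   # ['ls', '-l|wc', '-l']
print(with_punctuation)  # ['ls', '-l', '|', 'wc', '-l']
```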

shlex.infile

The name of the current input file, as initially set at class instantiation time or stacked by later source requests. It may be useful to examine this when constructing error messages.

shlex.instream

The input stream from which this shlex instance is reading characters.

shlex.source

This attribute is None by default. If you assign a string to it, that string will be recognized as a lexical-level inclusion request similar to the source keyword in various shells. That is, the immediately following token will be opened as a filename and input will be taken from that stream until EOF, at which point the close() method of that stream will be called and the input source will again become the original input stream. Source requests may be stacked any number of levels deep.

shlex.debug

If this attribute is numeric and 1 or more, a shlex instance will print verbose progress output on its behavior. If you need to use this, you can read the module source code to learn the details.

shlex.lineno

Source line number (count of newlines seen so far plus one).

shlex.token

The token buffer. It may be useful to examine this when catching exceptions.

shlex.eof

Token used to determine end of file. This will be set to the empty string (''), in non-POSIX mode, and to None in POSIX mode.

shlex.punctuation_chars

A read-only property. Characters that will be considered punctuation. Runs of punctuation characters will be returned as a single token. However, note that no semantic validity checking will be performed: for example, '>>>' could be returned as a token, even though it may not be recognised as such by shells.

Added in version 3.6.

Parsing Rules

When operating in non-POSIX mode, shlex will try to obey the following rules.

  • Quote characters are not recognized within words (Do"Not"Separate is parsed as the single word Do"Not"Separate);

  • Escape characters are not recognized;

  • Enclosing characters in quotes preserve the literal value of all characters within the quotes;

  • Closing quotes separate words ("Do"Separate is parsed as "Do" and Separate);

  • If whitespace_split is False, any character not declared to be a word character, whitespace, or a quote will be returned as a single-character token. If it is True, shlex will only split words on whitespace;

  • EOF is signaled with an empty string ('');

  • It's not possible to parse empty strings, even if quoted.
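The non-POSIX rules listed above can be observed directly with a few short inputs. This is a sketch using made-up strings, with the default compatibility-mode lexer and whitespace_split left at False.

```python
import shlex

def compat(text):
    """Tokenize text with a default (non-POSIX) shlex instance."""
    return list(shlex.shlex(text))

inside_word = compat('Do"Not"Separate')  # quotes inside a word are kept as-is
closing = compat('"Do"Separate')         # a closing quote ends the token
escape = compat(r'a\ b')                 # backslash is just another character
empty = compat("''")                     # the quotes come back, not ''
eof = shlex.shlex('').get_token()        # EOF is the empty string

print(inside_word)  # ['Do"Not"Separate']
print(closing)      # ['"Do"', 'Separate']
print(escape)       # ['a', '\\', 'b']
print(empty)        # ["''"]
print(repr(eof))    # ''
```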

When operating in POSIX mode, shlex will try to obey the following parsing rules.

  • Quotes are stripped out, and do not separate words ("Do"Not"Separate" is parsed as the single word DoNotSeparate);

  • Non-quoted escape characters (e.g. '\') preserve the literal value of the next character that follows;

  • Enclosing characters in quotes which are not part of escapedquotes (e.g. "'") preserve the literal value of all characters within the quotes;

  • Enclosing characters in quotes which are part of escapedquotes (e.g. '"') preserve the literal value of all characters within the quotes, with the exception of the characters mentioned in escape. The escape characters retain their special meaning only when followed by the quote in use, or the escape character itself. Otherwise the escape character will be considered a normal character.

  • EOF is signaled with a None value;

  • Quoted empty strings ('') are allowed.
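The POSIX rules listed above can be checked the same way. The sketch uses split(), which runs in POSIX mode by default, on made-up strings; raw string literals keep the backslashes visible.

```python
import shlex

stripped = shlex.split('"Do"Not"Separate"')  # quotes removed, one word
unquoted_escape = shlex.split(r'a\ b')       # backslash protects the space
single_quoted = shlex.split(r"'x\y'")        # no escapes inside single quotes
escaped_quote = shlex.split(r'"x\"y"')       # \" inside double quotes gives "
other_escape = shlex.split(r'"x\ny"')        # backslash kept before other chars
empty = shlex.split("a '' b")                # quoted empty string is a token
eof = shlex.shlex('', posix=True).get_token()

print(stripped)         # ['DoNotSeparate']
print(unquoted_escape)  # ['a b']
print(single_quoted)    # ['x\\y']
print(escaped_quote)    # ['x"y']
print(other_escape)     # ['x\\ny']
print(empty)            # ['a', '', 'b']
print(eof)              # None
```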

Improved Compatibility with Shells

Added in version 3.6.

The shlex class provides compatibility with the parsing performed by common Unix shells like bash, dash, and sh. To take advantage of this compatibility, specify the punctuation_chars argument in the constructor. This defaults to False, which preserves pre-3.6 behaviour. However, if it is set to True, then parsing of the characters ();<>|& is changed: any run of these characters is returned as a single token. While this is short of a full parser for shells (which would be out of scope for the standard library, given the multiplicity of shells out there), it does allow you to perform processing of command lines more easily than you could otherwise. To illustrate, you can see the difference in the following snippet:

>>> import shlex
>>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
>>> s = shlex.shlex(text, posix=True)
>>> s.whitespace_split = True
>>> list(s)
['a', '&&', 'b;', 'c', '&&', 'd', '||', 'e;', 'f', '>abc;', '(def', 'ghi)']
>>> s = shlex.shlex(text, posix=True, punctuation_chars=True)
>>> s.whitespace_split = True
>>> list(s)
['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', 'abc', ';',
'(', 'def', 'ghi', ')']

Of course, tokens will be returned which are not valid for shells, and you'll need to implement your own error checks on the returned tokens.

Instead of passing True as the value for the punctuation_chars parameter, you can pass a string with specific characters, which will be used to determine which characters constitute punctuation. For example:

>>> import shlex
>>> s = shlex.shlex("a && b || c", punctuation_chars="|")
>>> list(s)
['a', '&', '&', 'b', '||', 'c']

Note

When punctuation_chars is specified, the wordchars attribute is augmented with the characters ~-./*?=. That is because these characters can appear in file names (including wildcards) and command-line arguments (e.g. --color=auto). Hence:

>>> import shlex
>>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
...                 punctuation_chars=True)
>>> list(s)
['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

However, to match the shell as closely as possible, it is recommended to always use posix and whitespace_split when using punctuation_chars, which will negate wordchars entirely.

For best effect, punctuation_chars should be set in conjunction with posix=True. (Note that posix=False is the default for shlex.)
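Putting the recommended settings together, the sketch below groups the tokens of a pipeline into one argument list per command. The helper name split_pipeline and the sample command line are hypothetical. It is an illustration only, not a shell parser: after tokenizing, a quoted '|' is indistinguishable from a real pipe symbol.

```python
import shlex

def split_pipeline(line):
    """Group the tokens of `line` into one argument list per pipeline stage."""
    lexer = shlex.shlex(line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    commands, current = [], []
    for token in lexer:
        if token == '|':
            # A pipe closes the current command and starts a new one.
            commands.append(current)
            current = []
        else:
            current.append(token)
    commands.append(current)
    return commands

stages = split_pipeline("cat 'my file' | grep -i foo | wc -l")
print(stages)
# [['cat', 'my file'], ['grep', '-i', 'foo'], ['wc', '-l']]
```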