Customising the Coke scanner

$Id: coke.html.in 545 2006-11-06 18:04:41Z piumarta $
corresponds to the idst-5.9 release


Contents:
1   Introduction
2   The scanner loop
2.1   Attributes
2.2   Temporary modifications
2.3   Initial state
3   Modifying the scanner from within Coke
4   Scanner protocol
5   Examples
5.1   Scanner macros
5.2   META-II
6   References

1   Introduction

Coke includes a built-in scanner that converts a plain text, parenthesized representation of s-expressions into object structures that represent equivalent ASTs. Although ambitious language processors will certainly replace this scanner with a more exotic and powerful parsing mechanism, the scanner is designed to be fully programmable and extensible from userland dynamic code.

2   The scanner loop

The scanner reads characters from a stream and answers the object they represent. Every character is active; in other words, for every character read from the stream the scanner finds a reader method and invokes it on the stream. If the reader method answers non-nil, the answer is returned to the scanner's caller. If the reader answers nil, the scanner continues with the next character on the stream. (The scanner therefore answers nil only when the stream is empty.)
Scanner read: stream
[
  | char reader object |
  [stream atEnd]
    whileFalse:
     [char := stream peek.
      reader := readers at: char.
      object := self performMethod: reader with: stream.
      object notNil ifTrue: [^object]].
  ^nil
]
For example, the comment character (;) is associated with the following method:
Scanner readComment: stream
[
  [stream atEnd or: [stream next == $\n]] whileFalse.
  ^nil
]
The method gobbles the entire line after the comment character, including the end of line character. Since readComment: answers nil, the scanner does not return a result to its caller but continues scanning immediately with the first character on the next line.

The scanner is re-entrant. For example, the quote character (') is associated with a method that invokes the scanner recursively:

Scanner readQuote: stream
[
  stream next.	"gobble the '"
  ^Expression with: #quote with: (self read: stream)
]
This converts the text
' anything
into the list
(quote anything)
.

The scanner can also read a list of objects from the stream, stopping at an arbitrary delimiting character. This method is invoked whenever an opening parenthesis is encountered:

Scanner readOpenParen: stream
[
  | list |
  stream next.	"gobble the ("
  list := self read: stream list: Expression delimited: $).
  stream next.	"gobble the )"
  ^list
]
Reader functions are installed and queried in a scanner using the following protocol:
Scanner readerAt: aCharacter
Scanner readerAt: aCharacter put: readerMethod
Scanner readerAt: firstCharacter through: lastCharacter put: readerMethod
For example, during initialisation the Coke scanner sends itself these messages:
self
  readerAt: $(             put: (self _methodAt: #readOpenParen:);
  readerAt: $0 through: $9 put: (self _methodAt: #readDigit:);
  "... and so forth ..."

2.1   Attributes

Each character also has an associated attributes object. One such attribute is the symbol #id associated with any character that can appear in the second (and subsequent) positions of an identifier (symbol). (Another is #space, associated with any character that should be considered whitespace.)

Setting a reader method with either of the messages

Scanner readerAt: aCharacter put: readerMethod
Scanner readerAt: firstCharacter through: lastCharacter put: readerMethod
clears the attribute associated with the implicated characters.

To associate a reader method and attribute with one or more characters, two similar methods are provided:

Scanner readerAt: aCharacter put: readerMethod withAttribute: anObject
Scanner readerAt: firstCharacter through: lastCharacter put: readerMethod withAttribute: anObject
Finally, attributes can be accessed independently of reader methods:
Scanner attributeAt: aCharacter
Scanner attributeAt: aCharacter put: anObject
To illustrate these methods, the three messages
aScanner readerAt:    $$ put: (aScanner readerAt: $a).
aScanner readerAt:    $$ put: (aScanner readerAt: $a) withAttribute: #id.
aScanner attributeAt: $$ put: #id.
will ensure aScanner allows dollars as the initial character of identifiers, anywhere in identifiers, or only in the second (and subsequent) characters (but not the first) of identifiers, respectively.

2.2   Temporary modifications

The state of a scanner (associations between characters, reader methods and attributes) can be saved and later restored in a LIFO manner. For example, the following parses one object from a stream with the comment character disabled (it will act like a letter):
| object |
"save the scanner's state"
aScanner save.
"convert ; into a comment"
aScanner readerAt: $; put: (aScanner readerAt: $a) withAttribute: #id.
"read the next object from aStream"
object := aScanner read: aStream.
"restore the scanner's state"
aScanner restore.
Sends of save and restore to a given scanner nest correctly and to an arbitrary depth.

2.3   Initial state

A new 'empty' Scanner associates every character with a reader that immediately signals an 'illegal character' error.

During initialisation of the Coke scanner, the following additional associations are installed:

character(s) reader method attribtue
space and control chars   whitespace #space
0 through 9 integer #id
( list
) unexpected char
" string
' quoted object
; comment
everthing else letter #id

3   Modifying the scanner from within Coke

Methods expect three 'hidden' initial arguments: a vtable closure and two 'self's (one for accessing state, one for sends to self). Provided these arguments are specified before any arguments, the reader table can be extended using lambda expressions from within Coke.

For example, if quotation were not provided by default it could be added dynamically from Coke as follows:

(define CokeScanner (import "CokeScanner"))

(define read-quote
  (lambda (_closure _self self stream)
    [stream next]	; gobble the '
    [Expression with: 'quote with: [self read: stream]]))

[CokeScanner readerAt: '$' put: read-quote]

4   Scanner protocol

This section lists all public message to which Scanner responds.
new
Answers a new Scanner.

Scanner readerAt: aCharacter
Answers the reader method for aCharacter.

Scanner readerAt: aCharacter put: aMethod
Sets the reader method for aCharacter.

Scanner attributeAt: aCharacter
Answers the attribute for aCharacter.

Scanner attribtueAt: aCharacter put: anObject
Sets the attribute for aCharacter.

Scanner readerAt: aCharacter put: aMethod withAttribute: anObject
Sets the reader method and attribute for aCharacter.

Scanner readerAt: firstCharacter through: lastCharacter put: aMethod withAttribute: anObject
Sets the reader method and attribute for characters in the range firstCharacter to firstCharacter inclusive.

Scanner read: inputStringOrStream
Answers the next scanned object from the inputStringOrStream, or nil if the end of the stream is encountered.

Scanner read: inputStream list: listType delimited: terminator
Scans objects from the inputStream until the terminator character is encountered (where the first character of an object would be expected). Answers a collection of the given listType containing those objects, or if the end of stream is encountered. The terminator character is removed from the inputStream and discarded.

Scanner ignoreSpaces: inputStream
Reads and discards characters from the inputStream as long as those characters attribute is #space. Answers the first non-#space character, or nil if the end of the inputStream is reached.

Scanner read: inputStream string: stringType delimited: terminator withEscapes: escapeTable
Creates a collection of the given stringType and populates it with all objects (usually characters) read from inputStream until the terminator is encountered. The terminator is removed from the inputStream and discarded. If a backslash (\) character is encountered in the stream, the character following it is added to the string (even if it is the terminator) unless that character has a corresponding entry in the escapeTable in which case that entry's value is substituted.

5   Examples

5.1   Scanner macros

The Coke compiler is designed to be extended with new AST node types (using the syntax form). These are similar to macros but work only on object structures. More 'traditional' macros, performing purely textual substitution, are not built into the scanner but can be added by user code.

First we need a small structure to hold macro definitions, which will resemble closures, and a dictionary that maps macro names to their definitions.

(syntax macro-method	(lambda (node comp) `(long@ ,[node second]  )))
(syntax macro-data	(lambda (node comp) `(long@ ,[node second] 1)))

(define make-macro
  (lambda (method data)
    (let ((macro (malloc 8)))
      (set (macro-method macro) method)
      (set (macro-data   macro) data)
      macro)))

(define %macros	[SlotDictionary new])
To define a macro we add a macro structure to this dictionary associating the name with the closure.
(define define-macro
  (lambda (name method data)
    [%macros at: name put: (make-macro method data)]))
Macro (control sequence) names will be prefixed by the backslash (\) character. When the scanner sees this character it will invoke the control-sequence reader:
(define cs-reader
  (method (stream)
    [stream next]
    (let ((csname [self read: stream]))
      (or [csname isSymbol] (error "missing control sequence name"))
      (let ((definition [%macros at: csname]))
	(or definition (error "undefined control sequence: \\%s" [[csname asString] _stringValue]))
	((macro-method definition) _closure _self self stream (macro-data definition))))))
A macro definition will be either a single object or a list of objects. To delimit lists of object we'll surround them with braces ({...}) to create a group. (Closing braces should always be consumed by the list reader and never seen by the scanner; they will therefore be bound to an error method.)
(define group-reader
  (method (stream)
    [stream next]
    [self read: stream list: Expression delimited: '$\}]))

(define group-error
  (method (stream)
    [self error: stream unexpected: [stream peek]]))
The default behaviour for a macro will be to insert its data into the parse.
(define %insert
  (method (stream data)
    data))
The only predefined macro will be the macro-defining macro. Its behaviour will be to scan forwards in the stream looking for a macro name followed by either a single token or a group of tokens.
(define-macro 'def
  (method (stream data)
    (or (and [self ignoreSpaces: stream] (== '$\\ [stream next]))
	(error "missing control sequence name in \\def"))
    (let ((csname [self read: stream]))
      (or [csname isSymbol] (error "missing control sequence name in \\def"))
      (define-macro csname %insert [self read: stream]))
    0))
Finally, we associate the new active characters with their associated methods. We can then use our textual macros immediately, in-line in the source file.
[CokeScanner readerAt: '$\\ put: cs-reader]
[CokeScanner readerAt: '$\{ put: group-reader]
[CokeScanner readerAt: '$\} put: group-error]

\def\foo 42
\def\bar{a sequence}
\def\baz{This is \bar of tokens}

[StdOut println: '(one \foo two \baz three \foo four)]
Running the above program produces:
(#one 42 #two (#This #is (#a #sequence) #of #tokens) #three 42 #four)
The above bear a striking resemblance to TeX macros. Extending them to implement the full TeX macro semantics would not be difficult (probably no more than tripling the size of the above code).

To save your typing finger, the above example is available in the idst distribution as function/examples/macros.

5.2   META-II

We can add Val D. Schorre's META-II syntax [1] directly to the scanner using an active 'escape' character. The approach described here is similar to that described by Henry Baker [2] for Common Lisp but differs in that the scanner changes needed to read META expresions will only be activated within the expression immediately following the 'escape' character.

We are going to teach the scanner how to transform META markup into AST nodes (the head will identify the META operation being encoded) and add a 'META compiler' to the parser that will convert these marked-up nodes into an executable recogniser. When the scanner sees the active 'escape' character it will read the next expression form the stream, with META 'markup', and return the result of 'compiling' the META program it contains to the scanner's caller (usually the Coke compiler).

First we need a helper function, a version of 'map' that applies a transformation to each element in a list.

(define meta-compile 0)	; forward

(define meta-map	; poor man's mapcar
  (lambda (function expr)
    (let ((limit  [expr size])
	  (result [Expression new: limit])
	  (index  '0))
      (while (< index limit)
	[result at: index put: (function [expr at: index])]
	(set index [index + '1]))
      result)))
Our 'META compiler' will deal with two kinds of object in AST nodes: meta-form compiles a node that encode a META operation, and meta-other compiles everything else. Both answer true or false depending on whether their 'match' succeeds. By default, meta-other succeeds if its object in the META pattern is identical to the next element read from the input stream. (It could be made arbitrarily more sophisticated, for example checking structural equivalance or performing unification.)
(define meta-other	; default matcher for non-meta constructs within pattern
  (lambda (expr)
    `(and (== ',expr [stream peek])
	  (let () [stream next] 1))))

(define meta-form	; matchers for meta constructs
  (lambda (node)
    (let ((type [node first])
	  (expr [node second]))
      (cond
       ((== type 'meta-repeat)		`(let () (while ,(meta-compile expr)) 1))
       ((== type 'meta-predicate)	expr)
       ((== type 'meta-alternate)	`(or  ,@(meta-map meta-compile expr)))
       ((== type 'meta-sequence)	`(and ,@(meta-map meta-compile expr)))
       ((== type 'meta-unify)		(if (and [expr isArray] (== '2 [expr size]))
					    `(and (,[expr first] [stream peek])
						  (let ()
						    (set ,[expr second] [stream next])
						    1))
					    [expr error: '"syntax error near #@"]))
       (1				(meta-other node))))))
The META compiler recursively checks for a possible META expression and calls meta-form or meta-form as appropriate.
(define meta-compile	; compile meta pattern
  (lambda (expr)
    (if (and [expr isArray] (== '2 [expr size]))
	(meta-form expr)
	(meta-other expr))))
A META expression will be introduced by an active 'escape' character buried in a Coke program. This character will invoke a meta sequence reader.
expression   meaning
{ e... } any one of the elements e...
[ e... ] all of the elements e... in order
$ e any number (zero or more) of e
@ ( p v ) anything satisfying predicate p
that will be bound to variable v
! p the predicate p (any Coke expression)
The meta sequence reader looks for one of the five META characters that prefix META expressions and converts it into a marked-up AST node accordingly. The META compiler recursively checks for a possible META expression and calls meta-form or meta-form as appropriate.
(define meta-sequence-reader
  (method (stream)
    (let ((delim [stream next]))
      (cond
       ((== delim '${)	[Expression with: 'meta-alternate with: [self read: stream list: Expression delimited: '$}]])
       ((== delim '$[)	[Expression with: 'meta-sequence  with: [self read: stream list: Expression delimited: '$]]])
       ((== delim '$$)	[Expression with: 'meta-repeat    with: [self read: stream]])
       ((== delim '$@)	[Expression with: 'meta-unify     with: [self read: stream]])
       ((== delim '$!)	[Expression with: 'meta-predicate with: [self read: stream]])
       (1		[self error: stream unexpected: delim])))))
META expressions will be prefixed by the escape character '?'. The scanner is temporarily modifed (for the duration of scanning the next expression from the input stream) to add markup in response to the above META operators. (As in the previous example, closing delimiters are made illegal for the duration of META expressions.)
(define meta-sequence-illegal
  (method (stream)
    [self error: stream unexpected: [stream next]]))

;; make ? an active character that enables meta syntax for the following expression only and then compiles the result

(define meta-sequence-escape
  (method (stream)
    [stream next]
    [self save]
    [self readerAt: '$\\ put: [self readerAt: '$$]]
    [self readerAt: '${  put: meta-sequence-reader]  [self readerAt: '$} put: meta-sequence-illegal]
    [self readerAt: '$[  put: meta-sequence-reader]  [self readerAt: '$] put: meta-sequence-illegal]
    [self readerAt: '$$  put: meta-sequence-reader]
    [self readerAt: '$@  put: meta-sequence-reader]
    [self readerAt: '$!  put: meta-sequence-reader]
    (let ((result [self read: stream]))
      [self restore]
      (meta-compile result))))

[CokeScanner readerAt: '$? put: meta-sequence-escape withAttribute: 'id]		; allow ? to appear embedded in symbols
The META syntax is now available immediately to the program. We can use it to write a parser for integers (after the example in Baker [2]).
;;; Example: integer parsing (see [2, Section 1])

(define digit	(lambda (x) (and (<= '$0 x) (<= x '$9))))	; predicate

(define parse-int
  (lambda (string)
    (let ((stream [string readStream])
	  (s '1) (d  0) (n '0))
      (if ? [ { \+ [ \- !(set s '-1) ] !1 }
	      @(digit d) !(set n (send 'digitValue d))
	      $ [ @(digit d) !(set n (send '+ (send '* n '10) (send 'digitValue d))) ] ]
	  [s * n]
	  'error))))

[StdOut print: (parse-int '"0")]	[StdOut cr]
[StdOut print: (parse-int '"42")]	[StdOut cr]
[StdOut print: (parse-int '"+1")]	[StdOut cr]
[StdOut print: (parse-int '"-1")]	[StdOut cr]
[StdOut print: (parse-int '"+42")]	[StdOut cr]
[StdOut print: (parse-int '"-666")]	[StdOut cr]
[StdOut print: (parse-int '"-haha")]	[StdOut cr]
Running the above program produces the following output:
0
42
1
-1
42
-666
#error
To save your typing finger, the above example is available in the idst distribution as function/examples/prag-parse.

6   References

[1]Schorre, D. V. "META II: A Syntax-Oriented Compiler Writing Language". Proc. 19'th Nat'l. Conf. of the ACM, August 1964, pp. D1.3-1--D1.3-11.
[2]Baker, Henry G. "Pragmatic Parsing in Common Lisp". ACM Lisp Pointers 4(2), April--June 1991, pp. 3--15.