cobolg

COBOL grammar.


Notes about cobolg parser

COBOL originally had to be written in a fixed-width form of 72 columns, divided into fixed-width zones (or areas), in order: sequence number area (6 chars), indicator area (1 char), area A (4 chars) and area B (61 chars). Everything from the 73rd column onward (until end-of-line) is ignored.

Here's an example of fixed format COBOL source (shown here in the traditional 80-character form):

	0        1         2         3         4         5         6         7         8
	12345678901234567890123456789012345678901234567890123456789012345678901234567890
	--------------------------------------------------------------------------------
	000100 ID DIVISION.                                                     IGNORED-
	000200 PROGRAM-ID. PROGRAMNAME.                                         IGNORED-
	000300 AUTHOR. RSLEMOS.                                                 IGNORED-
	000400******************************************************************IGNORED-
	000500 ENVIRONMENT DIVISION.                                            IGNORED-
	000600 CONFIGURATION SECTION.                                           IGNORED-
	000700 SPECIAL-NAMES.                                                   IGNORED-
	000800     DECIMAL-POINT IS COMMA.                                      IGNORED-
	000900******************************************************************IGNORED-
	001000 DATA DIVISION.                                                   IGNORED-
	001100 WORKING-STORAGE SECTION.                                         IGNORED-
	001200 77  FILLER                          PIC X(50)   VALUE 'THIS STRINIGNORED-
	001300-    'G IS TOO LONG TO FIT IN THE LINE ABOVE'.                    IGNORED-
	001400******************************************************************IGNORED-
	001500 PROCEDURE DIVISION.                                              IGNORED-
	001600     GOBACK.                                                      IGNORED-
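
The zone layout described above can be illustrated by slicing a line at the column boundaries. This is only a sketch: the class name `FixedFormatLine` and its fields are hypothetical and not part of cobolg, and the handling of lines shorter than 72 columns (shorter or empty zones) is an assumption.

```java
/** One fixed-format COBOL line split into its zones (a sketch, not cobolg code). */
final class FixedFormatLine {
    final String sequence;   // columns 1-6
    final String indicator;  // column 7
    final String areaA;      // columns 8-11
    final String areaB;      // columns 12-72
    final String ignored;    // columns 73 and beyond

    private FixedFormatLine(String sequence, String indicator,
                            String areaA, String areaB, String ignored) {
        this.sequence = sequence;
        this.indicator = indicator;
        this.areaA = areaA;
        this.areaB = areaB;
        this.ignored = ignored;
    }

    /** Splits a single line (without its line terminator) at the zone boundaries. */
    static FixedFormatLine parse(String line) {
        return new FixedFormatLine(
                slice(line, 0, 6),
                slice(line, 6, 7),
                slice(line, 7, 11),
                slice(line, 11, 72),
                slice(line, 72, line.length()));
    }

    // Assumption: a line that ends before a zone boundary yields shorter (or empty) zones.
    private static String slice(String s, int from, int to) {
        int start = Math.min(from, s.length());
        int end = Math.max(start, Math.min(to, s.length()));
        return s.substring(start, end);
    }
}
```

For the first line of the example, `parse` would put `000100` in `sequence`, a blank in `indicator`, and `IGNORED-` in `ignored`; the statement text itself is spread over `areaA` and `areaB`.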

More recently, COBOL can also be written in a looser style, the free format, which comprises just areas (also called margins) A and B.

Here's an example of free format COBOL source:

	ID DIVISION.
	PROGRAM-ID. PROGRAMNAME.
	AUTHOR. RSLEMOS.
	*****************************************************************
	ENVIRONMENT DIVISION.
	CONFIGURATION SECTION.
	SPECIAL-NAMES.
	    DECIMAL-POINT IS COMMA.
	*****************************************************************
	DATA DIVISION.
	WORKING-STORAGE SECTION.
	77  FILLER                          PIC X(50)   VALUE 'THIS STRING IS TOO LONG TO FIT IN THE LINE ABOVE'.
	*****************************************************************
	PROCEDURE DIVISION.
	    GOBACK.

In cobolg the same grammar is used for both formats; only the lexers differ. The lexer for free format is modeless and easier to understand (assuming you know how to read ANTLR lexers).

How does the lexer for fixed format work?

ANTLR lexers work by reading characters, one at a time, and grouping them into tokens. The output is one (or more) streams of tokens. There is no room for the concept of fixed-width areas. In cobolg the fixed-width areas are therefore introduced artificially into the source code, by stuffing single marker characters at specific positions.

Here's the same fixed format COBOL source as above, with the stuffed characters shown as | (pipe) characters (the characters actually used are in the range U+EBA0 to U+EBA3, inside the Unicode Private Use Area).

	 0          1         2         3         4         5         6         7          8
	|123456|7|89012345678901234567890123456789012345678901234567890123456789012|34567890
	|------|-|-----------------------------------------------------------------|--------
	|000100| |ID DIVISION.                                                     |IGNORED-
	|000200| |PROGRAM-ID. PROGRAMNAME.                                         |IGNORED-
	|000300| |AUTHOR. RSLEMOS.                                                 |IGNORED-
	|000400|*|*****************************************************************|IGNORED-
	|000500| |ENVIRONMENT DIVISION.                                            |IGNORED-
	|000600| |CONFIGURATION SECTION.                                           |IGNORED-
	|000700| |SPECIAL-NAMES.                                                   |IGNORED-
	|000800| |    DECIMAL-POINT IS COMMA.                                      |IGNORED-
	|000900|*|*****************************************************************|IGNORED-
	|001000| |DATA DIVISION.                                                   |IGNORED-
	|001100| |WORKING-STORAGE SECTION.                                         |IGNORED-
	|001200| |77  FILLER                          PIC X(50)   VALUE 'THIS STRIN|IGNORED-
	|001300|-|    'G IS TOO LONG TO FIT IN THE LINE ABOVE'.                    |IGNORED-
	|001400|*|*****************************************************************|IGNORED-
	|001500| |PROCEDURE DIVISION.                                              |IGNORED-
	|001600| |    GOBACK.                                                      |IGNORED-
	 ───┬── ┬ ────────────────────────────────┬──────────────────────────────── ────┬───
        │   └ INDICATOR                    DEFAULT                        SKIPTOEOL ┘
        └ SEQUENCE
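
The stuffing step shown in the diagram can be sketched as a per-line transformation. The class `AreaMarker` and its method names are hypothetical; the mapping of each marker to a column follows the diagram, while the choice to emit a marker only when the line actually reaches that column is an assumption about short lines, not something taken from cobolg.

```java
/** Inserts the private-use marker characters at the zone boundaries of one line. */
final class AreaMarker {
    static final char TO_SEQUENCE = '\uEBA0';  // before column 1
    static final char TO_INDICATOR = '\uEBA1'; // before column 7
    static final char TO_DEFAULT = '\uEBA2';   // before column 8
    static final char TO_SKIPTOEOL = '\uEBA3'; // before column 73

    /** Marks a single line (without its line terminator). */
    static String mark(String line) {
        StringBuilder out = new StringBuilder(line.length() + 4);
        for (int col = 0; col < line.length(); col++) {
            switch (col) {
                case 0:  out.append(TO_SEQUENCE);  break;
                case 6:  out.append(TO_INDICATOR); break;
                case 7:  out.append(TO_DEFAULT);   break;
                case 72: out.append(TO_SKIPTOEOL); break;
                default: break;
            }
            out.append(line.charAt(col));
        }
        return out.toString();
    }

    /** Removes the markers again, giving back the original line. */
    static String strip(String marked) {
        StringBuilder out = new StringBuilder(marked.length());
        for (int i = 0; i < marked.length(); i++) {
            char c = marked.charAt(i);
            if (c < TO_SEQUENCE || c > TO_SKIPTOEOL) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

An 80-character line grows to 84 characters after `mark`, and `strip(mark(line))` gives back the line unchanged, provided the source does not already contain characters in that private-use range.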

Lexer mode names are suffixed with _MODE. There are nine of them:

Mode name	Areas in effect	Notes
DEFAULT_MODE	beginning of line and DEFAULT	after a newline anywhere, but before U+EBA0
SEQUENCE_MODE	SEQUENCE
INDICATOR_MODE	INDICATOR
PRE_DEFAULT_MODE	INDICATOR	after ' ', but before U+EBA2
PRE_COMMENT_MODE	INDICATOR	after '*', but before U+EBA2
COMMENT_MODE	DEFAULT	if INDICATOR is '*'
PRE_CONTINUATION_MODE	INDICATOR	after '-', but before U+EBA2
CONTINUATION_MODE	DEFAULT	if INDICATOR is '-'
SKIPTOEOL_MODE	SKIPTOEOL

Besides being used in the DEFAULT area, the DEFAULT_MODE also starts the line. So the first U+EBA0 switches to SEQUENCE_MODE. Any newline switches back to DEFAULT_MODE.

The DEFAULT area can be in two other modes:

COMMENT_MODE, if the INDICATOR area happens to have a '*';
CONTINUATION_MODE, if the INDICATOR area happens to have a '-'.

Since the complete switch to DEFAULT_MODE, COMMENT_MODE or CONTINUATION_MODE happens only at U+EBA2, the ' ', '*' and '-' characters in INDICATOR_MODE first switch to the corresponding PRE_ modes:

PRE_DEFAULT_MODE;
PRE_COMMENT_MODE;
PRE_CONTINUATION_MODE.
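
The mode switching described above can be read as a small state machine driven by the marker characters, the indicator character and newlines. The sketch below models only that; it is not the ANTLR lexer itself. The class `FixedFormatModes` is hypothetical, and keeping the lexer in the INDICATOR mode for any indicator other than ' ', '*' and '-' is an assumption.

```java
/** A plain-Java model of the mode transitions (the real lexer is an ANTLR grammar). */
final class FixedFormatModes {
    enum Mode {
        DEFAULT, SEQUENCE, INDICATOR,
        PRE_DEFAULT, PRE_COMMENT, PRE_CONTINUATION,
        COMMENT, CONTINUATION, SKIPTOEOL
    }

    /** Returns the mode in effect after reading character c in the given mode. */
    static Mode next(Mode mode, char c) {
        if (c == '\n') {
            return Mode.DEFAULT; // any newline switches back to DEFAULT
        }
        switch (mode) {
            case DEFAULT:
                if (c == '\uEBA0') return Mode.SEQUENCE;  // start of line
                if (c == '\uEBA3') return Mode.SKIPTOEOL; // end of area B
                return mode;
            case SEQUENCE:
                return c == '\uEBA1' ? Mode.INDICATOR : mode;
            case INDICATOR:
                if (c == ' ') return Mode.PRE_DEFAULT;
                if (c == '*') return Mode.PRE_COMMENT;
                if (c == '-') return Mode.PRE_CONTINUATION;
                return mode; // assumption: other indicators are not modelled here
            case PRE_DEFAULT:
                return c == '\uEBA2' ? Mode.DEFAULT : mode;
            case PRE_COMMENT:
                return c == '\uEBA2' ? Mode.COMMENT : mode;
            case PRE_CONTINUATION:
                return c == '\uEBA2' ? Mode.CONTINUATION : mode;
            case COMMENT:
            case CONTINUATION:
                return c == '\uEBA3' ? Mode.SKIPTOEOL : mode;
            default:
                return mode; // SKIPTOEOL lasts until the newline
        }
    }

    /** Runs the transitions over already-marked text, starting in DEFAULT. */
    static Mode modeAfter(String marked) {
        Mode mode = Mode.DEFAULT;
        for (int i = 0; i < marked.length(); i++) {
            mode = next(mode, marked.charAt(i));
        }
        return mode;
    }
}
```

Feeding it the marked text of a comment line up to the first character after U+EBA2 ends in `COMMENT`; appending U+EBA3 moves to `SKIPTOEOL`, and the newline brings it back to `DEFAULT`.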

The U+EBA0-U+EBA3 characters are used to switch lexer modes. They are assigned to the MARK channel. Since every other character is directed to either the DEFAULT or the HIDDEN channel, the original input can be reconstructed by gluing together all the tokens in these two channels.
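
The reconstruction property can be sketched with plain Java stand-ins for tokens. The types `Tok` and `Channel` below are hypothetical and much simpler than ANTLR's own token classes; they carry just enough information to show that dropping the MARK tokens and concatenating the rest gives back the input.

```java
import java.util.Arrays;
import java.util.List;

final class ChannelDemo {
    enum Channel { DEFAULT, HIDDEN, MARK }

    /** Minimal stand-in for a lexer token: just its text and its channel. */
    static final class Tok {
        final String text;
        final Channel channel;

        Tok(String text, Channel channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    /** Glues the DEFAULT and HIDDEN tokens back together, dropping the MARK ones. */
    static String reconstruct(List<Tok> tokens) {
        StringBuilder out = new StringBuilder();
        for (Tok t : tokens) {
            if (t.channel != Channel.MARK) {
                out.append(t.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A shortened version of the first source line (trailing blanks left out).
        List<Tok> line = Arrays.asList(
                new Tok("\uEBA0", Channel.MARK),
                new Tok("000100", Channel.HIDDEN),
                new Tok("\uEBA1", Channel.MARK),
                new Tok(" ", Channel.HIDDEN),
                new Tok("\uEBA2", Channel.MARK),
                new Tok("ID", Channel.DEFAULT),
                new Tok(" ", Channel.HIDDEN),
                new Tok("DIVISION", Channel.DEFAULT),
                new Tok(".", Channel.DEFAULT),
                new Tok("\n", Channel.HIDDEN));
        System.out.print(reconstruct(line));
    }
}
```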

For the source code above, the tokens produced are:

MODE BEFORE	TYPE	TEXT	CHANNEL	FREE FORMAT
	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	000100	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_BLANK	␢	HIDDEN
PRE_DEFAULT_MODE	SWITCH_TO_DEFAULT_MODE	U+EBA2	MARK
	IDENTIFICATION	ID		✔
	WS	␢	HIDDEN	✔
	DIVISION	DIVISION		✔
	PERIOD	.		✔
	WS	␢ (×53)	HIDDEN	✔
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)

	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	000200	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_BLANK	␢	HIDDEN
PRE_DEFAULT_MODE	SWITCH_TO_DEFAULT_MODE	U+EBA2	MARK
	PROGRAM_ID	PROGRAM-ID		✔
	PERIOD	.		✔
	WS	␢	HIDDEN	✔
	ID	PROGRAMNAME		✔
	PERIOD	.		✔
	WS	␢ (×41)	HIDDEN	✔
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)

	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	000300	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_BLANK	␢	HIDDEN
PRE_DEFAULT_MODE	SWITCH_TO_DEFAULT_MODE	U+EBA2	MARK
	AUTHOR	AUTHOR		✔
	PERIOD	.		✔
	WS	␢	HIDDEN	✔
	ID	RSLEMOS		✔
	PERIOD	.		✔
	WS	␢ (×49)	HIDDEN	✔
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)

	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	000400	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_COMMENT	*	HIDDEN
PRE_COMMENTAREA_MODE	SWITCH_TO_COMMENTAREA_MODE	U+EBA2	MARK
COMMENTAREA_MODE	COMMENTAREA	* (×65)	HIDDEN	(other)
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)

	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	000500	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_BLANK	␢	HIDDEN
PRE_DEFAULT_MODE	SWITCH_TO_DEFAULT_MODE	U+EBA2	MARK
	ENVIRONMENT	ENVIRONMENT		✔
	WS	␢	HIDDEN	✔
	DIVISION	DIVISION		✔
	PERIOD	.		✔
	WS	␢ (×44)	HIDDEN	✔
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)
⋮	⋮	⋮	⋮	⋮
	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	001200	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_BLANK	␢	HIDDEN
PRE_DEFAULT_MODE	SWITCH_TO_DEFAULT_MODE	U+EBA2	MARK
	LEVEL	77		✔
	WS	␢ (×2)	HIDDEN	✔
	FILLER	FILLER		✔
	WS	␢ (×26)	HIDDEN	✔
	PIC	PIC		✔
	WS	␢ (×1)	HIDDEN	✔
	PICSTRING	X(50)		✔
	WS	␢ (×3)	HIDDEN	✔
	VALUE	VALUE		✔
	WS	␢ (×1)	HIDDEN	✔
	SINGLEQUOTEDSTRING_START	'THIS STRIN		(other)
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)

	SWITCH_TO_SEQUENCEAREA_MODE	U+EBA0	MARK
SEQUENCEAREA_MODE	SEQUENCE_NUMBER	001300	HIDDEN
SEQUENCEAREA_MODE	SWITCH_TO_INDICATORAREA_MODE	U+EBA1	MARK
INDICATORAREA_MODE	INDICATOR_CONTINUATION	-	HIDDEN
PRE_CONTINUATION_MODE	SWITCH_TO_CONTINUATION_MODE	U+EBA2	MARK
	WS	␢ (×4)	HIDDEN	✔
	SINGLEQUOTEDSTRING_CONTINUATION	'G IS T... ABOVE'		(other)
	PERIOD	.		✔
	WS	␢ (×20)	HIDDEN	✔
	SWITCH_TO_SKIPTOEOLAREA_MODE	U+EBA3	MARK
SKIPTOEOLAREA_MODE	SKIP_TO_EOL	IGNORED+\n	HIDDEN	(only \n)
⋮	⋮	⋮	⋮	⋮

How do the compiler statements (like COPY) work?

This one is tough.

Were this grammar intended for compilation, these statements would be handled as if they were directed to some sort of preprocessor: the COPY statement, for example, would simply paste the COPY library contents into the source code (not unlike the #include directive of a C preprocessor).

However, this grammar has far broader applications: for example, it could be used to syntax highlight COBOL source code on a web page, to do static code analysis, to gather metrics on code quality, and so on. So the compiler statements cannot simply vanish (with their effects applied). They must end up in the parse tree.

To accomplish this, the lexer sends these statements to a separate channel (the COMPILE_CHANNEL). After the main source code is parsed (perhaps with missing tokens inserted), a second pass is made over this channel to collect just these statements.

For each compiler statement, the (main) tokens immediately to its left and to its right are located. As injected missing tokens have no position, such a token is considered to lie to the right of the compiler statement if the token following it lies to the right of the compiler statement and is a PERIOD.
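
The neighbor search can be sketched for the simple case in which every main token has a position. The class `NeighborSearch` and its `Tok` type are hypothetical, the main tokens are assumed to be sorted by position, and the special provision for injected missing tokens is deliberately left out of this sketch.

```java
import java.util.List;

final class NeighborSearch {
    /** Hypothetical token stand-in with inclusive character offsets in the source. */
    static final class Tok {
        final String text;
        final int start;
        final int stop;

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Last main token that ends before the statement starts, or null if there is none. */
    static Tok leftNeighbor(List<Tok> mainTokens, Tok statement) {
        Tok left = null;
        for (Tok t : mainTokens) {
            if (t.stop < statement.start) {
                left = t;
            } else {
                break; // tokens are sorted, so no later token can lie to the left
            }
        }
        return left;
    }

    /** First main token that starts after the statement ends, or null if there is none. */
    static Tok rightNeighbor(List<Tok> mainTokens, Tok statement) {
        for (Tok t : mainTokens) {
            if (t.start > statement.stop) {
                return t;
            }
        }
        return null;
    }
}
```

For the text `MOVE A TO B. COPY X. GOBACK.`, with `COPY X.` treated as a single compiler statement spanning offsets 13 to 19, the left neighbor is the PERIOD at offset 11 and the right neighbor is `GOBACK` at offset 21.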

To find the rule in which to insert the compiler statement, the process goes as follows:

if there is no token either to the left or to the right (that is, the compiler statement is at the very beginning or at the very end of the source), then attach to the root;
if the left neighbor is an injected missing token, then attach to its parent;
if the right neighbor is an injected missing PERIOD, then attach to its parent (rationale: one expects that a COPY compiler statement will provide the missing PERIOD);
else start with either neighbor's (both will reach the same place) parent and travel to the root looking for the first rule that properly contains the compiler statement; if the compiler.
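
The last case in the list (walking from a neighbor's parent toward the root) can be sketched over a toy tree. Everything here is hypothetical: the `Node` type is not ANTLR's parse tree API, and "properly contains" is read as strict containment at both ends, which is an assumption. The first three cases of the list are not covered.

```java
final class RuleLookup {
    /** Hypothetical parse tree node covering the inclusive character range [start, stop]. */
    static final class Node {
        final String name;
        final Node parent; // null for the root
        final int start;
        final int stop;

        Node(String name, Node parent, int start, int stop) {
            this.name = name;
            this.parent = parent;
            this.start = start;
            this.stop = stop;
        }

        /** Assumption: "properly contains" means strictly inside at both ends. */
        boolean properlyContains(int from, int to) {
            return start < from && to < stop;
        }
    }

    /**
     * Starts at the neighbor's parent and climbs until a rule properly contains
     * the statement range [from, to]; falls back to the root if none does.
     */
    static Node enclosingRule(Node neighbor, int from, int to) {
        Node rule = neighbor.parent != null ? neighbor.parent : neighbor;
        while (rule.parent != null && !rule.properlyContains(from, to)) {
            rule = rule.parent;
        }
        return rule;
    }
}
```

With a root spanning 0 to 100, a division spanning 0 to 50 and a sentence spanning 10 to 20, a statement at 30 to 40 is attached to the division, while a statement at 60 to 70 ends up at the root.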

After the correct rule to receive the compiler statement is found, a last effort is made to find the correct position at which to inject it, which basically consists of iterating over the rule's tokens looking for the one immediately to the left. Here again there is a special provision for injected missing tokens.

See the comments spread throughout the br.eti.rslemos.cobolg.Compiler class.

This project is permanently under development using this successful branching model.

The "master" branch contains only released versions (currently none).

All development occurs in the "develop" branch.

Features are developed on "feature/*" branches. On these branches the "Only the Gods" kōan may be regularly violated, as pushed commits can be rebased at will (these branches should be temporary anyway).

BEGIN COPYRIGHT NOTICE

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

END COPYRIGHT NOTICE
