Exercise 1-24

Write a program to check a C program for rudimentary syntax errors like unbalanced parentheses, brackets and braces. Don't forget about quotes, both single and double, escape sequences, and comments. (This program is hard if you do it in full generality.)

Keeping in mind that a valid C program should have a closing sequence for every opening sequence (e.g. {...}, "...", /* ... */, etc), we know that a valid program must have an equal number of opening and closing sequences. Therefore, every time we come across a {, for example, we increment the braces counter, and when we come across its closing sequence, }, we decrement the braces counter. We do this for each type of bracket. If the program is valid, all of these counters should be equal to zero in the end, and at no point should they be less than zero. For the two types of quotes, we use variables that switch between the constants IN and OUT. When the program stops reading input, these variables should be equal to OUT if the program is valid. We do a similar thing with comments, except for them we need to read two characters using nested if-statements. Finally, we also make sure to skip past escape sequences.

    #include <stdio.h>
    
    #define IN 1    /* inside a comment or string */
    #define OUT 0   /* outside a comment or string */
    
    main()
    {
        int c;
        int parens, brackets, braces;   /* counter for unbalanced brackets */
        int character, string, comment; /* boolean values set to IN or OUT */
    
        parens = brackets = braces = 0;
        character = string = comment = OUT;
        while ((c = getchar()) != EOF) {
            if (c == '\\') {    /* skip escape sequences */
                getchar();
                c = getchar();
            }
            if (c == '/') {
                if ((c = getchar()) == '*')
                    comment = IN;
            } else if (c == '*')
                if ((c = getchar()) == '/')
                    comment = OUT;
            if (c == '(')
                ++parens;
            else if (c == ')')
                --parens;
            else if (c == '[')
                ++brackets;
            else if (c == ']')
                --brackets;
            else if (c == '{')
                ++braces;
            else if (c == '}')
                --braces;
            else if (c == '\'') {
                if (character == IN)
                    character = OUT;
                else
                    character = IN;
            } else if (c == '"') {
                if (string == IN)
                    string = OUT;
                else
                    string = IN;
            }
            if (parens < 0) {
                printf("Error: unbalanced parentheses\n");
                parens = 0; /* avoid printing message for every character read */
            } else if (brackets < 0) {
                printf("Error: unbalanced brackets\n");
                brackets = 0;
            } else if (braces < 0) {
                printf("Error: unbalanced braces\n");
                braces = 0;
            }
        }
        if (parens != 0)
            printf("Error: unbalanced parentheses\n");
        if (brackets != 0)
            printf("Error: unbalanced brackets\n");
        if (braces != 0)
            printf("Error: unbalanced braces\n");
        if (character == IN)
            printf("Error: single quote expected\n");
        if (string == IN)
            printf("Error: double quote expected\n");
        if (comment == IN)
            printf("Error: '*/' expected\n");
        return 0;
    }

Note: we check for comments first because if the program contained a slash followed by a character that is not an asterisk, or vice versa, the second character still needs to go through the series of if-statements.

Note: notice how the if-statement that checks for a slash has braces after it. They are necessary because otherwise, the else-clause afterward would correspond to the nested if-statement instead.

However, as you might have realized already, this approach is rather inaccurate. For example, the input {(}) is considered valid, but '}' is not. Let us go back to the drawing board and approach this problem in a different way.

The most intuitive solution to this exercise involves using recursion. While it has not yet been explicitly covered in the book, the idea is fairly easy to grasp. Recursion is what happens when you call a function within its own body. One might think this would cause the compiler to throw an error, but instead, the program will enter an infinite loop. In order to avoid this, the function call needs to be wrapped in an if-statement, so that when the condition is no longer true, the loop breaks. To illustrate this, the code

    int i;
    
    for (i = 0; i < 3; ++i)
        printf("%d\n", i);
would be written as follows when using a recursive approach:
    void recursive(int i)
    {
        if (i < 3) {
            ++i;
            printf("%d\n", i);
            recursive(i);
        }
    }
    
    recursive(0);

Note: the order of the statements in recursion is of utmost importance. For example, if ++i came after the recursive function call, it would never be reached.

What if we moved the recursive call before the printf call?

    void recursive(int i)
    {
        if (i < 3) {
            ++i;
            recursive(i);
            printf("%d\n", i);
        }
    }
    
    recursive(0);
Then, we get the output
    3
    2
    1

because recursive calls itself until i is equal to three, at which point it executes the printf statement. Afterward, the program goes back to the second call of recursive (when i was equal to two) and prints i, and then the program goes to the first call of recursive (when i was equal to one) and prints i. In essence, it runs as follows, except whenever i increments, it only applies to that instance of the function call:

    void recursive(int i)
    {
        if (i < 3) {
            ++i;
            if (i < 3) {
                ++i;
                if (i < 3) {
                    ++i;
                    if (i < 3) {    /* the condition is not met */
                        ++i;
                        recursive(i);
                        printf("%d\n", i);
                    }
                    printf("%d\n", i);
                }
                /* go back to second call of recursive */
                printf("%d\n", i);
            }
            /* go back to first call of recursive */
            printf("%d\n", i);
        }
    }
    
    recursive(0);

Let us get back to the problem at hand. The first thing we do is create a function called read_block, which reads in a code block until its closing character or EOF is reached and returns the symbolic constant YES if the block is valid. If we disregard strings and comments for now, a block is valid if it meets three requirements:

  1. Its closing sequence is found
  2. There are no extra closing brackets
  3. All the blocks within it are also valid

The first two requirements are easy to implement:

    #define YES 1
    #define NO 0
    
    /* read_block:  reads a block until its closing sequence or EOF is reached;
        returns YES block is valid, NO if block is invalid */
    int read_block(int end)
    {
        int c;
    
        while ((c = getchar()) != EOF) {
            if (c == end)
                return YES;
            /* check for closing braces without a corresponding opening brace */
            else if (c == ')' || c == ']' || c == '}')
                return NO;
        }
        return NO;
    }

The first block we read in will have the closing sequence EOF; in other words, it is the entire program. Then, every time we come across an opening bracket, we call read_block recursively and assign its return value to a variable called valid, which keeps track of whether the current block is valid. Whenever a block is no longer valid, we exit the while-loop and return NO. This way, if any instance of read_block ever returns NO, all its parent blocks are rendered invalid too.

    #include <stdio.h>
    
    #define YES 1
    #define NO 0
    
    int read_block(int end);
    
    main()
    {
        int valid;  /* whether the C program is valid or not */
    
        if ((valid = read_block(EOF)) == NO)
            printf("syntax error!\n");
        printf("syntax error!\n");
    }
    
    /* read_block:  reads a block until its closing sequence or EOF is reached;
        returns YES block is valid, NO if block is invalid */
    int read_block(int end)
    {
        int c;
        int valid;  /* whether current block is valid */
    
        valid = YES;
        while ((c = getchar()) != EOF && valid == YES) {
            if (c == end)
                return YES;
            /* check for closing braces without a corresponding opening brace */
            else if (c == ')' || c == ']' || c == '}')
                return NO;
            else if (c == '(')
                valid = read_block(')');
            else if (c == '[')
                valid = read_block(']');
            else if (c == '{')
                valid = read_block('}');
        }
        if (c != EOF)
            return NO;
        else
            return valid;
    }

Note: we normally exit the while-loop when a block is not valid. The one exception is the caller in main, which will also exit the loop when the entire program is finished being read. This is the reason for the if-statement at the bottom.

Most of the hard work has now been done and adding support for strings and comments is relatively easy. We create the two functions read_cmt and read_str which when called read in all the characters up until their closing sequences. In the case of read_str, we also pass in the kind of quote as an argument and make sure to ignore escape sequences. Both functions return YES if their closing sequence is found, and NO if they read in an EOF. Similar to the brackets, the functions get called when their opening sequences are found in read_block, and their return value is assigned to valid.

    #include <stdio.h>
    
    #define YES 1
    #define NO 0
    
    int read_block(int end);
    int read_str(char quote);
    int read_cmt(void);
    
    main()
    {
        int valid;  /* whether the C program is valid or not */
    
        if ((valid = read_block(EOF)) == NO)
            printf("syntax error!\n");
        return 0;
    }
    
    /* read_block:  reads a block until its closing sequence or EOF is reached;
        returns YES block is valid, NO if block is invalid */
    int read_block(int end)
    {
        int c;
        int valid;  /* whether current block is valid */
    
        valid = YES;
        while ((c = getchar()) != EOF && valid == YES) {
            if (c == '/') {
                c = getchar();
                if (c == '*')
                    if ((valid = read_cmt()) == YES)
                        c = getchar();
            }
            if (c == end)
                return YES;
            /* check for closing braces without a corresponding opening brace */
            else if (c == ')' || c == ']' || c == '}')
                return NO;
            else if (c == '(')
                valid = read_block(')');
            else if (c == '[')
                valid = read_block(']');
            else if (c == '{')
                valid = read_block('}');
            else if (c == '\'')
                valid = read_str('\'');
            else if (c == '"')
                valid = read_str('"');
        }
        if (c != EOF)
            return NO;
        else
            return valid;
    }
    
    /* read_str:  reads a quoted string and returns YES if it terminates */
    int read_str(char quote)
    {
        int c;
    
        while ((c = getchar()) != EOF) {
            if (c == quote)
                return YES;
            /* ignore second character in escape sequence */
            if (c == '\\')
                getchar();
        }
        return NO; /* end quote missing */
    }
    
    /* read_cmt:  reads a comment and returns YES if it terminates */
    int read_cmt(void)
    {
        int c;
    
        while ((c = getchar()) != EOF)  
            if (c == '*')
                if ((c = getchar()) == '/')
                    return YES;
        return NO; /* comment never terminates */
    }