Exercise 1-23

Write a program to remove all comments from a C program. Don't forget to handle quoted strings and character constants properly. C comments do not nest.

We start with the following program, which copies its input to its output one character at a time:

    #include <stdio.h>
    
    int main(void)
    {
        int c;
    
        while ((c = getchar()) != EOF)
            putchar(c);
        return 0;
    }

The character pair /* marks the start of a comment and the pair */ marks its end. We introduce the variable state, which is assigned the symbolic constants IN and OUT to keep track of whether we are inside or outside a comment. While we are inside a comment, we discard characters until we reach its end, at which point we resume copying characters to the output.

To implement this, when we come across a slash, we check if the next character is an asterisk; if it is not, we print the slash and continue copying characters. Otherwise, we set state to IN and enter a while-loop that runs until state changes to OUT. This will occur when we read in an asterisk followed by a slash. Once we exit the while-loop, we read in the next character.

    #include <stdio.h>
    
    #define IN 1    /* inside a comment */
    #define OUT 0   /* outside a comment */
    
    int main(void)
    {
        int c;
        int state;  /* whether inside or outside of a comment */
    
        state = OUT;
        while ((c = getchar()) != EOF) {
            if (c == '/') {
                c = getchar();
                if (c == '*') {
                    state = IN;
                    c = getchar();
                    while (state == IN && c != EOF) {
                        if (c == '*') {
                            c = getchar();
                            if (c == '/')
                                state = OUT;
                        } else
                            c = getchar();
                    }
                    c = getchar();
                } else
                    putchar('/');
            }
            if (c != EOF)   /* the input may end right after a slash or a comment */
                putchar(c);
        }
        return 0;
    }
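
A comment may end with a run of asterisks, as in /* note **/, so the end test has to work when the asterisk before the closing slash was itself preceded by another asterisk. One way to express that, sketched here on a string rather than on the input stream, is to remember the previous character. The function name skip_comment and its array interface are assumptions made for illustration; they are not part of the program above.

```c
/* Hypothetical helper: s[i] is the first character after an opening
   slash-star. Returns the index just past the closing star-slash, or
   the index of the terminating '\0' if the comment is never closed. */
int skip_comment(const char s[], int i)
{
    int prev;   /* character seen on the previous step */

    prev = 0;
    while (s[i] != '\0') {
        if (prev == '*' && s[i] == '/')
            return i + 1;       /* first character after the comment */
        prev = s[i];
        ++i;
    }
    return i;                   /* unterminated comment */
}
```

For example, skip_comment("/* a **/x", 2) returns 8, the index of the x.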

There is one more thing we have not accounted for: string constants. For example, if the input contains the string "/*", our program treats the /* as the start of a comment even though it is part of a string. To avoid this, when we come across a double quote, we copy everything up to the closing quote without looking for comments. We also have to be wary of the escape sequence \", which does not end the string, so when we come across a backslash inside a string, we print the character that follows it without examining it. Finally, when we read in and print a single quote, we read in and print the rest of the character constant (two more characters, or three when the constant is an escape sequence such as '\n') before moving on to the next iteration of the loop. This keeps the character constants '"' and '\"' from being mistaken for the start of a string.

    #include <stdio.h>
    
    #define IN 1    /* inside a comment */
    #define OUT 0   /* outside a comment */
    
    int main(void)
    {
        int c;
        int state;  /* whether inside or outside of a comment */
    
        state = OUT;
        while ((c = getchar()) != EOF) {
            if (c == '/') {
                c = getchar();
                if (c == '*') {
                    state = IN;
                    c = getchar();
                    while (state == IN && c != EOF) {
                        if (c == '*') {
                            c = getchar();
                            if (c == '/')
                                state = OUT;
                        } else
                            c = getchar();
                    }
                    c = getchar();  /* first character after the comment */
                } else
                    putchar('/');
            }
            if (c == '"') {
                putchar(c);
                while ((c = getchar()) != '"' && c != EOF) {
                    putchar(c);
                    /* copy the character after a backslash without examining it */
                    if (c == '\\')
                        putchar(getchar());
                }
            }
            if (c != EOF)
                putchar(c);
            if (c == '\'') {    /* avoid treating '"' as start of string */
                if ((c = getchar()) == '\\') {
                    putchar(c);
                    c = getchar();
                }
                putchar(c);         /* the character itself */
                putchar(getchar()); /* the closing single quote */
            }
        }
        return 0;
    }

Note: the statement putchar(getchar()) might look strange at first, but it is valid: getchar returns the next character in the input stream, and that return value is passed directly to putchar as its argument.

Note: we deliberately use a plain if, not an "else-if", when checking for double quotes. The character that follows a comment may itself open a string, so it still has to go through the double-quote check. To see the difference, try the input /* comment */"/* string" with and without the else-clause.
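
The program above relies on reading ahead inside nested branches, and some inputs can defeat that, for example two comments with nothing between them, or a character constant with a longer escape such as '\101'. One way to depend less on look-ahead is to examine exactly one character per step and keep everything else in state. The following is a sketch of that idea, not the solution above: it works on a string instead of the input stream so that it is easy to test, it deletes comments outright as the program above does, and the names strip_comments, CODE, SLASH, COMMENT, STAR, and QUOTED are assumptions introduced for illustration.

```c
#define CODE    0   /* ordinary program text */
#define SLASH   1   /* saw a slash that may open a comment */
#define COMMENT 2   /* inside a comment */
#define STAR    3   /* inside a comment, just saw an asterisk */
#define QUOTED  4   /* inside a string or character constant */

/* Hypothetical helper: copies src to dst with comments removed.
   dst must have room for at least as many characters as src. */
void strip_comments(const char src[], char dst[])
{
    int i, j;       /* read and write positions */
    int c;          /* current character */
    int state;
    int quote;      /* delimiter that ends the current QUOTED run */
    int escaped;    /* nonzero if the previous character was a backslash */

    state = CODE;
    quote = escaped = 0;
    j = 0;
    for (i = 0; (c = src[i]) != '\0'; ++i) {
        if (state == COMMENT) {
            if (c == '*')
                state = STAR;
        } else if (state == STAR) {
            if (c == '/')
                state = CODE;       /* comment closed */
            else if (c != '*')
                state = COMMENT;    /* a run of asterisks stays in STAR */
        } else if (state == QUOTED) {
            dst[j++] = c;
            if (escaped)
                escaped = 0;        /* escaped character, never a delimiter */
            else if (c == '\\')
                escaped = 1;
            else if (c == quote)
                state = CODE;
        } else if (state == SLASH && c == '*') {
            state = COMMENT;        /* the held slash opened a comment */
        } else {
            /* CODE, or SLASH followed by something other than an asterisk */
            if (state == SLASH)
                dst[j++] = '/';     /* the held slash was ordinary text */
            if (c == '/')
                state = SLASH;      /* hold it until the next character */
            else {
                dst[j++] = c;
                if (c == '"' || c == '\'') {
                    state = QUOTED;
                    quote = c;
                } else
                    state = CODE;
            }
        }
    }
    if (state == SLASH)
        dst[j++] = '/';             /* input ended on a lone slash */
    dst[j] = '\0';
}
```

For example, given "a/* x *//* y */b" the function produces "ab", and it leaves a string constant such as "/* s */" unchanged. The same transitions can be driven from a getchar loop by replacing the array reads and writes with getchar and putchar.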