Building a PL/SQL Code Parser (using PL/SQL), Part 1

LewisC’s An Expert’s Guide to Oracle Technology

Is it possible to build a PL/SQL parser using nothing but PL/SQL? To answer that question, I guess I need to define “PL/SQL parser”. What would be the intention of this parser?

I would like a way to parse a PL/SQL code block and let me determine some statistics: number of lines, number of keywords, number of DML statements. It would need to recognize comments. I also want to be able to format the code for pretty printing. I want to be able to apply user defined checks for adherence to coding standards. Finally, I want to be able to pull the code from the data dictionary.

Possible? I think so. There are languages that would make this an easier task but I want to use PL/SQL. In addition to being a useful tool, I think a parser like this would make a fun project. So that’s what I’m going to do.

Some Assumptions

PL/SQL is a HUGE language. The number of keywords and all of the possible combinations are daunting. Throw in SQL (which is huge in it’s own right) and a parser is just short of nightmarish. Fun, huh!

Like any problem, if we think about how hard it is, we’ll never get started. So, I am going to reduce the task a bit and give myself a starting point. I’m going to give myself some base rules to begin with. I can add to these rules later.

PL/SQL code is made up of four components Keywords, Operators, Comments and Data. Each component is separated by a delimiter. Data may be literals and variables but data may also be non-keyword function and procedure calls. Keywords are any command or instruction that is part of the language.

To make it easy to get started, we are going to say that anything surrounded by single quotes is user data. There are other sting delimiters but I am going to ignore those for now. Also, inside of a quoted string, all keywords, comments and operators are ignored. So, ‘lalala BEGIN lalala’ = DATA. ‘lalala’ BEGIN ‘lalala’ = DATA KEYWORD DATA.

Comments begin with — or /*. // ends with the EOL. /* ends with */. Everything inside of a comment is ignored.

We assume that the code is valid and compiled. We will pull the code from the data dictionary rather than from a source file.

We assume that the only keywords that we need to worry about are these: ‘BEGIN’, ‘PROCEDURE’, ‘FUNCTION’, ‘DECLARE’, ‘SELECT’, ‘FROM’, ‘TYPE’, ‘IS’, ‘TABLE’, ‘OF’, ‘VARCHAR2′, ‘AS’, ‘PACKAGE’, ‘BODY’, ‘LOOP’, ‘END, ‘RETURN’, ‘IN’, ‘FOR’, ‘..’, ‘%’, ‘END’, ‘EXCEPTION’, ‘WHEN’, ‘IF’, ‘THEN’, ‘%TYPE’.

The code:

Now I begin my coding. I’m going to use multiple procedures and functions so I will create a package. I’ll start with a shell package and a simple line parser. To start with, I am just going to write a parser that will extract quoted strings. That will prove the concept and give me a starting place for my package.

Amazon uses a now famous approach to developing software, start at the customer and work backwards. I really like that approach. I will start with a sample procedure and write a parser that will parse that.

The sample code:

CREATE OR REPLACE PROCEDURE yada
AS
BEGIN
  DBMS_OUTPUT.PUT_LINE('string with a space' || 'string with emb''edd''''ed quotes');
END;

Very simple. It is a dbms_output of two strings concatenated together.

So, my first pass of the parser will just extract the two strings that exist in the procedure. I plan to populate a record type eventually so I will make my line parser return a value. Since I have not defined my record type, I will just return a varchar2 for now. The function will accept a string and parse it into its components.

Eventually, this code will pull a procedure from the data dictionary. For now, I will just manually pass in the code I want parsed.

Here is the package spec:

create or replace PACKAGE lrc_plsql_parser
AS

  FUNCTION parse_line
    (
      p_string IN VARCHAR2 )
    RETURN VARCHAR2;

END lrc_plsql_parser;

And here is the body:

create or replace PACKAGE BODY lrc_plsql_parser
AS

  TYPE a_keys IS TABLE OF VARCHAR2(32000);

  -- Strings identified as pl/sql quotes
  v_quotes a_keys := a_keys('''');

  -- Function to test for break conditions
  FUNCTION is_a_break
  (
    p_string     IN VARCHAR2,
    p_break_type IN VARCHAR2 )
  RETURN BOOLEAN;

  -- Main line parers
  FUNCTION parse_line
  (
    p_string IN VARCHAR2 )
  RETURN VARCHAR2
  IS
    -- Dummy return value, not currently used
    v_string VARCHAR2(32000);
    -- Working string, holds intermediate results
    v_working_data VARCHAR2(32000);
    -- Easier to work on upper case
    v_upper_p      VARCHAR2(32000) := UPPER(p_string) || ' ';
    -- Current character in buffer
    v_char         VARCHAR2(1);
    -- Next character in buffer
    v_char_peek    VARCHAR2(1);
    -- Are we currently working on a string
    l_quotes    BOOLEAN := FALSE;
    -- How many quote chars have been found
    l_quote_count INTEGER := 0;
  BEGIN

    -- Display the output
    DBMS_OUTPUT.PUT_LINE( 'Input: ' || v_upper_p );

    -- Loop through the string line by line
    FOR i IN 1..length(v_upper_p)
    LOOP
      v_char      := SUBSTR(v_upper_p, i, 1);
      v_char_peek := SUBSTR(v_upper_p, i+1, 1);

      CASE
      -- When the current character is a quote
      WHEN is_a_break(v_char, 'QUOTE')
      THEN

        CASE
        -- If we are in the midst of a string and the next char is also a quote
        WHEN l_quotes AND
          is_a_break(v_char_peek, 'QUOTE')
        THEN
          -- Continue getting data
          v_working_data := v_working_data || v_char;
          -- Undo the nested quote count
          l_quote_count := l_quote_count - 1;

        -- If we are in the midst of a string and the next char is NOT a quote
        -- AND the number of quotes so far is not even (this one would make it even)
        WHEN l_quotes AND
         NOT is_a_break(v_char_peek, 'QUOTE')
         AND mod(l_quote_count,2) != 0
        THEN

          -- Continue getting data
          v_working_data := v_working_data || v_char;
          -- Write out a string
          DBMS_OUTPUT.PUT_LINE('Data: ' || v_working_data );
          -- Clear the working buffer
          v_working_data := NULL;
          -- Turn off quote mode
          l_quotes := FALSE;
          -- Reset quote count
          l_quote_count := 0;

        -- If we are in the midst of a string and the next char is NOT a quote
        -- AND the number of quotes so far is even (this one would make it odd)
        WHEN l_quotes AND
           NOT is_a_break(v_char_peek, 'QUOTE')
           AND mod(l_quote_count,2) = 0
        THEN

          -- Undo the nested quote count
          l_quote_count := l_quote_count - 1;
          -- Continue getting data
          v_working_data := v_working_data || v_char;
        WHEN NOT l_quotes
        THEN

          -- Increment quote count
          l_quote_count := l_quote_count + 1;
          -- Start up quote mode
          l_quotes := TRUE;
          -- Start the working string
          v_working_data := v_char;

        ELSE

          -- Continue getting data
          v_working_data := v_working_data || v_char;
        END CASE;

      ELSE  -- Default case, most visited piece of code
        -- Continue getting data
        v_working_data := v_working_data || v_char;

      END CASE;

    END LOOP;

    RETURN v_string;

  END;

  FUNCTION is_a_break
  (
    p_string     IN VARCHAR2,
    p_break_type IN VARCHAR2 )
  RETURN BOOLEAN
  AS
  BEGIN
   CASE
    WHEN p_break_type = 'QUOTE' THEN
      FOR i          IN v_quotes.FIRST..v_quotes.LAST
      LOOP
        IF v_quotes(i) = p_string THEN
          RETURN TRUE;
        END IF;
      END LOOP;
    ELSE
      NULL;
    END CASE;
    RETURN FALSE;
  END is_a_break;

END lrc_plsql_parser;

The code is not that complex (yet). Remember this is just to extract the strings.

I have a simple anonymous block to test the procedure (quotes are doubled on purpose):

SET SERVEROUTPUT ON SIZE UNLIMITED

DECLARE
  v_string VARCHAR2(32000);
BEGIN
  v_string :=
    lrc_plsql_parser.parse_line(
          'CREATE PROCEDURE yada AS BEGIN DBMS_OUTPUT.PUT_LINE('  ||
          ' ''string with a space''|| ''string with ' ||
          ' emb''''edd''''''''ed quotes''); ' ||
          'END;');
END;

The results look like this:

Input: CREATE PROCEDURE YADA AS BEGIN DBMS_OUTPUT.PUT_LINE('STRING WITH A
SPACE'|| 'STRING WITH EMB''EDD''''ED QUOTES'); END;
Data: 'STRING WITH A SPACE'
Data: 'STRING WITH EMB''EDD''''ED QUOTES'

PL/SQL procedure successfully completed.

SQL>

Not bad for a couple of hours on a saturday. Next will be adding comment extraction and rearranging the code to make it a bit cleaner.

LewisC

Technorati : , , , ,

You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

2 Comments »