Building a PL/SQL Code Parser (using PL/SQL), Part 1
Is it possible to build a PL/SQL parser using nothing but PL/SQL? To answer that question, I guess I need to define “PL/SQL parser”. What would be the intention of this parser?
I would like a way to parse a PL/SQL code block and let me determine some statistics: number of lines, number of keywords, number of DML statements. It would need to recognize comments. I also want to be able to format the code for pretty printing. I want to be able to apply user defined checks for adherence to coding standards. Finally, I want to be able to pull the code from the data dictionary.
Possible? I think so. There are languages that would make this an easier task but I want to use PL/SQL. In addition to being a useful tool, I think a parser like this would make a fun project. So that’s what I’m going to do.
PL/SQL is a HUGE language. The number of keywords and all of the possible combinations are daunting. Throw in SQL (which is huge in it’s own right) and a parser is just short of nightmarish. Fun, huh!
Like any problem, if we think about how hard it is, we’ll never get started. So, I am going to reduce the task a bit and give myself a starting point. I’m going to give myself some base rules to begin with. I can add to these rules later.
PL/SQL code is made up of four components Keywords, Operators, Comments and Data. Each component is separated by a delimiter. Data may be literals and variables but data may also be non-keyword function and procedure calls. Keywords are any command or instruction that is part of the language.
To make it easy to get started, we are going to say that anything surrounded by single quotes is user data. There are other sting delimiters but I am going to ignore those for now. Also, inside of a quoted string, all keywords, comments and operators are ignored. So, ‘lalala BEGIN lalala’ = DATA. ‘lalala’ BEGIN ‘lalala’ = DATA KEYWORD DATA.
Comments begin with — or /*. // ends with the EOL. /* ends with */. Everything inside of a comment is ignored.
We assume that the code is valid and compiled. We will pull the code from the data dictionary rather than from a source file.
We assume that the only keywords that we need to worry about are these: ‘BEGIN’, ‘PROCEDURE’, ‘FUNCTION’, ‘DECLARE’, ‘SELECT’, ‘FROM’, ‘TYPE’, ‘IS’, ‘TABLE’, ‘OF’, ‘VARCHAR2′, ‘AS’, ‘PACKAGE’, ‘BODY’, ‘LOOP’, ‘END, ‘RETURN’, ‘IN’, ‘FOR’, ‘..’, ‘%’, ‘END’, ‘EXCEPTION’, ‘WHEN’, ‘IF’, ‘THEN’, ‘%TYPE’.
Now I begin my coding. I’m going to use multiple procedures and functions so I will create a package. I’ll start with a shell package and a simple line parser. To start with, I am just going to write a parser that will extract quoted strings. That will prove the concept and give me a starting place for my package.
Amazon uses a now famous approach to developing software, start at the customer and work backwards. I really like that approach. I will start with a sample procedure and write a parser that will parse that.
The sample code:
CREATE OR REPLACE PROCEDURE yada AS BEGIN DBMS_OUTPUT.PUT_LINE('string with a space' || 'string with emb''edd''''ed quotes'); END;
Very simple. It is a dbms_output of two strings concatenated together.
So, my first pass of the parser will just extract the two strings that exist in the procedure. I plan to populate a record type eventually so I will make my line parser return a value. Since I have not defined my record type, I will just return a varchar2 for now. The function will accept a string and parse it into its components.
Eventually, this code will pull a procedure from the data dictionary. For now, I will just manually pass in the code I want parsed.
Here is the package spec:
create or replace PACKAGE lrc_plsql_parser AS FUNCTION parse_line ( p_string IN VARCHAR2 ) RETURN VARCHAR2; END lrc_plsql_parser;
And here is the body:
create or replace PACKAGE BODY lrc_plsql_parser AS TYPE a_keys IS TABLE OF VARCHAR2(32000); -- Strings identified as pl/sql quotes v_quotes a_keys := a_keys(''''); -- Function to test for break conditions FUNCTION is_a_break ( p_string IN VARCHAR2, p_break_type IN VARCHAR2 ) RETURN BOOLEAN; -- Main line parers FUNCTION parse_line ( p_string IN VARCHAR2 ) RETURN VARCHAR2 IS -- Dummy return value, not currently used v_string VARCHAR2(32000); -- Working string, holds intermediate results v_working_data VARCHAR2(32000); -- Easier to work on upper case v_upper_p VARCHAR2(32000) := UPPER(p_string) || ' '; -- Current character in buffer v_char VARCHAR2(1); -- Next character in buffer v_char_peek VARCHAR2(1); -- Are we currently working on a string l_quotes BOOLEAN := FALSE; -- How many quote chars have been found l_quote_count INTEGER := 0; BEGIN -- Display the output DBMS_OUTPUT.PUT_LINE( 'Input: ' || v_upper_p ); -- Loop through the string line by line FOR i IN 1..length(v_upper_p) LOOP v_char := SUBSTR(v_upper_p, i, 1); v_char_peek := SUBSTR(v_upper_p, i+1, 1); CASE -- When the current character is a quote WHEN is_a_break(v_char, 'QUOTE') THEN CASE -- If we are in the midst of a string and the next char is also a quote WHEN l_quotes AND is_a_break(v_char_peek, 'QUOTE') THEN -- Continue getting data v_working_data := v_working_data || v_char; -- Undo the nested quote count l_quote_count := l_quote_count - 1; -- If we are in the midst of a string and the next char is NOT a quote -- AND the number of quotes so far is not even (this one would make it even) WHEN l_quotes AND NOT is_a_break(v_char_peek, 'QUOTE') AND mod(l_quote_count,2) != 0 THEN -- Continue getting data v_working_data := v_working_data || v_char; -- Write out a string DBMS_OUTPUT.PUT_LINE('Data: ' || v_working_data ); -- Clear the working buffer v_working_data := NULL; -- Turn off quote mode l_quotes := FALSE; -- Reset quote count l_quote_count := 0; -- If we are in the midst of a string and the next char is NOT a quote -- AND the number of quotes so far is even (this one would make it odd) WHEN l_quotes AND NOT is_a_break(v_char_peek, 'QUOTE') AND mod(l_quote_count,2) = 0 THEN -- Undo the nested quote count l_quote_count := l_quote_count - 1; -- Continue getting data v_working_data := v_working_data || v_char; WHEN NOT l_quotes THEN -- Increment quote count l_quote_count := l_quote_count + 1; -- Start up quote mode l_quotes := TRUE; -- Start the working string v_working_data := v_char; ELSE -- Continue getting data v_working_data := v_working_data || v_char; END CASE; ELSE -- Default case, most visited piece of code -- Continue getting data v_working_data := v_working_data || v_char; END CASE; END LOOP; RETURN v_string; END; FUNCTION is_a_break ( p_string IN VARCHAR2, p_break_type IN VARCHAR2 ) RETURN BOOLEAN AS BEGIN CASE WHEN p_break_type = 'QUOTE' THEN FOR i IN v_quotes.FIRST..v_quotes.LAST LOOP IF v_quotes(i) = p_string THEN RETURN TRUE; END IF; END LOOP; ELSE NULL; END CASE; RETURN FALSE; END is_a_break; END lrc_plsql_parser;
The code is not that complex (yet). Remember this is just to extract the strings.
I have a simple anonymous block to test the procedure (quotes are doubled on purpose):
SET SERVEROUTPUT ON SIZE UNLIMITED DECLARE v_string VARCHAR2(32000); BEGIN v_string := lrc_plsql_parser.parse_line( 'CREATE PROCEDURE yada AS BEGIN DBMS_OUTPUT.PUT_LINE(' || ' ''string with a space''|| ''string with ' || ' emb''''edd''''''''ed quotes''); ' || 'END;'); END;
The results look like this:
Input: CREATE PROCEDURE YADA AS BEGIN DBMS_OUTPUT.PUT_LINE('STRING WITH A SPACE'|| 'STRING WITH EMB''EDD''''ED QUOTES'); END; Data: 'STRING WITH A SPACE' Data: 'STRING WITH EMB''EDD''''ED QUOTES' PL/SQL procedure successfully completed. SQL>
Not bad for a couple of hours on a saturday. Next will be adding comment extraction and rearranging the code to make it a bit cleaner.
You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.