Regular Expressions in Java

Package com.stevesoft.pat version 1.5.3

Writing a Java Beautifier for JBuilder

Intro:

I was naturally pleased when Borland expressed interest in using my regular expression software in their excellent JBuilder product to make a wizard to "beautify" Java source code. By beautify, of course, I mean indent the code in such a way as to best reveal its logical structure. This can aid the reading of code by others, as well as expose logical flaws that may be present in the code. Those of you who are familiar with UNIX will remember the cb command ("C beautifier"), written around 1980 by Lorinda Cherry, that does something similar for code written in C. The cb command was an advance over the "pretty printers" of its day in that it produced valid code that could be compiled: something that was pretty for the compiler as well as the human eye.


The JavaBeautifier that I will describe here follows in her tradition. I will describe the writing of a JavaBeautifier for JBuilder in two parts.

The first part will describe how I hooked my code into JBuilder. It will provide sufficiently complete information so that other developers could take their own application and hook it into JBuilder. I will not attempt to describe the full set of options available to someone working on this sort of project, however. I will limit my discussion to what is relevant for this application. For the purpose of this part of the discussion the JavaBeautifier is simply a black box, a piece of code that takes an InputStream, reads it, transforms it, and writes it to an OutputStream.

The second part will focus on the inner logic of the JavaBeautifier itself. This section will also have several parts. The first part (2 a) will describe regular expressions, what they are, what sorts of things they are useful for, in general, and what they were used for in this project. In short, regular expressions are an extremely concise and versatile method for finding patterns in text and extracting information from the matched portion of text. The second part (2 b, and 2 c) will describe an outline of the other logic used by the beautifier.

Part 1: Sorcerer's Apprenticeship:
1 a: Getting on the List of JBuilder Wizards:

Getting a product loaded into JBuilder is quite straightforward. During its startup phase it looks for classes to load by scanning the jar files in its classpath. It searches the manifest file of the jar for a line telling it that a Wizard is present. The line that is needed in this case is:

  OpenTools-Wizard: com.stevesoft.jbeaut.JBeautWizard

When it finds this entry, it knows to call the initOpenTool() method provided by the JBeautWizard class. A first attempt at using this functionality is illustrated below.

public static void
  initOpenTool(int major, int minor) {
    System.out.println("Abra Cadabra."); }

When we start JBuilder we will now see the message "Abra Cadabra." So far so good, now to do something more useful.
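For readers curious what such a manifest lookup might look like, here is a minimal sketch using the standard java.util.jar.Manifest class. It is not JBuilder's own code; the class name ManifestLookup and the method wizardClass() are hypothetical names chosen for illustration, and only the attribute name and value come from the manifest line shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Manifest;

// Hypothetical helper: NOT part of JBuilder, just an illustration
// of reading a main attribute from manifest text.
final class ManifestLookup {

    // Returns the value of the OpenTools-Wizard attribute,
    // or null if the manifest does not declare one.
    static String wizardClass(String manifestText) {
        try {
            Manifest mf = new Manifest(new ByteArrayInputStream(
                manifestText.getBytes(StandardCharsets.UTF_8)));
            return mf.getMainAttributes().getValue("OpenTools-Wizard");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Manifest text must end with a newline.
        String text = "Manifest-Version: 1.0\n"
            + "OpenTools-Wizard: com.stevesoft.jbeaut.JBeautWizard\n";
        System.out.println(wizardClass(text));
    }
}
```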

To make a Wizard appear on the Wizard menu item, I first needed to define a WizardAction. The WizardAction needed to have three methods: A constructor, an update(), and a createWizard() method. The constructor will set the name of the Wizard, the short-cut key, and a help message. It will also need to tell JBuilder whether or not this is to be a "Gallery Wizard." A gallery wizard shows up when you click on the File menu and select "New." Since JavaBeautifier will not create something new but merely process an existing java source code file we don't want to make a gallery wizard. Our wizard will show up under the "Wizards" menu. You can see how this was done in Listing 1.

Now that we have a WizardAction we can make a more useful call to initOpenTool().

public static void
  initOpenTool(int major, int minor) {
  WizardManager
    .registerWizardAction(new JBeautAction()); }

We don't want the menu item for the JavaBeautifier to be always enabled. Rather, we want it enabled when the current file is a java source file and is writeable. The update() method is called whenever the "Wizard" menu is displayed, so this is where we put the logic to determine whether or not to enable the JavaBeautifier. There should not be a long delay when a user clicks on this menu, so any update() method should be small and fast. You can see the details of how this was implemented in Listing 2.

Finally, we need to define the createWizard() method. All this requires is to create a JBeautWizard object with a null constructor and return it. The interesting part is in the next steps: creating a WizardPage to obtain parameters from the user and in creating the Wizard itself.

1 b: The JBeautWizardPage:

The JBeautWizardPage is an extension of a class called BasicWizardPage. Basically, this Object will provide a GUI interface for the user. Upon initialization, all the relevant components and their layouts need to be set. I won't go into the details of this part. What I will discuss is the parameters that are obtained from the user from this page. We need to know two things about them: how to make sure they are sane, and how to remember the values of these parameters between invocations.

To check for valid parameters one needs to define a method called checkPage(), which is called to make sure the options to the beautifier are not insane (e.g. a negative indentation size). If it detects a problem it displays an error message and throws a VetoException. See Listing 3.

To make parameters persist between invocations of the JBeautWizardPage we need to store them in the node via the NodeProperties class. Each NodeProperty has a category, a property name, and value. These are provided in order in the constructor. A good place to save the properties to your project is at the end of checkPage(). At that point the user has finished inputting values and the values have passed a sanity check.
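The validate-then-save pattern can be tried outside JBuilder with a small stand-in. In the sketch below, java.util.Properties plays the role of the project node, and the class TabSizeOption, its key name, and InvalidOptionException are all hypothetical substitutes for NodeProperty and VetoException. The real JBuilder classes are not used here.

```java
import java.util.Properties;

// Stand-alone sketch of "check, then persist". Every name here is
// an assumption made for illustration, not a JBuilder API.
final class TabSizeOption {

    // Stand-in for VetoException (unchecked here for brevity).
    static final class InvalidOptionException extends RuntimeException {
        InvalidOptionException(String message) { super(message); }
    }

    static final String KEY = "JavaBeautifier.tabSize"; // category + name
    static final String DEFAULT = "4";

    // Read the value saved by the previous invocation, or the default.
    static int load(Properties store) {
        return Integer.parseInt(store.getProperty(KEY, DEFAULT));
    }

    // Reject insane values first; save only after the check passes.
    static void checkAndSave(Properties store, int tabSize) {
        if (tabSize < 1 || tabSize > 16) {
            throw new InvalidOptionException("Invalid tab size: " + tabSize);
        }
        store.setProperty(KEY, Integer.toString(tabSize));
    }
}
```

Saving only after validation means a rejected value never overwrites the last good setting.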

1 c: The JBeautWizard:

The wizard has two methods that need to be defined. The first is the constructor. Here it adds the JBeautWizardPage that will be used to obtain options from the user. The second is the finish() method. The finish() method is invoked when the user clicks the "Finish" button on the JBeautWizardPage.

In the constructor one can set the title (as it appears on the title bar on the top of the JBeautWizardPage). This title will appear above each WizardPage in the Wizard. One can also instantiate an instance of the JBeautWizardPage and call the addWizardPage() method to associate it with this instance of the Wizard. In principle one could add any number of such pages to the Wizard. At the moment, the JavaBeautifier has just a single page.

The code for the finish() method (see Listing 5) shows you how to obtain a Node, then open an InputStream and an OutputStream for that node. Once you have these, you can call the JavaBeautifier.

2: Beautifying the code:
2 a: Breaking the source code up into pieces

Parsing a Java source file is at least moderately tricky. One must identify comments, curly braces, parentheses and the like, yet not get confused when these same text patterns occur within String constants or comments. A curly brace has no relevance to indentation inside a comment. The key simplification here is to break the code down into tokens, i.e. small logical units of code (some examples of the tokens I identified: a keyword, a variable name, an operator, a comment), and analyze the tokens. Typically one uses what is called a regular expression to do this, which is a kind of cryptic code that describes a pattern of text in a rather concise manner. Regular expressions look a bit scary to the uninitiated, but after you get the hang of them you can't imagine life without them. Well, okay, you have to be a bit of a geek to feel quite that strongly. Let's just say they make your programming tasks easier and you'll find use for them in any number of applications.

I am going to provide a few examples of regular expressions so that you can get an idea of what they look like and what they can do.

PATTERN                     MATCHES...
\d                          a digit
\d+                         one or more digits
\d\d/\d\d/\d{4}             a date of the format MM/DD/YYYY
\s                          a white space (' ', '\b', '\t', '\r', '\n')
\s*                         zero or more white spaces
\(\d{3}\)\s*\d{3}-\d{4}     a 10-digit phone number
.                           anything but a \n
//.*                        a comment
/\*.*?\*/                   the other kind of comment
(a|b)                       either the letter 'a' or the letter 'b'

With a tool like this you can write something that searches for phone numbers within a text file in two lines of code -- one line to compile the pattern and one line to call search(). Rather than go into the gory details of how to make a regular expression here, I'll point you to a short tutorial on using the regular expression compiler (written in Java) that I used for the Java Beautifier I built for JBuilder. It can be found at http://javaregex.com/pat/tutorial/tutorial.html. You might also wish to check out other sites where regular expressions are discussed such as http://www.perl.com.
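As a rough illustration of the "compile, then search" idea, here is a sketch that looks for phone numbers using the phone-number pattern from the table. Note that it uses the standard java.util.regex package rather than the com.stevesoft.pat package this article is built on, so the method names differ from the search() call mentioned above. The class name PhoneFinder is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch using java.util.regex (a substitute for com.stevesoft.pat).
final class PhoneFinder {

    // The phone-number pattern from the table, escaped for a Java string.
    private static final Pattern PHONE =
        Pattern.compile("\\(\\d{3}\\)\\s*\\d{3}-\\d{4}");

    // Collect every phone number found in the text, in order.
    static List<String> findPhones(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [(555) 123-4567, (555)765-4321]
        System.out.println(findPhones(
            "Call (555) 123-4567 or (555)765-4321 before 01/02/2000."));
    }
}
```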

In the JavaBeautifier I simply supplied the regular expression compiler with a set of patterns, one for each type of token I wished to identify in Java source code (e.g. quoted string, comment, operator, word, etc.). The regular expression search method returns each time it finds one of these, whichever starts first. This means that I won't be confused by comments appearing inside quoted strings or vice-versa. If I encounter a line of code like the following

do_something(); // "hello"

the '// "hello"' will parse as a single comment token. The pattern '//.*' will match starting at an earlier position within the text than the pattern for the quoted string. Likewise, consider this line of code:

out.println("// generate a comment")

Here the beautifier will identify "// generate a comment" as a quoted string and won't be confused into seeing a comment. The pattern for the quoted string will match starting at an earlier position, will match the whole string, and the pattern matcher will resume looking through the text at a position after the completed match.
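The "whichever starts first" behavior can be reproduced with a single alternation, one branch per token type. The sketch below again uses java.util.regex instead of com.stevesoft.pat, and its patterns, group names, and the class TokenSketch are simplified assumptions for illustration, not the patterns the JavaBeautifier actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative tokenizer; patterns and names are assumptions.
final class TokenSketch {

    // Group names double as token kinds.
    private static final String[] KINDS =
        {"comment", "string", "word", "space", "other"};

    // One alternative per token kind. The matcher reports the match
    // that starts earliest and then resumes after the end of it.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<comment>//.*|/\\*[\\s\\S]*?\\*/)"
        + "|(?<string>\"(?:\\\\.|[^\"\\\\\\n])*\")"
        + "|(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"
        + "|(?<space>\\s+)"
        + "|(?<other>.)");

    // Returns tokens as "kind:text", dropping white space.
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    if (!kind.equals("space")) {
                        tokens.add(kind + ":" + m.group());
                    }
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("do_something(); // \"hello\""));
        System.out.println(tokenize("out.println(\"// generate a comment\")"));
    }
}
```

In the first line the comment branch wins because "//" starts before the quote; in the second the string branch wins because the quote starts before "//".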

2 b: Arranging the pieces

In some sense everything I've discussed up to this point has been nothing more than preparation for beautifying the code. The real work lies ahead. The goal in what follows is to try and figure out the simplest most general rules to get the indentation right. This is, of course, the tricky part and despite several attempts at reworking this JavaBeautifier it is still a bit more complex than I'd hoped.

There are a number of issues to tackle in getting the beautifier to produce correct indentation. For example, blocks of code following conditional statements may or may not be enclosed in curly brackets. We may encounter code that looks like this:

if(a==2) {
  foo();
}
or this
if(a==2)
  foo();

In order to properly indent the code we need to figure out where curly brackets would be placed if the user had typed them. Moreover, we need to correctly associate else blocks with if blocks and not get confused if the user throws comments in like this (see Listing 4).

In order to make it easier to process the code, I made it possible to view the tokens in two ways. The first is simply a linear list. The second is a hierarchical arrangement of lists and sub-lists. We want to arrange things so that the Token object for each "(" or "{" contains a sub-list with all the Tokens between it and the corresponding ")" or "}". This is straightforward. Also, to avoid clutter, I decided that white-space and comments would go on sub-lists of whatever token preceded them.
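One way to build such a hierarchy is with a stack of "currently open" sub-lists. The sketch below is an assumption about how this could be done, not the JavaBeautifier's actual data structure: the classes NestSketch and Token are hypothetical, brackets are not checked for matching types, and the folding of white space and comments onto the preceding token is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of folding a flat token list into sub-lists.
final class NestSketch {

    static final class Token {
        final String text;
        final List<Token> children = new ArrayList<>();
        Token(String text) { this.text = text; }
        @Override public String toString() {
            return children.isEmpty() ? text : text + children;
        }
    }

    static List<Token> nest(List<String> flat) {
        List<Token> top = new ArrayList<>();
        Deque<List<Token>> open = new ArrayDeque<>();
        open.push(top);
        for (String s : flat) {
            Token t = new Token(s);
            // A closer ends the current sub-list and is itself stored
            // next to its opener (unbalanced closers are tolerated).
            if ((s.equals(")") || s.equals("}")) && open.size() > 1) {
                open.pop();
            }
            open.peek().add(t);
            // An opener owns everything up to its matching closer.
            if (s.equals("(") || s.equals("{")) {
                open.push(t.children);
            }
        }
        return top;
    }
}
```

For the tokens of `if(a==2) { foo(); }` the top-level list comes out as "if", "(", ")", "{", "}", with the condition and the body tucked away in the sub-lists of "(" and "{".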

After this arranging has been done, I can scan through the list of Tokens for "if" blocks. If I encounter an "if" token at position "n" I know that the execution block for this "if" begins at Token "n+3". If I find a "{" token there I know that the programmer explicitly delimited his/her block with curly brackets. In fact, I know that the sequence of Tokens will simply be "if", "(", ")", "{" in this case with all the spaces, comments, etc. folded up into sub-lists of these tokens.

When I encounter an "if" I insert a special START_BLOCK token which doesn't print but which I can identify as a place where "{" or "}" could've been put. When I find an "if" block and an "else" block together I fold these up into a single token as well. In this manner the if/else blocks format properly whether or not they explicitly use curly brackets.
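To make the marker idea concrete, here is a small sketch over a top-level token list in which parentheses have already been folded, so an "if" header is always the three tokens "if", "(", ")". It reflects one reading of the description above: the marker is inserted only where the programmer left the curly bracket out. The class BlockMarker and the marker text are hypothetical, and "else" handling is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: mark the start of brace-less "if" bodies.
final class BlockMarker {

    // A non-printing marker in the real tool; plain text here.
    static final String START_BLOCK = "<START_BLOCK>";

    static List<String> markIfBlocks(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int n = 0; n < tokens.size(); n++) {
            // Token n starts a body when it sits right after "if ( )".
            boolean bodyStart = n >= 3
                && tokens.get(n - 3).equals("if")
                && tokens.get(n - 2).equals("(")
                && tokens.get(n - 1).equals(")");
            if (bodyStart && !tokens.get(n).equals("{")) {
                out.add(START_BLOCK);
            }
            out.add(tokens.get(n));
        }
        return out;
    }
}
```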

2 c: The Output

Some of the information we need to do the formatting is not known until we have actually indented and printed out the code preceding it. For example, if I encounter a parenthesis I want the following code to indent to the column that the parenthesis was printed on plus one -- but I can't know where it is until it has actually printed. This means that the column we indent to needs to be saved to a Stack. We push a value on when we encounter a "(" and pop the value back when we encounter a ")".
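The push/pop bookkeeping can be sketched in a few lines. The example below works on raw characters rather than on Tokens, ignores strings and comments, and uses a hypothetical class name ParenIndent, so it is only a simplified model of the idea and not the JavaBeautifier's output routine.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model: indent continuation lines to the column
// just after the most recent unclosed "(".
final class ParenIndent {

    static String reindent(String src) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> columns = new ArrayDeque<>();
        int col = 0; // column where the next character will print
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i++);
            if (c == '\n') {
                out.append('\n');
                // Discard the original indentation of the next line.
                while (i < src.length()
                       && (src.charAt(i) == ' ' || src.charAt(i) == '\t')) {
                    i++;
                }
                int indent = columns.isEmpty() ? 0 : columns.peek();
                for (int k = 0; k < indent; k++) {
                    out.append(' ');
                }
                col = indent;
                continue;
            }
            if (c == '(') {
                columns.push(col + 1); // known only now, while printing
            } else if (c == ')' && !columns.isEmpty()) {
                columns.pop();
            }
            out.append(c);
            col++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reindent("foo(a,\nb(c,\nd),\ne)"));
    }
}
```

The column pushed for the inner "(" depends on how far the outer one had already pushed the line to the right, which is why it cannot be computed before output begins.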

Another situation that we can encounter during output is long lines -- possibly the line became too long as a result of proper indenting. In this circumstance we need to insert a carriage return into the line -- and while the compiler is happy no matter where we insert the carriage return (so long as it is between tokens), not all locations are equally easy on the eyes.

This is where some of our previous work comes in useful. The START_BLOCK token I mentioned before is one such place. If we can insert a carriage return token after that, then that is the location the carriage return goes. At present the JavaBeautifier does not work too hard to shorten lines; it sticks to a few basic rules that will not produce ugly code.

There are, of course, other issues and special situations that each have their formatting needs. However, the goal here was just to provide a sketch of the logic used to write the JavaBeautifier -- not to provide a complete and detailed description.

3: Conclusion and Summary

Now that we have a good grasp of the pieces, a whirlwind high-level summary of what has been done might be in order to have a good map of where we've been. The first phase, hooking the JavaBeautifier into JBuilder consisted of creating three objects: JBeautAction, JBeautWizardPage, JBeautWizard.

The first, JBeautAction, implements the sub-menu item underneath the "Wizards" menu. It provides a facility to enable or disable the beautifier with its update() method.

The second, JBeautWizardPage, implements the GUI, the dialog with the user where parameters for the JavaBeautifier are supplied. It obtains initial values for its options from the project, queries the user to make sure these options are acceptable, then saves the options. The third, JBeautWizard, calls the JavaBeautifier and applies it to the current Node.

Next, the finish() method of the JBeautWizard is called. It obtains the current Node and opens input and output streams to it. Finally, the JavaBeautifier takes over. It first uses a regular expression to break the code down into a set of tokens. It then arranges these tokens in hierarchical lists to simplify the finding of patterns within those tokens. Lastly, it writes out the code, applying formatting options as the output is done.

You can see a screen shot of the JavaBeautifier Wizard's dialog, along with some of the code that it beautified in Fig. BRAN-1.bmp. The screen shot was taken on a Windows NT machine, running JBuilder 3, Enterprise Edition, 3.0.319.0. I hope you enjoy using the JavaBeautifier Wizard and writing your own Wizards for JBuilder.

4: Acknowledgments

I wish to thank Edwin Desouza for inviting me to write this article, Mike Timbol for his helpful advice (he actually wrote the first version of the sections of code used to hook the JavaBeautifier into JBuilder), and finally my wife who had to do a larger share of the Thanksgiving preparations while I worked on this project.


A screen shot: 108K Image

Using JavaBeautifier in JBuilder

Using JavaBeautifier from the command line.

Listing 1:
public class JBeautAction
  extends WizardAction {

  public JBeautAction() {
    super("Java Beautifier", /* what appears on
                              * the "Wizard" menu
                              */
          'B', // Short-cut key
          "Reformat the active "+
            "java source file."); // Help message
    setGalleryWizard(false);
  }
  ....
Listing 2:
// Only enable the "Java Beautifier..."
// menu item if the current
// file is a java source file and if we
// have write permission.
public void update(Object source) {
    // Browser refers to the JBuilder
    // program itself, not any
    // other kind of Browser.
    Browser b = Browser.findBrowser(source);
    if(b != null) {
        // The JBuilder system refers to a file, 
        // whether on disk or inside
        // a zip file, etc., as a "Node".
        Node n = b.getActiveNode();
        if(n instanceof JavaFileNode) {
            JavaFileNode jfn = (JavaFileNode)n;
            if(!jfn.isReadOnly()) {
                setEnabled(true);
                return;
            }
        }
    }
    // No browser, not a java source
    // file, or the file is read-only.
    setEnabled(false);
}
Listing 3:
public class JBeautWizardPage extends BasicWizardPage {
  Project project;
  int tabSize;
  public JBeautWizardPage() {
    // Set basic information for the page
    setPageStyle(STYLE_REGULAR);
    setPageTitle("Set the options for "+
                 "the Java Beautifier");
    setInstructions("some brief "+
                    "instructions.... ");

    // Retrieve the settings used last time
    Browser b = Browser.findBrowser(this);
    project = b.getProjectView()
      .getActiveProject();
    // The property is stored as a String
    tabSize = Integer.parseInt(
      TAB_SIZE.getValue(project));
    ....
  }
  ....
  public final String CATEGORY =
    "JavaBeautifier";
  public final NodeProperty TAB_SIZE =
    new NodeProperty(CATEGORY,"tabSize","4");
  public void checkPage() throws VetoException {
    ....
    if(tabSize < 1 || tabSize > 16) {
      // This will display an error message
      // with an OK button
      // for the user to select.
      JOptionPane
        .showMessageDialog(wizardHost
                     .getDialogParent(),
                     "Error:  Invalid tab size.",
                     "Java Beautifier Error",
                     JOptionPane.ERROR_MESSAGE);
      throw new VetoException();
    }
    // Save the value that passed the check
    TAB_SIZE.setValue(project,
                      Integer.toString(tabSize));
  }
    ....
Listing 4:
if(a==3)
  //  This is a comment
  if(b==2)
    Do_task_1();
    // This is also a comment
  else {
    int x = my_func();
    Do_task_2(x);
  }
Listing 5:
void finish() throws VetoException {
  try {
    JavaFileNode node =
      (JavaFileNode)wizardHost
      .getBrowser()
      .getActiveNode();
    InputStream iStream =
      node.getInputStream();
    String encoding = node.getEncoding();
    InputStreamReader inputReader =
      encoding != null
        ? new InputStreamReader(iStream, encoding)
        : new InputStreamReader(iStream);
    // Note the call to getBuffer() here --
    // without this call one
    // could not undo the changes made
    // by the JavaBeautifier by hitting the
    // undo button on the edit menu.
    OutputStream oStream =
      node.getBuffer().getOutputStream();
    call_the_JavaBeautifier(iStream,oStream);
  }
  catch (Exception ex) {}
}