Leniel Maccaferri's blog: Regex

Showing posts with label Regex. Show all posts

Mp3tag and its useful actions like Replace with regular expression and Guess values

Posted by Leniel Maccaferri on 12/06/2012 11:57:00 AM

Every once in a while I’m in a situation where I need to refresh my mind about how to work with regular expressions and guess values in Mp3tag to batch process lots of MP3 in a single shot… I’ll try to keep this post as a reference for the artifices I use with Mp3tag so that I can get back here and see what and how I did to format my MP3 tags the way they should be.

Mp3tag - The Universal Tag Editor Mp3tag is IMHO the best MP3 tag editor out there. I’m so satisfied with it that I stopped trying to find a better tool. Nonetheless, I’m open to recommendations…

Mp3tag is a powerful and yet easy-to-use tool to edit metadata of common audio formats where it supports ID3v1, ID3v2.3, ID3v2.4, iTunes MP4, WMA, Vorbis Comments and APE Tags.

It can rename files based on the tag information, replace characters or words in tags and filenames, import/export tag information, create playlists and more.

I’ve been writing about MP3 in this blog. If you’re interested, you can check past posts here

As I commented above, to refresh my mind I tend to look for this post I wrote some time ago: More MP3 guessing pattern with Mp3tag in Mac OS but I decided to compile future endeavors in this area in a single post. I hope you enjoy.

This time I’m using Mp3tag in the Windows 8 side inside a Parallels Desktop virtual machine.

OK. After some formalities, let’s complete these five basic steps needed in every use case I’ll present in this post:

1 - Click the change directory button and navigate to the folder where you store the MP3s you want to edit.

Mp3tag Change Directory button

2 - Select the files you want to edit. I just press Ctrl + A to select all the files. You can also hold Ctrl to select file by file.

3 - Click the Actions (Quick) button.

Mp3tag Actions (Quick) button

4 - Follow the use cases…

Removing year between parenthesis from file name

Guessing values for Artist and Title from file name

5 - DO NOT FORGET to click the Save button after executing the actions of each use case so that the changes get applied to the files; otherwise you’ll lose the edits. Confused smile

Mp3tag Save button

Use cases

Removing year between parenthesis from file name

Mp3tag Filename with Year between parenthesis

After clicking the Actions button, select Replace with regular expression, select the _FILENAME field and enter the regular expression $ 2o12 $. Press OK and you’re done:

Mp3tag File name with Year between parenthesis regex

Guessing values for Artist and Title from File name

Mp3tag Guess Values for Artist and Title from File name

As you see the MP3 files don’t have Artist and Title information. This is bad. If you use services like Last.fm to keep track of the music you’ve been listening to, you won’t be able to scrobble given the missing metadata. Of course you can fill the info by hand (Oh Lord! How boring and time consuming this task is). There’s Mp3tag to the rescue.

Taking the previous use case as a necessary step to format the file name accordingly…

After clicking the Actions button, select Guess values, in Source format enter %_filename%, in Guessing pattern enter \%artist% - %title%. Press OK and you’re done:

Mp3tag Guess Values for Artist and Title from File name regex

References

MP3 Tag - The Universal Tag Editor Help pages

JavaScript regex + jQuery to allow only English chars/letters in input textbox

Posted by Leniel Maccaferri on 10/12/2012 12:03:00 PM

Yesterday I answered this question at StackOverflow. The questioner wanted this:

How to allow only English, numeric and special chars in textbox using jQuery or JavaScript?

My answer was this:

$('#mytextbox').bind('keyup blur', function () {
    $(this).val($(this).val().replace(/[^A-Za-z]/g, ''))
});

It has a flaw because if you test it, you’ll see that every time a disallowed character is typed, the keyboard caret/text cursor goes to the end of the textbox. This is something that shouldn’t happen. Aristos mentioned something that I hadn’t tested: “the problem with your answer is that if you try to type something in the middle of the text, the cursor moves to the end.” He is absolutely right. You can test it below. Click the Result tab and type a digit like 7 or some other disallowed char in the middle of the textbox text to spot the problem:

OK – of course there must be a solution to this annoying problem and I couldn't miss the opportunity to play with it to find something that just works as expected. This answer by Ender helped me get there. The following code is what I came up with. I commented the code so it should be easy to understand what's going on:

$("#mytextbox").on("keypress", function (event) {

    // Disallow anything not matching the regex pattern (A to Z uppercase, a to z lowercase and white space)
    // For more on JavaScript Regular Expressions, look here: https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions
    var englishAlphabetAndWhiteSpace = /[A-Za-z ]/g;

    // Retrieving the key from the char code passed in event.which
    // For more info on even.which, look here: http://stackoverflow.com/q/3050984/114029
    var key = String.fromCharCode(event.which);

    //alert(event.keyCode);

    // For the keyCodes, look here: http://stackoverflow.com/a/3781360/114029
    // keyCode == 8  is backspace
    // keyCode == 37 is left arrow
    // keyCode == 39 is right arrow
    // englishAlphabetAndWhiteSpace.test(key) does the matching, that is, test the key just typed against the regex pattern
    if (event.keyCode == 8 || event.keyCode == 37 || event.keyCode == 39 || englishAlphabetAndWhiteSpace.test(key)) {
        return true;
    }

    // If we got this far, just return false because a disallowed key was typed.
    return false;
});

$('#mytextbox').on("paste", function (e) {
    e.preventDefault();
});

Try it below:

Now you can type in the middle of the text and the text cursor should maintain its position.

The above version allows accented chars like á, é, â, ê, etc…

Note also that there’s a new event handler function (paste) attached to the input textbox. This removes the possibility of user pasting text in the textbox.

If you want to allow numbers/digits or any other special characters, it’s just a matter of updating the regular expression pattern. For example: if you want to allow digits from 0 to 9, just use this regex pattern:

var englishAlphabetAndDigits = /[A-Za-z0-9 ]/g;

Full code:

$("#mytextbox").on("keypress", function (event) {

    // Disallow anything not matching the regex pattern (A to Z uppercase, a to z lowercase, digits 0 to 9 and white space)
    // For more on JavaScript Regular Expressions, look here: https://developer.mozilla.org/en-US/docs/JavaScript/Guide/Regular_Expressions
    var englishAlphabetDigitsAndWhiteSpace = /[A-Za-z0-9 ]/g;

    // Retrieving the key from the char code passed in event.which
    // For more info on even.which, look here: http://stackoverflow.com/q/3050984/114029
    var key = String.fromCharCode(event.which);

    //alert(event.keyCode);

    // For the keyCodes, look here: http://stackoverflow.com/a/3781360/114029
    // keyCode == 8  is backspace
    // keyCode == 37 is left arrow
    // keyCode == 39 is right arrow
    // englishAlphabetDigitsAndWhiteSpace.test(key) does the matching, that is, test the key just typed against the regex pattern
    if (event.keyCode == 8 || event.keyCode == 37 || event.keyCode == 39 || englishAlphabetDigitsAndWhiteSpace.test(key)) {
        return true;
    }

    // If we got this far, just return false because a disallowed key was typed.
    return false;
});

$('#mytextbox').on("paste", function (e) {
    e.preventDefault();
});

Here’s an updated jsFiddle that allows digits:

It’s really impressive what some small amount of JavaScript and jQuery code allows you to do on the client side. Actually if you care there are more comments than codez… Surprised smile

There’s no doubt JavaScript and JQuery will reign in the years to come.

About the code
As everything in software the above piece of code needs more testing. If you find a bug or some functionality that is missing, just leave a comment.

References
JavaScript Regular Expressions Guide by Mozzila.org

Using Regular Expressions to correct mistagged MP3

Posted by Leniel Maccaferri on 3/29/2011 07:01:00 PM

Two months have passed since I last posted something here. These were 2 busy months in my life. Hooray, I bought a brand new car and got my driver license, not necessarily in this order.

This post is about something I had planned to write sometime ago… the two screenshots shown here I got maybe 3 months ago. :D So let’s get to it.

As a big and eclectic fan of music that I’m, every now and then I see mistagged MP3 files like the ones with title tags that contain both the artist name and the song name. The following picture shows what I mean:

Picture 1 - Mistagged MP3 files (Title field has both the Artist and Song names)

Here’s where Mp3tag comes to the rescue. As you see in Picture 1, I’m using the dialog “Replace with regular expression”. Read my previous post about this great piece of software called Mp3tag to see how to get to this dialog. It has 3 fields that you must fill to make some magic happen allowing you to correct those wrongly tagged/mistagged MP3 titles all at once. Ha! You won’t lose your precious time correcting MP3 by MP3. I know that this is boring and that’s why I desperately searched for a solution. I know that if you’re reading this, you’re probably in the same situation and you just found a solution. :)

I’ve chosen the Field TITLE since it’s the problematic field in this particular case. Now the most important part, the so called Regex or Regular expression: (.*) - (.*). This thing means that we’re gonna separate the MP3 Title field in two parts. One part will have everything (.*) before the hyphen - and the other part will contain everything after the hyphen (.*).

Example:

Dru Hill - Away (Prod. by B.Cox) (Full + NoShout) (2010)

The regex (.*) - (.*) will separate the MP3 title above in two parts…

$1 = Dru Hill
$2 = Away (Prod. by B.Cox) (Full + NoShout) (2010)

The Replace matches with field has the value $2 because in this case I want to replace/substitute the MP3 Title with only the Song/Track name (the 2nd part/match of the regex above). If instead I wanted to keep the Artist name in the Title tag (D'oh!, not something I’d want to do), I’d write $1 in this field.

Now, take a look at Picture 2. When you click OK, this is the end result/magic you get. Nice and correctly titled/tagged MP3 files. The way I wanted them to be.

Picture 2 - Correctly tagged MP3 files

To make things last forever, do not forget to click the Save button present in Mp3tag’s toolbar or in the Save tag option present in the File menu. I like to press Ctrl+S as a shortcut.

If you want to learn the basics about regular expressions to use with Mp3tag, check this out: http://help.mp3tag.de/options_format.html#regexp

As you see, using regexes (one of the most powerful features of computers) you can make any kind of change to your MP3 tags like for example removing that (2010) present in each MP3 Title field above. That 2010 should be in its proper MP3 tag, namely the Year tag. Don’t ya think?

Hope this simple process helps someone out there keep an organized MP3 library as I do like to keep mine.

Note
Mp3tag is a Windows only application as is Windows Live Writer that I use to write these blog posts. I use/run it through Parallels Desktop on my Mac mini. Read this post to get more info about how to run Windows side by side with your Mac OS.

Regex matching and naming groups in C#

Posted by Leniel Maccaferri on 9/26/2009 06:52:00 PM

Let’s say you have a string and want to match individual groups of characters within that string. How would you do that? That’s the question I asked myself.
The following code is the outcome of some research on how to get this done. It shows you how to capture/match and name groups of characters using a regex pattern.

class RegexGroupsNames
{
    public static void Main()
    {
        // String to be parsed
        string str = "777L777_333_4444_55555_22_20090926_1727_666666_999999999_1010101010";

        // Regex pattern
        // Here we define the groups that form our string according to our need.
        // Each group has a name so that it's easier to get the individual values.
        string pattern =
               @"(?<group1>\d{3}[A-Z]\d{3})_(?<group2>\d{3})_(?<group3>\d{4})_(?<group4>\d{5})_(?<group5>\d{2})_(?<group6>\d{8})_(?<group7>\d{4})_(?<group8>\d{6})_(?<group9>\d{9})_(?<group10>\d{10})";

        // Creating the Regex used to parse the string with the pattern defined above
        Regex regex = new Regex(pattern);

        // String is a match or not ?
        Console.WriteLine("{0} {1} a valid string.", str, Regex.IsMatch(str, pattern, RegexOptions.IgnoreCase) ? "is" : "is not");

        // Matching the string and getting the groups named above
        Match match = regex.Match(str);

        // Writing the values of each group
        Console.WriteLine("group1  = {0}", match.Groups["group1"].Value);
        Console.WriteLine("group2  = {0}", match.Groups["group2"].Value);
        Console.WriteLine("group3  = {0}", match.Groups["group3"].Value);
        Console.WriteLine("group4  = {0}", match.Groups["group4"].Value);
        Console.WriteLine("group5  = {0}", match.Groups["group5"].Value);
        Console.WriteLine("group6  = {0}", match.Groups["group6"].Value);
        Console.WriteLine("group7  = {0}", match.Groups["group7"].Value);
        Console.WriteLine("group8  = {0}", match.Groups["group8"].Value);
        Console.WriteLine("group9  = {0}", match.Groups["group9"].Value);
        Console.WriteLine("group10 = {0}", match.Groups["group10"].Value);

        // Defining the Culture to show the DateTime
        IFormatProvider culture = new CultureInfo("en-US", true);

        // Creating a DateTime variable with the data contained within groups 6 and 7
        DateTime dt = DateTime.ParseExact(match.Groups["group6"].Value + match.Groups["group7"].Value, "yyyyMMddHHmm", culture);

        Console.WriteLine(dt);
    }
}

This is the string to be parsed:

"777L777_333_4444_55555_22_20090926_1727_666666_999999999_1010101010"

It has 10 parts “groups” separated by an underscore character ( _ ).

What we want to do is to extract each individual group so that we can manipulate it anyway we want.

To accomplish that we define a regex that has the following pattern:

@"(?<group1>\d{3}[A-Z]\d{3})_(?<group2>\d{3})_(?<group3>\d{4})_(?<group4>\d{5})_(?<group5>\d{2})_(?<group6>\d{8})_(?<group7>\d{4})_(?<group8>\d{6})_(?<group9>\d{9})_(?<group10>\d{10})";

Let’s dissect the regex…

The 1st group denoted by the first pair of round brackets ( ) will look for 3 digits \d{3} followed by an uppercase letter ranging from A through Z [A-Z] followed by another 3 digits \d{3}.
The 2nd group denoted by the second pair of parenthesis will look for 3 digits, the 3rd group will look for 4 digits, the 4th group will look for 5 digits and so on… I think you got the point!  :- )

While playing with this I found an interesting thing, that is, you can give a name to each group. In this case, the name goes inside the angle brackets < > preceded by a question mark. I’ve given the name <group1> to the first group and just incremented the final number in the others.

Naming your groups is great because you can refer to them while manipulating the matched groups without having to remember the exact position inside the string. Instead of doing this:

// Writing the values of each group
Console.WriteLine(match.Groups[0].Value);
Console.WriteLine(match.Groups[1].Value);
Console.WriteLine(match.Groups[2].Value);
Console.WriteLine(match.Groups[3].Value);
Console.WriteLine(match.Groups[4].Value);
Console.WriteLine(match.Groups[5].Value);
Console.WriteLine(match.Groups[6].Value);
Console.WriteLine(match.Groups[7].Value);
Console.WriteLine(match.Groups[8].Value);
Console.WriteLine(match.Groups[9].Value);

We can do this:

// Writing the values of each group
Console.WriteLine(match.Groups["group1"].Value);
Console.WriteLine(match.Groups["group2"].Value);
Console.WriteLine(match.Groups["group3"].Value);
Console.WriteLine(match.Groups["group4"].Value);
Console.WriteLine(match.Groups["group5"].Value);
Console.WriteLine(match.Groups["group6"].Value);
Console.WriteLine(match.Groups["group7"].Value);
Console.WriteLine(match.Groups["group8"].Value);
Console.WriteLine(match.Groups["group9"].Value);
Console.WriteLine(match.Groups["group10"].Value);

Isn’t it cool!?

With this you can create your regex pattern and match the groups of characters that interest you the most.

Grouping enables you to work with separate sets of data. Naming each group enables you to refer to each one of them easily.

This is the output of the code:

Regular Expression Grouping and Naming

Hope this helps.

References
RegExLib.com - Regular Expression Library
http://regexlib.com/

Silverlight Regular Expression Tester
http://regexlib.com/RESilverlight.aspx

Regex engine in C# - matching strings

Posted by Leniel Maccaferri on 5/11/2009 02:55:00 AM

To close this series of posts, today I’m going to match some input strings using the regex (l|e)*n?(i|e)el* that we’ve been using since the beginning.

To match the strings I’ll make use of the DFA we constructed in the last post titled Regex engine in C# - the DFA.

These are the 20 strings I’ll be matching:

eee, eeeil, eel, ennil, ie, leie, lele, leleel, lelel, lelenil, leliel, leniel, llnel, ln, lnel, lniel, nelll, niel, nil and nll.

In the DFA class we have a method called Simulate which I show bellow:

public string Simulate(string @in)
{
  state currentState = start;

  CharEnumerator i = @in.GetEnumerator();

  while(i.MoveNext())
  {
    KeyValuePair<state, input> transition = new KeyValuePair<state, input>(currentState, i.Current);

    if(!transTable.ContainsKey(transition))
      return "Rejected";

    currentState = transTable[transition];
  }

  if(final.Contains(currentState))
    return "Accepted";
  else
    return "Rejected";
}

The Simulate method takes as input a string to be matched and returns a string that signalizes success or failing when matching such a string.

To test it I’ll match the string “leniel” which by the way is my own name. :-)

So, what the Simulate method do?

It starts by assigning the start state 0 to the variable currentState.

Next we get a charEnumerator that is used in the while block to move letter by letter (input symbol by input symbol) till we get to the end of the string.

We declare a KeyValuePair<state, input> designated transition that has as the key the currentState and as the value the current input symbol we’re simulating.

We check to see if the DFA’s transition table contains such a transition, that is, if there’s a valid path from that state with that input symbol to another state. If it doesn’t, we reject the string we’re matching, otherwise we make the currentState variable receive the next state, that is, the state appointed by the transition we’ve just checked.

The process inside the while block goes on until we reach the last input symbol taken from the string we’re matching, in this case, the last “l” letter.

After getting out of the while block we make a final check to see if the state we achieved is part of the DFA’s set of final states. If it is, we accept the string, otherwise, we reject it.

This is an x-ray from the variables’ value when in the first iteration of the while block:

Regex Parser Matching the string "leniel"

As you can see in the transition table, from start state (currentState) “0” with the fist input symbol “l” we can go to state “3”.

The following table shows the result obtained while testing the strings mentioned above:

Accepted	Rejected
eee	eeeil
eel	ennil
ie	lele
leie	lelel
leleel	lelenil
leliel	llnel
leniel	ln
lniel	lnel
niel	nelll
	nil
	nll

To make sure it’s correct, debug each one of these strings visually looking at the DFA’s graph representation shown below. Starting at state 0 we must end in one of the final states {7, 8, 9, 10}.

The Regex Engine executable
The Regex Engine presented in this series of posts is a C# Console Application. As such it was written in a way that its command line arguments must be entered in the following form:

RegularExpressionEngine "(l|e)*n?(i|e)el*" leniel

Where:

RegularExpressionEngine -> the name of the executable .exe file (the program itself)

"(l|e)*n?(i|e)el*" –> between the double quotes we pass the regex

leniel –> the string we want to match in the regex

Using Microsoft Visual Studio C# 2008 we can set the command line arguments using the Debug properties page of the project like the following picture:

Regex Expression Engine Debug Properties Page

To get to the Debug page, right click in the solution inside Solution Explorer as shown in the following picture:

Regex Expression Engine Solution Properties

After setting the command line arguments, hit F5 and you’re ready to go.

The Regex Engine source code
You can get the complete code (Microsoft Visual C# 2008 Console application) and executable at:

http://leniel.googlepages.com/RegularExpressionEngine.rar

To try out the code you can use the free Microsoft Visual C# 2008 Express Edition that you can get at: http://www.microsoft.com/express/vcsharp

Updated on 5/12/2009 10:06:00 PM

As I finished writing the posts, here goes the list that points to them:

Regular Expression Engine in C# (the Story)
Regex engine in C# - the Regex Parser
Regex engine in C# - the NFA
Regex engine in C# - the DFA

Source code: https://github.com/leniel/RegexEngine

References
The following links can help you when dealing with regexes:

Regular-Expressions.info - Regex Tutorial, Examples and Reference - Regex Patterns
http://www.regular-expressions.info

Regular Expression Library (great site with lots of regexes and an excellent regex tester)
http://regexlib.com/Default.aspx
http://regexlib.com/RETester.aspx

Regex engine in C# - the DFA

Posted by Leniel Maccaferri on 5/03/2009 03:21:00 AM

This time, we’ll take a look at the DFA’s class and its helper class called SubsetMachine.

To understand what’s a DFA, refer to the first post in this series called Regex engine in C# - the Regex Parser.

In the Regex engine in C# - the NFA post we ended with an NFA.

Now we’re going to build a DFA based on such NFA.

Remember that the main difference between a DFA and an NFA is that a DFA doesn’t have epsilon (ε) transitions that represent "nothing" or "no input" between states.

As described in the section DFA versus NFA in the introduction of this series of posts, it may be shown that a DFA is equivalent to an NFA, in that, for any given NFA, one may construct an equivalent DFA, and vice-versa: this is the powerset construction or subset construction.

So, let’s get our hands dirty with some code.

Below I present the DFA class:

//
//  Regular Expression Engine C# Sample Application
//  2006, by Leniel Braz de Oliveira Macaferi & Wellington Magalhães Leite.
//
//  UBM's Computer Engineering - 7th term [http://www.ubm.br/]
//  
//  This program sample was developed and turned in as a term paper for Lab. of
//  Compilers Construction. It was based on the source code provided by Eli Bendersky
//  [http://eli.thegreenplace.net/] and is provided "as is" without warranty.
//

using System;
using SCG = System.Collections.Generic;
using C5;

using state = System.Int32;
using input = System.Char;

namespace RegularExpressionEngine
{
  /// <summary>
  /// Implements a deterministic finite automata (DFA)
  /// </summary>
  class DFA
  {
    // Start state
    public state start;
    // Set of final states
    public Set<state> final;
    // Transition table
    public SCG.SortedList<KeyValuePair<state, input>, state> transTable;

    public DFA()
    {
      final = new Set<state>();

      transTable = new SCG.SortedList<KeyValuePair<state, input>, state>(new Comparer());
    }

    public string Simulate(string @in)
    {
      state curState = start;

      CharEnumerator i = @in.GetEnumerator();

      while(i.MoveNext())
      {
        KeyValuePair<state, input> transition = new KeyValuePair<state, input>(curState, i.Current);

        if(!transTable.ContainsKey(transition))
          return "Rejected";

        curState = transTable[transition];
      }

      if(final.Contains(curState))
        return "Accepted";
      else
        return "Rejected";
    }

    public void Show()
    {
      Console.Write("DFA start state: {0}\n", start);
      Console.Write("DFA final state(s): ");

      SCG.IEnumerator<state> iE = final.GetEnumerator();

      while(iE.MoveNext())
        Console.Write(iE.Current + " ");

      Console.Write("\n\n");

      foreach(SCG.KeyValuePair<KeyValuePair<state, input>, state> kvp in transTable)
        Console.Write("Trans[{0}, {1}] = {2}\n", kvp.Key.Key, kvp.Key.Value, kvp.Value);
    }
  }

  /// <summary>
  /// Implements a comparer that suits the transTable SordedList
  /// </summary>
  public class Comparer : SCG.IComparer<KeyValuePair<state, input>>
  {
    public int Compare(KeyValuePair<state, input> transition1, KeyValuePair<state, input> transition2)
    {
      if(transition1.Key == transition2.Key)
        return transition1.Value.CompareTo(transition2.Value);
      else
        return transition1.Key.CompareTo(transition2.Key);
    }
  }

}

As you see, a DFA has 3 variables: a start state, a set of final states and a transition table that maps transitions between states.

Below I present the SubsetMachine class that is responsible for the hard work of extracting an equivalent DFA from a given NFA:

//
//  Regular Expression Engine C# Sample Application
//  2006, by Leniel Braz de Oliveira Macaferi & Wellington Magalhães Leite.
//
//  UBM's Computer Engineering - 7th term [http://www.ubm.br/]
//  
//  This program sample was developed and turned in as a term paper for Lab. of
//  Compilers Construction. It was based on the source code provided by Eli Bendersky
//  [http://eli.thegreenplace.net/] and is provided "as is" without warranty.
//

using System;
using SCG = System.Collections.Generic;
using C5;

using state = System.Int32;
using input = System.Char;

namespace RegularExpressionEngine
{
  class SubsetMachine
  {
    private static int num = 0;

    /// <summary>
    /// Subset machine that employs the powerset construction or subset construction algorithm.
    /// It creates a DFA that recognizes the same language as the given NFA.
    /// </summary>
    public static DFA SubsetConstruct(NFA nfa)
    {
      DFA dfa = new DFA();

      // Sets of NFA states which is represented by some DFA state
      Set<Set<state>> markedStates = new Set<Set<state>>();
      Set<Set<state>> unmarkedStates = new Set<Set<state>>();

      // Gives a number to each state in the DFA
      HashDictionary<Set<state>, state> dfaStateNum = new HashDictionary<Set<state>, state>();

      Set<state> nfaInitial = new Set<state>();
      nfaInitial.Add(nfa.initial);

      // Initially, EpsilonClosure(nfa.initial) is the only state in the DFAs states and it's unmarked.
      Set<state> first = EpsilonClosure(nfa, nfaInitial);
      unmarkedStates.Add(first);

      // The initial dfa state
      state dfaInitial = GenNewState();
      dfaStateNum[first] = dfaInitial;
      dfa.start = dfaInitial;

      while(unmarkedStates.Count != 0)
      {
        // Takes out one unmarked state and posteriorly mark it.
        Set<state> aState = unmarkedStates.Choose();

        // Removes from the unmarked set.
        unmarkedStates.Remove(aState);

        // Inserts into the marked set.
        markedStates.Add(aState);

        // If this state contains the NFA's final state, add it to the DFA's set of
        // final states.
        if(aState.Contains(nfa.final))
          dfa.final.Add(dfaStateNum[aState]);

        SCG.IEnumerator<input> iE = nfa.inputs.GetEnumerator();

        // For each input symbol the nfa knows...
        while(iE.MoveNext())
        {
          // Next state
          Set<state> next = EpsilonClosure(nfa, nfa.Move(aState, iE.Current));

          // If we haven't examined this state before, add it to the unmarkedStates and make up a new number for it.
          if(!unmarkedStates.Contains(next) && !markedStates.Contains(next))
          {
            unmarkedStates.Add(next);
            dfaStateNum.Add(next, GenNewState());
          }

          KeyValuePair<state, input> transition = new KeyValuePair<state, input>();
          transition.Key = dfaStateNum[aState];
          transition.Value = iE.Current;

          dfa.transTable[transition] = dfaStateNum[next];
        }
      }

      return dfa;
    }

    /// <summary>
    /// Builds the Epsilon closure of states for the given NFA 
    /// </summary>
    /// <param name="nfa"></param>
    /// <param name="states"></param>
    /// <returns></returns>
    static Set<state> EpsilonClosure(NFA nfa, Set<state> states)
    {
      // Push all states onto a stack
      SCG.Stack<state> uncheckedStack = new SCG.Stack<state>(states);

      // Initialize EpsilonClosure(states) to states
      Set<state> epsilonClosure = states;

      while(uncheckedStack.Count != 0)
      {
        // Pop state t, the top element, off the stack
        state t = uncheckedStack.Pop();

        int i = 0;

        // For each state u with an edge from t to u labeled Epsilon
        foreach(input input in nfa.transTable[t])
        {
          if(input == (char)NFA.Constants.Epsilon)
          {
            state u = Array.IndexOf(nfa.transTable[t], input, i);

            // If u is not already in epsilonClosure, add it and push it onto stack
            if(!epsilonClosure.Contains(u))
            {
              epsilonClosure.Add(u);
              uncheckedStack.Push(u);
            }
          }

          i = i + 1;
        }
      }

      return epsilonClosure;
    }

    /// <summary>
    /// Creates unique state numbers for DFA states
    /// </summary>
    /// <returns></returns>
    private static state GenNewState()
    {
      return num++;
    }

  }
}

In the first post of this series we see the following line of code:

DFA dfa = SubsetMachine.SubsetConstruct(nfa);

The SubsetConstruct method from the SubsetMachine class receives as input an NFA and returns a DFA.

Inside the SubsetConstruct method we firstly instantiate a new DFA object and then we create two variables markedStates and unmarkedStates that are sets of NFA states which represent a DFA state.

// Sets of NFA states which is represented by some DFA state
Set<Set<state>> markedStates = new Set<Set<state>>();
Set<Set<state>> unmarkedStates = new Set<Set<state>>();

From this we see that a DFA state can represent a set of NFA states. Take a look at the introductory post and see Figure 2. It shows two DFA states that represent sets of NFA states, in this particular case the DFA final states represent the NFA states {s2, s3} and {s5, s6}.

The HashDictionary helps us to give a name (to number) each DFA state.

// Gives a number to each state in the DFA
HashDictionary<Set<state>, state> dfaStateNum = new HashDictionary<Set<state>, state>();

We declare a variable called nfaInitial that is a set of states. It receives the initial NFA state:

Set<state> nfaInitial = new Set<state>();
nfaInitial.Add(nfa.initial);

We’ll start using the EpsilonClosure function.

// Initially, EpsilonClosure(nfa.initial) is the only state in the DFAs states and it's unmarked.
Set<state> first = EpsilonClosure(nfa, nfaInitial);

The EpsilonClosure function receives as parameters the NFA and its initial state and returns a set of states. Take a look at the method signature:

static Set<state> EpsilonClosure(NFA nfa, Set<state> states)

So, what does it do? You may ask. To answer this question let’s debug this first method call:

From the NFA transition table presented in Figure 2 and from the transition graph presented in Figure 3 in the second post of this series we can see how many transitions are represented by eps transitions.

The first time we enter into this function we’ll get as a return value a set of states that contains all the states that are reachable with an eps transition from the start state 0.

Figure 1 - States reachable by an eps transition from start state 0.

For the sake of comparison I’ll show the NFA’s graph representation for the regex (l|e)*n?(i|e)el* that we’re studying since the beginning of this series.

Figure 2 - NFA’s graph representation for the regex (l|e)*n?(i|e)el*

If you pay close attention you’ll see that the order the regex parser found the states is the order we visually debug the code looking at the graph above.

With such states found we move next adding this DFA state into the variable unmarkedStates.

We then use a function called GetNewState that is responsible for generating a number that uniquely identifies each state of the DFA:

// The initial dfa state
state dfaInitial = GenNewState();

When we pass to the next line of code we add to the dfaStateNum dictionary a key that is the set of states returned by the EpsilonClosure function and a value that is the name of the initial state of the DFA.

dfaStateNum[first] = dfaInitial;

We make the initial state of the DFA be the dfaInitial value we just got.

dfa.start = dfaInitial;

Next we enter in the first while keyword. In this while we basically extract one of the unmarkedStates and add the same to the markedStates set. This has the meaning of telling that we already checked such state.

// Takes out one unmarked state and posteriorly mark it.
Set<state> aState = unmarkedStates.Choose();

// Removes from the unmarked set.
unmarkedStates.Remove(aState);

// Inserts into the marked set.
markedStates.Add(aState);

In the next line of code (one of the most interesting parts of the whole code) we check to see if this current DFA state (remember that it is a set of states) we’re on contains the NFA final state, if it holds true, we add it to the DFA’s set of final states:

// If this state contains the NFA's final state, add it to the DFA's set of final states.
if(aState.Contains(nfa.final))
  dfa.final.Add(dfaStateNum[aState]);

Now it’s time to check against the NFA’s input symbols. To accomplish this we declare an enumerator of type state that does the job of moving through each of the input symbols in the next while code block:

SCG.IEnumerator<input> iE = nfa.inputs.GetEnumerator();

// For each input symbol the nfa knows...
while(iE.MoveNext())
{ . . .

Now it’s time to create the next DFA state. We do this by declaring a new set of states and we call the EpsilonClosure function again to fill this state, but this time we pass the EpsilonClosure function a different second parameter.

// Next state
Set<state> next = EpsilonClosure(nfa, nfa.Move(aState, iE.Current));

Let’s go deeper to take a look at this second parameter.

As you see we call the function Move that is part of the NFA class. This function receives as parameters a set of states and an input symbol to be checked against. It returns a set of states.

What the move function does is: foreach state in the set of states passed as the first parameter we check each transition present in the NFA’s transition table from this state to another state with the input symbol passed as the second parameter.

So, the first time we pass we get the following output from the Move function:

NFAMoveFunction

Figure 3 - Result from the NFA’s Move function the 1st time it’s called

If we look at Figure 2 we can assert that from the states present in the first state of the DFA (see Figure 1) we can move to states {5, 16} with the first NFA input that is equal to ‘e’.

With the above result taken from the Move function we’re ready to go the EpsilonClosure function for the second time to create the 2nd DFA state in the SubsetMachine class. This second time we get the following result from EpsilonClosure function:

EpsilonClosureFunction2

Figure 4 - Result from the EpsilonClosure function the 2nd time it’s called

Now, if you pay close attention, we can assert that starting at the states {5, 16} we can move with an eps-transition to the states shown above. Remember that the states we pass to the EpsilonClosure function are themselves included in the result returned by the function.

Now that we have created the 2nd DFA state we check to see if it wasn’t examined yet and if it holds true we add it to the unmarkedStates variable and give a new name to this state numbering it with the GenNewState function.

// If we haven't examined this state before, add it to the unmarkedStates and make up a new number for it.
if(!unmarkedStates.Contains(next) && !markedStates.Contains(next))
{
  unmarkedStates.Add(next);
  dfaStateNum.Add(next, GenNewState());
}

Now the best part of it. :)

We create a new transition that has as key the number of the DFA state we’re checking and as the value the current input symbol we’re after.

KeyValuePair<state, input> transition = new KeyValuePair<state, input>();
transition.Key = dfaStateNum[aState];
transition.Value = iE.Current;

We then add this transition to the DFA’s transition table:

DFATransitionTable

Figure 5 - DFA’s transition table

This has the following meaning: from state 0 with input ‘e’ go to state 1!

These are the subsequent values we get for the first unmarkedState we’re checking:

With input ‘i’ we can go to state { 14 } from which with an eps transition we can go to state { 17 }.

With input ‘l’ we can go to state { 3 } from which with an eps transition we can go to states { 4, 13, 8, 3, 12, 7, 2, 11, 6, 1, 15, 10 }.

With input ‘n’ we can go to state { 9 } from which with an eps transition we can go to states { 12, 9, 13, 15 }.

A point that deserves consideration is that each time you run the regex parser it’s not guaranteed that the numbers that identify the DFA states will remain the same.

I won’t continue debugging because it would consume a lot of space in this blog post.

I think that with the above explanation it’s easy to get the point.

In short we’ll repeat the above steps for each unmarked state that hasn’t been checked yet working with it against each input symbol.

For the regex (l|e)*n?(i|e)el* in one of the times I ran the code, I got the following DFA’s transition table:

DFA start state: 0
DFA final state(s): 7 8 9 10

Trans[0, e] = 1

Trans[0, i] = 2

Trans[0, l] = 3

Trans[0, n] = 4

Trans[1, e] = 7

Trans[1, i] = 2

Trans[1, l] = 3

Trans[1, n] = 4

Trans[2, e] = 8

Trans[2, i] = 6

Trans[2, l] = 6

Trans[2, n] = 6

Trans[3, e] = 1

Trans[3, i] = 2

Trans[3, l] = 3

Trans[3, n] = 4

Trans[4, e] = 5

Trans[4, i] = 2

Trans[4, l] = 6

Trans[4, n] = 6

Trans[5, e] = 8

Trans[5, i] = 6

Trans[5, l] = 6

Trans[5, n] = 6

Trans[6, e] = 6

Trans[6, i] = 6

Trans[6, l] = 6

Trans[6, n] = 6

Trans[7, e] = 7

Trans[7, i] = 2

Trans[7, l] = 10

Trans[7, n] = 4

Trans[8, e] = 6

Trans[8, i] = 6

Trans[8, l] = 9

Trans[8, n] = 6

Trans[9, e] = 6

Trans[9, i] = 6

Trans[9, l] = 9

Trans[9, n] = 6

Trans[10, e] = 1

Trans[10, i] = 2

Trans[10, l] = 10

Trans[10, n] = 4

Figure 6 - DFA’s transition table for the regex (l|e)*n?(i|e)el*

Below is the DFA’s graph representation:

DFA for the Regex (l|e)*n?(i|e)el*

Figure 7 - DFA’s graph representation for the regex (l|e)*n?(i|e)el*

In the next post I’ll simulate some input strings against this DFA to assert its validity.

See you there!

Updated on 5/12/2009 09:57:00 PM

As I finished writing the posts, here goes the list that points to them:

Regular Expression Engine in C# (the Story)
Regex engine in C# - the Regex Parser
Regex engine in C# - the NFA
Regex engine in C# - matching strings

Source code: https://github.com/leniel/RegexEngine

Regex engine in C# - the NFA

Posted by Leniel Maccaferri on 4/13/2009 01:04:00 AM

As promised, let’s dive into the NFA’s class. To understand what’s an NFA, refer to the first post in this series called Regex engine in C# - the Regex Parser.

Last time we ended with the main Regex Engine class (RegexParser) which delegates to the NFA class the logic for constructing the NFA based on the parse tree we got on the last post.

The following is the NFA class:

//
//  Regular Expression Engine C# Sample Application
//  2006, by Leniel Braz de Oliveira Macaferi & Wellington Magalhães Leite.
//
//  UBM's Computer Engineering - 7th term [http://www.ubm.br/]
//  
//  This program sample was developed and turned in as a term paper for Lab. of
//  Compilers Construction. It was based on the source code provided by Eli Bendersky
//  [http://eli.thegreenplace.net/] and is provided "as is" without warranty.
//
//  It makes use of C5 collections library (see reference at the end of this post.)
//

using System;
using SCG = System.Collections.Generic;
using C5;

using state = System.Int32;
using input = System.Char;

namespace RegularExpressionEngine
{

  /// <summary>
  /// Implements a non-deterministic finite automata
  /// </summary>
  class NFA
  {
    public state initial;
    public state final;
    private int size;
    // Inputs this NFA responds to
    public SortedArray<input> inputs;
    public input[][] transTable;

    /// <summary>
    /// Provides default values for epsilon and none
    /// </summary>
    public enum Constants
    {
      Epsilon = 'ε',
      None = '\0'
    }

    public NFA(NFA nfa)
    {
      initial = nfa.initial;
      final = nfa.final;
      size = nfa.size;
      inputs = nfa.inputs;
      transTable = nfa.transTable;
    }

    /// <summary>
    /// Constructed with the NFA size (amount of states), the initial state and the
    /// final state
    /// </summary>
    /// <param name="size_">Amount of states.</param>
    /// <param name="initial_">Initial state.</param>
    /// <param name="final_">Final state.</param>
    public NFA(int size_, state initial_, state final_)
    {
      initial = initial_;
      final = final_;
      size = size_;

      IsLegalState(initial);
      IsLegalState(final);

      inputs = new SortedArray<input>();

      // Initializes transTable with an "empty graph", no transitions between its
      // states
      transTable = new input[size][];

      for(int i = 0; i < size; ++i)
        transTable[i] = new input[size];
    }

    public bool IsLegalState(state s)
    {
      // We have 'size' states, numbered 0 to size-1
      if(s < 0 || s >= size)
        return false;

      return true;
    }

    /// <summary>
    /// Adds a transition between two states.
    /// </summary>
    /// <param name="from"></param>
    /// <param name="to"></param>
    /// <param name="in"></param>
    public void AddTrans(state from, state to, input @in)
    {
      IsLegalState(from);
      IsLegalState(to);

      transTable[from][to] = @in;

      if(@in != (char)Constants.Epsilon)
        inputs.Add(@in);
    }

    /// <summary>
    /// Fills states 0 up to other.size with other's states.
    /// </summary>
    /// <param name="other"></param>
    public void FillStates(NFA other)
    {
      for(state i = 0; i < other.size; ++i)
        for(state j = 0; j < other.size; ++j)
          transTable[i][j] = other.transTable[i][j];

      SCG.IEnumerator<input> cE = other.inputs.GetEnumerator();

      while(cE.MoveNext())
        inputs.Add(cE.Current);
    }

    /// <summary>
    /// Renames all the NFA's states. For each nfa state: number += shift.
    /// Functionally, this doesn't affect the NFA, it only makes it larger and renames
    /// its states.
    /// </summary>
    /// <param name="shift"></param>
    public void ShiftStates(int shift)
    {
      int newSize = size + shift;

      if(shift < 1)
        return;

      // Creates a new, empty transition table (of the new size).
      input[][] newTransTable = new input[newSize][];

      for(int i = 0; i < newSize; ++i)
        newTransTable[i] = new input[newSize];

      // Copies all the transitions to the new table, at their new locations.
      for(state i = 0; i < size; ++i)
        for(state j = 0; j < size; ++j)
          newTransTable[i + shift][j + shift] = transTable[i][j];

      // Updates the NFA members.
      size = newSize;
      initial += shift;
      final += shift;
      transTable = newTransTable;
    }

    /// <summary>
    /// Appends a new, empty state to the NFA.
    /// </summary>
    public void AppendEmptyState()
    {
      transTable = Resize(transTable, size + 1);

      size += 1;
    }

    private static input[][] Resize(input[][] transTable, int newSize)
    {
      input[][] newTransTable = new input[newSize][];

      for(int i = 0; i < newSize; ++i)
        newTransTable[i] = new input[newSize];

      for(int i = 0; i <= transTable.Length - 1; i++)
        for(int j = 0; j <= transTable[i].Length - 1; j++)
        {
          if(transTable[i][j] != '\0')
            newTransTable[i][j] = transTable[i][j];
        }

      return newTransTable;
    }

    /// <summary>
    /// Returns a set of NFA states from which there is a transition on input symbol
    /// inp from some state s in states.
    /// </summary>
    /// <param name="states"></param>
    /// <param name="inp"></param>
    /// <returns></returns>
    public Set<state> Move(Set<state> states, input inp)
    {
      Set<state> result = new Set<state>();

      // For each state in the set of states
      foreach(state state in states)
      {
        int i = 0;

        // For each transition from this state
        foreach(input input in transTable[state])
        {
          // If the transition is on input inp, add it to the resulting set
          if(input == inp)
          {
            state u = Array.IndexOf(transTable[state], input, i);
            result.Add(u);
          }

          i = i + 1;
        }
      }

      return result;
    }

    /// <summary>
    /// Prints out the NFA.
    /// </summary>
    public void Show()
    {
      Console.WriteLine("This NFA has {0} states: 0 - {1}", size, size - 1);
      Console.WriteLine("The initial state is {0}", initial);
      Console.WriteLine("The final state is {0}\n", final);

      for(state from = 0; from < size; ++from)
      {
        for(state to = 0; to < size; ++to)
        {
          input @in = transTable[from][to];

          if(@in != (char)Constants.None)
          {
            Console.Write("Transition from {0} to {1} on input ", from, to);

            if(@in == (char)Constants.Epsilon)
              Console.Write("Epsilon\n");
            else
              Console.Write("{0}\n", @in);
          }
        }
      }
      Console.Write("\n\n");
    }

    /// <summary>
    /// 
    /// </summary>
    /// <param name="tree"></param>
    /// <returns></returns>
    public static NFA TreeToNFA(ParseTree tree)
    {
        switch (tree.type)
        {
            case ParseTree.NodeType.Chr:
                return BuildNFABasic(tree.data.Value);
            case ParseTree.NodeType.Alter:
                return BuildNFAAlter(TreeToNFA(tree.left), TreeToNFA(tree.right));
            case ParseTree.NodeType.Concat:
                return BuildNFAConcat(TreeToNFA(tree.left), TreeToNFA(tree.right));
            case ParseTree.NodeType.Star:
                return BuildNFAStar(TreeToNFA(tree.left));
            case ParseTree.NodeType.Question:
                return BuildNFAAlter(TreeToNFA(tree.left), BuildNFABasic((char)Constants.Epsilon));
            default:
                return null;
        }
    }

    /////////////////////////////////////////////////////////////////
    //
    // NFA building functions
    //
    // Using Thompson Construction, build NFAs from basic inputs or 
    // compositions of other NFAs.
    //

    /// <summary>
    /// Builds a basic, single input NFA
    /// </summary>
    /// <param name="in"></param>
    /// <returns></returns>
    public static NFA BuildNFABasic(input @in)
    {
      NFA basic = new NFA(2, 0, 1);

      basic.AddTrans(0, 1, @in);

      return basic;
    }

    /// <summary>
    /// Builds an alternation of nfa1 and nfa2 (nfa1|nfa2)
    /// </summary>
    /// <param name="nfa1"></param>
    /// <param name="nfa2"></param>
    /// <returns></returns>
    public static NFA BuildNFAAlter(NFA nfa1, NFA nfa2)
    {
      // How this is done: the new nfa must contain all the states in
      // nfa1 and nfa2, plus a new initial and final states. 
      // First will come the new initial state, then nfa1's states, then
      // nfa2's states, then the new final state

      // make room for the new initial state
      nfa1.ShiftStates(1);

      // make room for nfa1
      nfa2.ShiftStates(nfa1.size);

      // create a new nfa and initialize it with (the shifted) nfa2
      NFA newNFA = new NFA(nfa2);

      // nfa1's states take their places in new_nfa
      newNFA.FillStates(nfa1);

      // Set new initial state and the transitions from it
      newNFA.AddTrans(0, nfa1.initial, (char)Constants.Epsilon);
      newNFA.AddTrans(0, nfa2.initial, (char)Constants.Epsilon);

      newNFA.initial = 0;

      // Make up space for the new final state
      newNFA.AppendEmptyState();

      // Set new final state
      newNFA.final = newNFA.size - 1;

      newNFA.AddTrans(nfa1.final, newNFA.final, (char)Constants.Epsilon);
      newNFA.AddTrans(nfa2.final, newNFA.final, (char)Constants.Epsilon);

      return newNFA;
    }

    /// <summary>
    /// Builds an alternation of nfa1 and nfa2 (nfa1|nfa2)
    /// </summary>
    /// <param name="nfa1"></param>
    /// <param name="nfa2"></param>
    /// <returns></returns>
    public static NFA BuildNFAConcat(NFA nfa1, NFA nfa2)
    {
      // How this is done: First will come nfa1, then nfa2 (its initial state replaced
      // with nfa1's final state)
      nfa2.ShiftStates(nfa1.size - 1);

      // Creates a new NFA and initialize it with (the shifted) nfa2
      NFA newNFA = new NFA(nfa2);

      // nfa1's states take their places in newNFA
      // note: nfa1's final state overwrites nfa2's initial state,
      // thus we get the desired merge automatically (the transition
      // from nfa2's initial state now transits from nfa1's final state)
      newNFA.FillStates(nfa1);

      // Sets the new initial state (the final state stays nfa2's final state,
      // and was already copied)
      newNFA.initial = nfa1.initial;

      return newNFA;
    }

    /// <summary>
    /// Builds a star (kleene closure) of nfa (nfa*)
    /// How this is done: First will come the new initial state, then NFA, then the new final state
    /// </summary>
    /// <param name="nfa"></param>
    /// <returns></returns>
    public static NFA BuildNFAStar(NFA nfa)
    {
      // Makes room for the new initial state
      nfa.ShiftStates(1);

      // Makes room for the new final state
      nfa.AppendEmptyState();

      // Adds new transitions
      nfa.AddTrans(nfa.final, nfa.initial, (char)Constants.Epsilon);
      nfa.AddTrans(0, nfa.initial, (char)Constants.Epsilon);
      nfa.AddTrans(nfa.final, nfa.size - 1, (char)Constants.Epsilon);
      nfa.AddTrans(0, nfa.size - 1, (char)Constants.Epsilon);

      nfa.initial = 0;
      nfa.final = nfa.size - 1;

      return nfa;
    }
  }

}

We pass the parse tree (see last post) to a function responsible for converting the parse tree to an NFA.

private NFA TreeToNFA(ParseTree tree)
{
  switch(tree.type)
  {
    case ParseTree.NodeType.Chr:
      return BuildNFABasic(tree.data.Value);
    case ParseTree.NodeType.Alter:
      return BuildNFAAlter(TreeToNFA(tree.left), TreeToNFA(tree.right));
    case ParseTree.NodeType.Concat:
      return BuildNFAConcat(TreeToNFA(tree.left), TreeToNFA(tree.right));
    case ParseTree.NodeType.Star:
      return BuildNFAStar(TreeToNFA(tree.left));
    case ParseTree.NodeType.Question:
      return BuildNFAAlter(TreeToNFA(tree.left), BuildNFABasic((char)Constants.Epsilon));
    default:
      return null;
  }
}

The TreeToNFA function delegates to the building functions that employ Thompson Construction to build the NFA from basic inputs or compositions of other NFAs.

TreeToNFA is recursive! It calls itself, that is, the calls are put into the stack. Debug the code to see how it works.

The building functions are well documented. Take a look at the comments in the code above.

While debugging the code we get the following variables’ structure:

NFADebugging
Figure 1 - Variables' structure while in debugging mode

From the above picture we see that there’s a transition from state 8 to state 9 with the input symbol ‘n’.

The NFA’s Show function will only output something if the transition’s value present in the transTable variable is different from 0 ‘\0’, that is, if there’s indeed a transition from one state to another.

So, for the regex (l|e)*n?(i|e)el* we get the following transition table:

This NFA has 22 states: 0 - 21
The initial state is 0
The final state is 21

Transition from 0 to 1 on input Epsilon

Transition from 0 to 7 on input Epsilon

Transition from 1 to 2 on input Epsilon

Transition from 1 to 4 on input Epsilon

Transition from 2 to 3 on input l

Transition from 3 to 6 on input Epsilon

Transition from 4 to 5 on input e

Transition from 5 to 6 on input Epsilon

Transition from 6 to 1 on input Epsilon

Transition from 6 to 7 on input Epsilon

Transition from 7 to 8 on input Epsilon

Transition from 7 to 10 on input Epsilon

Transition from 8 to 9 on input n

Transition from 9 to 12 on input Epsilon

Transition from 10 to 11 on input Epsilon

Transition from 11 to 12 on input Epsilon

Transition from 12 to 13 on input Epsilon

Transition from 12 to 15 on input Epsilon

Transition from 13 to 14 on input i

Transition from 14 to 17 on input Epsilon

Transition from 15 to 16 on input e

Transition from 16 to 17 on input Epsilon

Transition from 17 to 18 on input e

Transition from 18 to 19 on input Epsilon

Transition from 18 to 21 on input Epsilon

Transition from 19 to 20 on input l

Transition from 20 to 19 on input Epsilon

Transition from 20 to 21 on input Epsilon

Figure 2 - NFA’s transition table for the regex (l|e)*n?(i|e)el*

This is the NFA’s graph representation:

Figure 3 - NFA’s graph representation for the regex (l|e)*n?(i|e)el*

As you see, some states have an eps-transition – eps or epsilon (ε) that represents "nothing" or "no input". This is absolutely valid in an NFA.

Next time we’ll dive into the DFA class that uses a subset machine to construct a DFA based on an NFA.

See you there!

Updated on 5/12/2009 09:54:00 PM

As I finished writing the posts, here goes the list that points to them:

Regular Expression Engine in C# (the Story)
Regex engine in C# - the Regex Parser
Regex engine in C# - the DFA
Regex engine in C# - matching strings

Source code: https://github.com/leniel/RegexEngine

References
[1] Kokholm, Niels; Sestoft, Peter. The C5 Generic Collection Library for C# and CLI. Available at <http://www.itu.dk/research/c5/>. Accessed on May 2, 2008.