Alex Elliott

The internet home of a prospective software engineer

This is my personal blog where I discuss projects that I'm currently working on, work I've recently completed, or write about any topic which has caught my interest in the world of Computing from my studies or from my personal research.

Latest Articles

New Lexing Parsers and What They Mean for Expression Editor

February 2nd, 2012

Back in September (prior to some uni work that rather soaked up my free time) I pushed a fairly major restructuring of Expression Editor to the repository. Functionally it has meant a bit of a step back, but the new structure does mean some significant improvements to how it works now, and provides scope for some things which would not have been possible before. I thought that since this has been neglected for a while, I might as well give some details on it while I still have a window of opportunity.

What has changed?

There is one major difference in how it operates now compared to how it worked before. Previously, the parsing of regular expressions was rather ad hoc: it was a process that grew as I built in support for more syntax elements. As a result, in the long term it would have become increasingly difficult to ensure things were being parsed correctly. It also did not react well to being passed an invalid expression: it would tend to just fail, so the display did not update while the expression was invalid.

To improve on that ad-hoc approach, I replaced it with a single consistent method – a lexing parser. The system defines a range of tokens (over 100 different ones) from simple literals (T_LITERAL), through common syntax elements like ^ (T_STARTING_POSITION), through to less commonly used syntax elements like “(?<!” (which is of course T_NEGATIVE_LOOKBEHIND_ASSERTION_OPEN). The tokens available span across regular expression formats and provide a unified representation within Expression Editor to work with.

Of course, there need to be backend-specific parsers which convert their respective regular expression formats into a sequence of unified tokens. This is handled via a polymorphic set of parser classes derived from the Parser base class (parser.cpp/parser.hpp), with the general logic of the parser living in the base class. It uses a map from each token in the regular expression format onto a regular expression describing that syntax, and finds the longest match it can (where several matches are of the same length, the first match is used). The result is then passed to handleToken(), which applies the logic for a token of that type. For example, when a T_GROUPING_OPEN is found, it should continue to consume tokens until a T_GROUPING_CLOSE is found; if it reaches the end of the sequence without finding one, that T_GROUPING_OPEN should be reassigned the type T_ERROR as it is not balanced.
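To make the longest-match rule and the unbalanced-group handling a bit more concrete, here is a minimal sketch. It is not the code from parser.cpp: the token table is a tiny made-up subset, the lex() and markUnbalancedGroups() functions are hypothetical names, std::regex stands in for whatever the real parser uses, and the balancing is done with a simple stack after lexing rather than inside handleToken(). Treating a stray closing bracket as T_ERROR is also my assumption for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical subset of the unified token types (the real set has over 100).
enum class TokenType {
    T_LITERAL,
    T_STARTING_POSITION,
    T_GROUPING_OPEN,
    T_GROUPING_CLOSE,
    T_NEGATIVE_LOOKBEHIND_ASSERTION_OPEN,
    T_ERROR
};

struct Token {
    TokenType type;
    std::string text;
};

using TokenTable = std::vector<std::pair<TokenType, std::regex>>;

// Illustrative table for a PCRE-like format: token type -> pattern describing
// its syntax. The order of entries decides which token wins on a tie.
const TokenTable& tokenTable() {
    static const TokenTable table = {
        {TokenType::T_NEGATIVE_LOOKBEHIND_ASSERTION_OPEN, std::regex("\\(\\?<!")},
        {TokenType::T_GROUPING_OPEN, std::regex("\\(")},
        {TokenType::T_GROUPING_CLOSE, std::regex("\\)")},
        {TokenType::T_STARTING_POSITION, std::regex("\\^")},
        {TokenType::T_LITERAL, std::regex("[^()^]")},
    };
    return table;
}

// Split the input into tokens, always taking the longest match available
// at the current position.
std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    std::string::const_iterator pos = input.cbegin();
    while (pos != input.cend()) {
        TokenType bestType = TokenType::T_ERROR;
        std::ptrdiff_t bestLength = 0;
        for (const auto& entry : tokenTable()) {
            std::smatch match;
            // match_continuous only accepts matches starting exactly at pos.
            if (std::regex_search(pos, input.cend(), match, entry.second,
                                  std::regex_constants::match_continuous) &&
                match.length(0) > bestLength) {  // strict '>' keeps the first on ties
                bestType = entry.first;
                bestLength = match.length(0);
            }
        }
        if (bestLength == 0) {
            bestLength = 1;  // nothing matched: emit a one-character T_ERROR
        }
        assert(bestLength <= input.cend() - pos);  // never run past the end
        tokens.push_back({bestType, std::string(pos, pos + bestLength)});
        pos += bestLength;
    }
    return tokens;
}

// Reassign T_ERROR to any opening token that is never closed (and, as an
// assumption of this sketch, to any closing token with nothing to close).
void markUnbalancedGroups(std::vector<Token>& tokens) {
    std::vector<std::size_t> openIndices;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const TokenType type = tokens[i].type;
        if (type == TokenType::T_GROUPING_OPEN ||
            type == TokenType::T_NEGATIVE_LOOKBEHIND_ASSERTION_OPEN) {
            openIndices.push_back(i);
        } else if (type == TokenType::T_GROUPING_CLOSE) {
            if (openIndices.empty()) {
                tokens[i].type = TokenType::T_ERROR;
            } else {
                openIndices.pop_back();
            }
        }
    }
    for (std::size_t index : openIndices) {
        tokens[index].type = TokenType::T_ERROR;
    }
}
```

With this sketch, lexing “(?<!a)b” produces a single four-character lookbehind token rather than a T_GROUPING_OPEN followed by literals, because the longer match wins. Lexing “^(a” and then calling markUnbalancedGroups() turns the lone “(” into a T_ERROR while the tokens around it keep their types.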

This structure – as mentioned in the last example given – is capable of handling invalid syntax correctly, and it’s fairly simple to implement new parsing backends as the logic is very consistent. The shared format is also very useful as it means that any new backends which use a different format will be easier to integrate.

What impact does this change have?

Well, initially it means a lot of things are no longer working. That’s unfortunate, but on balance I think it is something that had to be done. At the time of writing, two of the three testing widgets are no longer functional, the save/load/common/recent files functionality is no longer in place, and the visual editing which was there before is gone.

Not everything is bad news though; there are already some new features made easier by this system. The visualisation is now capable of updating even while there are errors in the expression. The parser simply marks them as T_ERROR, and they appear as red blocks until they are fixed, which means you can now spot errors via the visualisation. The amount of configuration possible has also increased: the settings dialog now provides a set of options for how the visualisation is presented, and the aim is to keep adding further options as visual elements are added to the system, allowing the appearance to be tailored entirely to your specifications.

There are also more possibilities for the future than there were previously. Converting the regular expression to a shared format opens up scope for regular expression optimisers and translators. It would be nice if you could take an expression you had already written for PCRE and have the application convert it into, say, POSIX ERE for a system where PCRE is not supported or available. That would be a lot easier with this unified token sequence, as it would simply be a case of translating in the opposite direction to the original parser. Some elements may not be translatable due to limitations of the target format, but nevertheless it would be a useful feature to have.

What work is going on at the moment?

Not as much as I would like, really. I’ve got other stuff on via university – but I do want to try and bring back the features that have gone missing since the restructuring, and then I can get back to bringing in new functionality which would make this a better piece of software than before the restructuring. In the meantime the previous application is tagged in the repository, so the functionality is still available. Hopefully though it won’t need to be around too much longer.

Oh, a final passing point: another thing that has changed since the last time I wrote is that the CMake build system is now capable of producing an NSIS executable installer for Expression Editor. There is one available via GitHub which contains the new build of Expression Editor, and I’ll try to keep it as up to date as possible. I’d be interested to talk with people who’ve deployed Qt/C++/CMake to OSX before, to get a suitable build system working there as well. It shouldn’t be too hard to do, since CPack does have package targets for various OSX installers; it’s just not something I’ve done before.

Back to Work

July 29th, 2010

Well, it’s been a while since the last blog post I’ve written, and a similar stretch of time since I really got much work done on ExpressionEditor. But as you might have guessed since the blog is winding back into life, so too is my work on EE.

Changes So Far

I’m still working my way slowly through the todo list for EE (updated now on the wiki.github home page for EE), major differences that I’ve already managed to cross off include adding support for the ICU regular expression library, and migrating the build process over to CMake (which is still partly ongoing, the more people build it on as many platforms as possible the better – please do report any issues, in IRC would be favourite as I can work through the issues directly).

ICU (International Components for Unicode) supports a regular expression engine that seems to be a popular choice particularly for Mac-based programming, and appears to be fairly full-featured.  I hope that ICU support will be useful to people, and that it gets good use.

As to the migration to CMake, this is for several reasons. It should make EE easier to distribute: at the moment it supports a very basic “make install” target, and I will be expanding that to bring the common files into /etc (and at some point I should improve the content of those common files, but that’s something I can do later) and to take the README, CREDITS, and similar files into documentation.

It should also now correctly auto-detect which of the optional dependencies (all of the regular expression libraries bar that included in Qt4.6+) are present on the system, and build a version of EE which supports the ones which are available.  This hopefully will make it less of a pain for those who don’t have PCRE or ICU (or in the case of Win32 POSIX’s regex.h).

And talking of making it easier for Win32, after a fairly long period where no EE was compiled on Win32 it was recently built, at the least proving that the code is still very happy to compile there when it’s not missing optional deps – which is very good news to get. :-)

Looking Forward

Of course, there’s still quite a lot to do to reach the plans I’ve set out, and since I’m working full-time I don’t have as much time as I’d like to work on EE (which I hope you’ll agree is an interesting and quite fun little tool, and it’s just as nice to work on :-) ).  But I expect to make some progress nonetheless, and if I do, I’ll keep all one maybe two people who have this in their RSS readers up to speed. ;-)

Expression Editor Update (2)

January 24th, 2010

Since I’ve had some more time to work on Expression Editor recently I thought it was about time I wrote another update for the progress of the project, and some related news that affects it.

Expression Editor on Mac OSX

A recent screenshot of Expression Editor on Mac OSX

From Last Time…

In the previous post I noted a few areas in progress and some that I wanted to look at in the future.  So to catch up there, Drag&Drop is generally a bit more reliable and produces slightly neater results but is otherwise unchanged so far, and the new testing widget is still waiting.  A significant change has been made in the area of supported regular expression formats however.

The application now has backends for Qt4, PCRE and POSIX ERE formats (though the visualisation could still mess up some PCRE/POSIX elements; let me know if anything breaks).  You can select the format you wish to work in from the menu bar. The active format is displayed in the bottom right of the screen so you know which mode it is currently in, and the save format has been slightly extended to save your preference for each particular expression.

The default mode has also been changed to PCRE, since it is probably the most powerful backend available.  Another minor UI change is an expression status indicator to the right of the text input: a green tick while the expression is valid, and a red exclamation mark when it is invalid.  In addition, if you mouse over the invalid indicator, the tooltip shows the error returned from the active regular expression backend.
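As a rough illustration of what sits behind that indicator, the check boils down to asking the backend to compile the expression and keeping hold of whatever error it reports. The sketch below uses std::regex as a stand-in backend; ExpressionStatus and checkExpression() are hypothetical names, not the ones in the application, and the wording of the error message depends on the standard library in use.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: valid drives the tick/exclamation mark,
// error is what the tooltip would show.
struct ExpressionStatus {
    bool valid;
    std::string error;
};

// Try to compile the expression with the stand-in backend (std::regex)
// and report the backend's own error message on failure.
ExpressionStatus checkExpression(const std::string& expression) {
    try {
        std::regex compiled(expression);
        (void)compiled;  // only compiled to see whether it is accepted
        return {true, ""};
    } catch (const std::regex_error& error) {
        return {false, error.what()};
    }
}
```

Here checkExpression("a(b)c") reports a valid expression, while checkExpression("a(b") reports an invalid one along with the library’s message about the unbalanced bracket.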

In Related News

As you probably saw above the screenshot used is from Mac OSX.  In order to improve my capacity to test Expression Editor I’ve gotten myself a Mac Mini as well as my Slackware Linux laptop.  Set up with Synergy+ this means I can simultaneously develop the application in Linux and test it in OSX.  One behavioural difference between the two operating systems has already been resolved, so hopefully the application should start behaving much more reliably on OSX as well as Linux from now on.

Expression Editor Update

December 24th, 2009

A fair bit of progress has been made since my last blog entry so I thought I’d note a few things that have landed in the repository and a few things that I intend to add at a later date.

Drag and Drop

Initial support for drag and drop editing has been added.  You can now re-order the elements of the expression by dragging an element in the visualisation to one of the valid drop zones (which are automatically highlighted as you can see in this screenshot).  With this in place it becomes significantly easier to add the other bits of drag/drop editing I want the editor to support.  Eventually as well as reordering (plus the double-click edit dialogs which are also currently included for several elements) I aim to include:

  • Drag/drop adding of new elements from the toolbar to the left of the visualisation.  This should probably spawn a dialog/wizard and then insert the resulting regular expression element into the current expression.
  • Reordering needs more support in the alternatives item.  Currently there are only valid drop zones for placing items inside existing alternation branches; there should also be a drop zone allowing the user to drop an element in as a new alternative.
  • Possibly a simple “trash” element, which simply accepts the drop, and results in the item being deleted from the scene.

Regexp Formats

As stated in a few places in the application, before the initial release I hope to support PCRE, POSIX Extended and Qt format regular expressions.  This means supporting a range of different regexp syntaxes, and intelligently warning when switching between formats if some of the expression cannot be used directly in the new format.  It should also offer to try to translate the expression if such a problem exists.

For example, if we’re currently in PCRE mode with an expression containing “\w” and we switch to POSIX Extended, this should trigger a warning and then offer to translate, turning “\w” into “[[:alnum:]_]” (POSIX defines no word character class, so the equivalent has to be spelled out).
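A minimal sketch of that kind of translation might look like the following. The Translation struct and pcreToPosixEre() are hypothetical names, and only three shorthand classes are handled. Note that it maps “\w” to “[[:alnum:]_]”, since POSIX itself has no named class for word characters. It also assumes the shorthand appears outside a bracket expression; something like “[\w-]” would need extra handling that is left out here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result: the translated expression, plus whether anything had
// to be rewritten (which is what would trigger the warning).
struct Translation {
    std::string expression;
    bool changed;
};

// Rewrite PCRE shorthand classes into POSIX ERE bracket expressions.
// Assumes the shorthands are not inside an existing bracket expression.
Translation pcreToPosixEre(const std::string& pcre) {
    Translation result{"", false};
    for (std::size_t i = 0; i < pcre.size(); ++i) {
        // Copy anything that is not the start of an escape sequence.
        if (pcre[i] != '\\' || i + 1 == pcre.size()) {
            result.expression += pcre[i];
            continue;
        }
        const char escaped = pcre[++i];  // consume the escaped character too
        switch (escaped) {
            case 'w':
                result.expression += "[[:alnum:]_]";
                result.changed = true;
                break;
            case 'd':
                result.expression += "[[:digit:]]";
                result.changed = true;
                break;
            case 's':
                result.expression += "[[:space:]]";
                result.changed = true;
                break;
            default:
                // Any other escape (including an escaped backslash) is kept.
                result.expression += '\\';
                result.expression += escaped;
                break;
        }
    }
    return result;
}
```

Reading the escape and the character after it as a pair is what keeps an escaped backslash followed by a plain “w” from being mistaken for “\w”.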

At the moment, the application only supports Qt’s internal format, and I think correctly represents much of what it supports internally.  The format is very much like a slightly restricted PCRE format, so Qt/PCRE conversion should be fairly straightforward.

Expression Testing

The editor currently has an element at the bottom of the layout which allows you to test the regular expression for given short strings.  This is good for most cases, since it allows you to have a few regexp “unit tests” of sorts, where you test fringe cases and observe if it matches, partially matches, and whether the capture groups work as expected.

In addition to this it would be useful to have a few other methods of testing included.  The testing widget should eventually be a tabbed widget with the currently available tester as one option, plus at least two additional panes: a “bulk text” pane which takes paragraph-length or longer inputs and highlights every part of the text that is matched by the regular expression, and a “replacement” pane which takes input of a similar length to “bulk text” and applies the regular expression with a given replacement string (which could also be a regular expression).
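The logic behind those two extra panes is fairly small. As a sketch, again with std::regex standing in for the selected backend and with hypothetical function names, the “bulk text” pane needs the position and length of every match so it can highlight them, and the “replacement” pane needs a single substitution call:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// (offset, length) of every non-empty match in the text, i.e. the ranges a
// "bulk text" pane would highlight.
std::vector<std::pair<std::size_t, std::size_t>> findHighlightRanges(
        const std::string& text, const std::regex& expression) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), expression);
         it != end; ++it) {
        if (it->length(0) == 0) {
            continue;  // an empty match has nothing to highlight
        }
        ranges.emplace_back(static_cast<std::size_t>(it->position(0)),
                            static_cast<std::size_t>(it->length(0)));
    }
    return ranges;
}

// Replace every match in the text. With std::regex's default format rules,
// the replacement string may refer to capture groups as $1, $2, and so on.
std::string applyReplacement(const std::string& text,
                             const std::regex& expression,
                             const std::string& replacement) {
    return std::regex_replace(text, expression, replacement);
}
```

For instance, searching “cat bat cat” for “cat” gives two ranges, at offsets 0 and 8, each of length 3. Applying “(\d+)-(\d+)-(\d+)” to “2009-12-24” with the replacement “$3/$2/$1” gives “24/12/2009”.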

Anyway, that’s what I’ve been working on and some of what I want to include later.  Work goes on. :)