Alex Elliott
The internet home of a prospective software engineer
This is my personal blog where I discuss projects that I'm currently working on, work I've recently completed, or write about any topic which has caught my interest in the world of Computing from my studies or from my personal research.
Musings on Syntax Highlighting for Websites
February 6th, 2009
Syntax highlighting can be very important to some websites, particularly those featuring articles on programming practice or theory, and pastebin/nopaste websites used for collaborative debugging. However, most highlighting packages rely on pattern matching to highlight a document rather than on a more accurate but more complex lexical parser, and they cannot apply more than one highlighting scheme within a single document. If you choose to highlight something as PHP, then only the PHP segments will be highlighted, even though the same document may well also contain HTML, XML, JavaScript, CSS and so on.
These are two features lacking from most existing syntax highlighting packages today, and ones that I think would be extremely useful to have in publicly available free software tools. The question is simply whether it is feasible to include them, or whether what we’ve got currently is as good as it’s likely to get.
Pattern Matching versus Lexical Parsing
These are the two main ways of taking a source document and producing a highlighting for it. Pattern matching uses regular expressions to catch recognisable patterns in the given language; it is simpler to produce, but does not guarantee good results. Lexical parsing, on the other hand, is much more complex to implement but, when done well, is a more flexible method for producing a highlighting of some input.
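To make the pattern matching approach concrete, here is a minimal sketch in C++ using std::regex. The function name highlight_keywords, the short keyword list and the <b> output tags are my own assumptions for illustration; they are not taken from any existing highlighting package, and the sketch does no HTML escaping.

```cpp
#include <regex>
#include <string>

// Pattern-matching highlighter: wrap a handful of keywords in <b> tags.
// The keyword list is illustrative only and far from complete.
std::string highlight_keywords(const std::string& code) {
    // \b anchors the match to whole words, so "iffy" is left alone.
    static const std::regex keyword(R"(\b(if|else|while|return|typedef)\b)");
    // $1 in the replacement refers to the keyword that was matched.
    return std::regex_replace(code, keyword, "<b>$1</b>");
}
```

Calling highlight_keywords on the input s = "if"; also wraps the if inside the string literal, because the expression has no idea it is looking at a string. That is the sort of result the simpler method cannot rule out without piling on further patterns.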
Lexical parsing involves going through the input from start to finish, breaking it up into “tokens”, which are small segments of the input with some associated meta-data. In essence it breaks the code down into its components: strings, keywords, variables, etc. The power of this model is that while the parser is working it can use its state information on things like scope and context to provide more accurate and more informative details. In fact, a full lexical parser would be able to identify syntax errors and highlight them automatically.
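A cut-down tokeniser along these lines might look like the sketch below. It handles only a toy C-like language, and the names TokenKind, Token and tokenize, as well as the keyword list, are assumptions made for this example rather than part of any real highlighter.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

enum class TokenKind { Keyword, Identifier, Number, String, Punctuation, Whitespace };

struct Token {
    TokenKind kind;
    std::string text;  // the exact characters this token covers
};

// Split the input into tokens in a single left-to-right pass.
std::vector<Token> tokenize(const std::string& src) {
    static const std::set<std::string> keywords = {"if", "else", "while", "return", "typedef"};
    auto uc = [&](std::size_t k) { return static_cast<unsigned char>(src[k]); };
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t start = i;
        TokenKind kind;
        if (std::isspace(uc(i))) {
            while (i < src.size() && std::isspace(uc(i))) ++i;
            kind = TokenKind::Whitespace;
        } else if (std::isalpha(uc(i)) || src[i] == '_') {
            while (i < src.size() && (std::isalnum(uc(i)) || src[i] == '_')) ++i;
            kind = keywords.count(src.substr(start, i - start)) ? TokenKind::Keyword
                                                                : TokenKind::Identifier;
        } else if (std::isdigit(uc(i))) {
            while (i < src.size() && std::isdigit(uc(i))) ++i;
            kind = TokenKind::Number;
        } else if (src[i] == '"') {
            ++i;  // opening quote
            while (i < src.size() && src[i] != '"') {
                if (src[i] == '\\' && i + 1 < src.size()) ++i;  // skip the escaped character
                ++i;
            }
            if (i < src.size()) ++i;  // closing quote, if there is one
            kind = TokenKind::String;
        } else {
            ++i;  // any other single character
            kind = TokenKind::Punctuation;
        }
        tokens.push_back({kind, src.substr(start, i - start)});
    }
    return tokens;
}
```

Because the string branch consumes everything up to the closing quote, the if in s = "if"; ends up inside a single String token and is never considered as a keyword. A highlighter would then only need to map each TokenKind to a colour.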
As to providing more information, with tokenised input it would be fairly trivial to note which braces/brackets match one another, and unlike a pattern matching system you can include information from other parts of the program. Take the example of a C++ typedef such as “typedef vector<string>::iterator vec_iter;”, which provides a new shorthand type “vec_iter” for a vector<string> iterator. While a pattern matching model could probably work out that vec_iter was a type, it would not know what it represented, or whether it was valid. A lexical parser would be able to add a note saying “this is a vector<string> iterator”, provided the typedef was in the provided sample.
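As a rough sketch of the typedef idea, the function below walks a list of token texts and records what each alias stands for, so that a highlighter could later attach a note to every use of the alias. The name collect_typedefs is hypothetical, the input is assumed to be already tokenised with whitespace removed, and only the simplest form of typedef (type first, alias last, then a semicolon) is handled; function pointer and array typedefs are not.

```cpp
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Map each typedef alias to the spelled-out type it stands for.
std::map<std::string, std::string> collect_typedefs(const std::vector<std::string>& tokens) {
    auto is_word = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::map<std::string, std::string> aliases;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] != "typedef") continue;
        // Find the semicolon that ends this declaration.
        std::size_t end = i + 1;
        while (end < tokens.size() && tokens[end] != ";") ++end;
        // Skip declarations that are unterminated or too short to hold a type and an alias.
        if (end == tokens.size() || end < i + 3) { i = end; continue; }
        std::string type;
        for (std::size_t j = i + 1; j + 1 < end; ++j) {
            if (tokens[j].empty()) continue;
            // Keep a space between adjacent words, e.g. "unsigned int".
            if (!type.empty() && is_word(type.back()) && is_word(tokens[j].front())) type += ' ';
            type += tokens[j];
        }
        aliases[tokens[end - 1]] = type;  // the last token before ';' is the alias
        i = end;
    }
    return aliases;
}
```

Given the tokens of the typedef above, the resulting map has vec_iter pointing at vector<string>::iterator, which is exactly the note a pattern matching highlighter has no way of producing.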
Of course, while it is probably the superior method from a functional standpoint, it is significantly more complicated to implement, which raises the question of whether the benefits are worth the extra effort required to produce the highlighter. My personal view is that pattern matching is, for the moment, the better option for things like articles, where we can be confident that the input is a valid piece of code and so should come out fine in a normal highlighter. For uses like pastebin/nopaste sites, though, the extra information would be beneficial, since they are often used for collaborative debugging. There, highlighting syntax errors would help, as would flagging other possible errors such as a used type whose definition is not available. (This might not be a true error, as the definition may be in a file not provided to the highlighter, but it could still be worth noting, and it would definitely be useful for self-contained testcases.)
Language Nesting in Code Samples
The other limitation in many existing syntax highlighters is that they are not able to apply several different language highlighting schemes to one piece of provided input. This can be annoying when you’re highlighting things like web pages, which can easily contain HTML with nested CSS (in <style></style>), nested JS (in <script></script>) and perhaps a server-side language such as PHP (in <?php ?>).
For the most part, just selecting one language to highlight works, since it’s unlikely that more than one requires significant attention at once, but there are situations where it would be useful to have each block highlighted separately. This would require either the user to select ranges of code to highlight with different language engines, or the highlighter to determine automatically what language each segment of code is in. The former is tedious for the end-user and would likely lead to the product not being used, while the latter adds significant complication to the highlighter.
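For the web page case, the automatic route can at least be approximated by looking for the delimiters mentioned above. The sketch below is only that: Segment, Region and split_languages are names invented for this example, only the exact tags <style>, <script> and <?php are recognised (a <script> tag with attributes would be missed), the delimiters themselves are left in the surrounding HTML segment, and blocks nested inside other blocks are not detected.

```cpp
#include <string>
#include <vector>

struct Segment {
    std::string language;  // "html", "php", "css" or "js"
    std::string text;
};

struct Region {
    const char* open;
    const char* close;
    const char* language;
};

// Cut a web page into runs of text, each labelled with one language.
std::vector<Segment> split_languages(const std::string& doc) {
    static const Region regions[] = {
        {"<?php", "?>", "php"},
        {"<style>", "</style>", "css"},
        {"<script>", "</script>", "js"},
    };
    std::vector<Segment> out;
    std::size_t pos = 0;
    while (pos < doc.size()) {
        // Find the nearest opening delimiter at or after pos.
        const Region* best = nullptr;
        std::size_t best_at = std::string::npos;
        for (const Region& r : regions) {
            const std::size_t at = doc.find(r.open, pos);
            if (at < best_at) { best = &r; best_at = at; }
        }
        if (!best) {  // no more nested blocks: the rest is HTML
            out.push_back({"html", doc.substr(pos)});
            break;
        }
        // Everything up to and including the opening delimiter is treated as HTML.
        const std::size_t body = best_at + std::string(best->open).size();
        out.push_back({"html", doc.substr(pos, body - pos)});
        std::size_t close = doc.find(best->close, body);
        if (close == std::string::npos) close = doc.size();  // unterminated block runs to the end
        if (close > body) out.push_back({best->language, doc.substr(body, close - body)});
        pos = close;  // the closing delimiter starts the next HTML run
    }
    return out;
}
```

Each Segment could then be handed to the highlighting scheme for its language. Anything beyond these fixed delimiters, such as guessing the language of an unlabelled snippet, is where the significant complication starts.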
These are things I would like to see in pastebin/nopaste websites, but given the complexity I can’t expect them to just turn up one day. So I figure I might give it a go myself, writing a fairly cut-down proof-of-concept that may appear with pastesite one day (in C++ rather than PHP, since I don’t expect PHP’s performance to be satisfactory for this). As to whether I’ll ever finish it, that remains to be seen, but I do think such a product would be beneficial to the programming community as a whole, and I hope that if I don’t do it, maybe someone else will.
One Response to “Musings on Syntax Highlighting for Websites”
May 23rd, 2009 at 9:30 pm
Hi, nice posts there
thank’s for the interesting information