HTML to XHTML using SGMLReader

Chris Lovett wrote a wonderful .NET library called SGMLReader which,
when passed a regular HTML file, will spit out XHTML.  The library
relies on SGML parsing and uses a DTD (Document Type Definition) file to parse
loosely formatted HTML.
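
To give a feel for the library before it is wired into ASP.NET, here is a minimal sketch of calling it directly on a string. The class name HtmlToXhtmlSketch and the method name Convert are hypothetical and only there for illustration. It uses CaseFolding.ToLower (an assumption on my part, since XHTML expects lower-case element names), whereas the filter later in this post uses CaseFolding.None.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

// Hypothetical helper, for illustration only.
public static class HtmlToXhtmlSketch
{
    public static string Convert(string html)
    {
        // Configure the reader to treat the input as HTML.
        SgmlReader sgmlReader = new SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.CaseFolding = CaseFolding.ToLower;
        sgmlReader.InputStream = new StringReader(html);

        // SgmlReader is an XmlReader, so an XmlTextWriter can copy its nodes.
        StringWriter output = new StringWriter();
        XmlTextWriter xmlWriter = new XmlTextWriter(output);

        sgmlReader.Read();
        while (!sgmlReader.EOF)
            xmlWriter.WriteNode(sgmlReader, true);

        xmlWriter.Flush();
        sgmlReader.Close();
        return output.ToString();
    }
}
```

Something like HtmlToXhtmlSketch.Convert("<p>Hello<br>world") should come back as well-formed markup with the open tags closed, though the exact output depends on the SgmlReader version and its DTD.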

I’ve been playing with Chris’s library this
week and trying to convert HTML to XHTML on the fly using an ASP.NET
Http Filter. The code below consists of the following:

  • An HttpModule, which sets the filter property of the response object for all ASPX page requests.
  • An Http filter, which is a custom stream, to manipulate the HTML content for the requested page.
  • The entry in the web.config file, required for the module to operate.

So, how does the code work?

To understand my code, you should know a little about how Http modules
work in ASP.NET. I am not going to broach this subject in this post,
but you can find out all you need to know about Http modules on MSDN.

My module code intercepts each incoming web request sent through
IIS, checks to see if an ASPX page has been requested, and if so, sets
the filter property of the response object to a new instance of XhtmlFilter.

When ASP.NET is ready to push processed HTML content back to the client
browser, via IIS, it uses the filter property in the response
object.  The default filter used by ASP.NET is
System.Web.HttpResponseStreamFilterSink.  However, it is possible
to assign a custom filter to the filter property, which will perform some
additional processing before forwarding the content to the default
filter.  This is exactly what my filtering code does.  My XhtmlFilter
class is a Stream-derived class, which captures incoming HTML data from
ASP.NET, converts it to XHTML using SGMLReader, and then forwards the
changed content to the default filter.

The XhtmlFilter class inherits from a base class, HttpFilterBase, which abstracts away the stream functions not supported by Http
filters.  Inside my filter class I make use of a MemoryStream
object to capture all the HTML content pushed by the framework before
processing the content with SGMLReader. As SGMLReader parses the HTML
data, the converted XHTML is streamed out to the default filter, and on
to the client browser.

The code follows.

XhtmlFilterModule.cs

using System;
using System.Web;

namespace HTML2XHTML.HttpFilters
{
    public class XhtmlFilterModule : IHttpModule
    {
        public void Dispose() {}

        public void Init(HttpApplication context)
        {
            context.BeginRequest += new EventHandler(ModuleBeginRequest);
        }

        private void ModuleBeginRequest(object sender, EventArgs e)
        {
            // See if the request is for an ASPX page.
            // Note: this comparison is case-sensitive.
            HttpRequest request = HttpContext.Current.Request;
            if (request.Url.AbsolutePath.EndsWith(".aspx"))
            {
                // Install the filter, wrapping the existing (default) filter.
                HttpResponse response = HttpContext.Current.Response;
                response.Filter = new XhtmlFilter(response.Filter);
            }
        }
    }
}

XhtmlFilter.cs

using System;
using System.IO;
using System.Text;
using System.Xml;
using Sgml;

namespace HTML2XHTML.HttpFilters
{
    public class XhtmlFilter : HttpFilterBase
    {
        private MemoryStream _memStream = null;
        private BinaryWriter _writer = null;

        public XhtmlFilter(Stream baseStream) : base(baseStream)
        {
            _memStream = new MemoryStream();
            _writer = new BinaryWriter(_memStream);
        }

        public override void Write(byte[] buffer, int offset, int count)
        {
            // Check if stream is open.
            if (Closed)
                throw new ObjectDisposedException("XhtmlFilter");
            // Buffer the HTML in the memory stream; it is converted in Close().
            _writer.Write(buffer, offset, count);
        }

        public override void Flush()
        {
            _writer.Flush();
        }

        public override void Close()
        {
            // Do nothing if the filter has already been closed.
            if (Closed)
                return;
            // Seek to the beginning of the memory stream.
            _memStream.Seek(0, SeekOrigin.Begin);
            // All output has been written to the stream, so process the HTML.
            SgmlReader sgmlReader = new SgmlReader();
            StreamReader streamReader = new StreamReader(_memStream);
            // The response encoding is assumed to be UTF-8 here.
            XmlTextWriter xmlWriter = new XmlTextWriter(BaseStream, Encoding.UTF8);
            sgmlReader.CaseFolding = CaseFolding.None;
            sgmlReader.DocType = "HTML";
            sgmlReader.InputStream = streamReader;
            sgmlReader.Read();
            while (!sgmlReader.EOF)
                xmlWriter.WriteNode(sgmlReader, true);
            // Close the writer.
            xmlWriter.Flush();
            xmlWriter.Close();
            // Close the reader.
            streamReader.Close();
            sgmlReader.Close();
            // Close the base version.
            base.Close();
        }
    }
}

HttpFilterBase.cs

using System;
using System.IO;

namespace HTML2XHTML.HttpFilters
{
    public abstract class HttpFilterBase : Stream
    {
        private Stream _baseStream;
        private bool _closed;

        protected Stream BaseStream
        {
            get { return this._baseStream; }
        }

        public override bool CanRead
        {
            get { return false; }
        }

        public override bool CanWrite
        {
            get { return !_closed; }
        }

        public override bool CanSeek
        {
            get { return false; }
        }

        protected bool Closed
        {
            get { return _closed; }
        }

        public override long Length
        {
            get { throw new NotSupportedException(); }
        }

        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }

        protected HttpFilterBase(Stream baseStream)
        {
            this._baseStream = baseStream;
            this._closed = false;
        }

        public override void Close()
        {
            if (!_closed)
            {
                _closed = true;
                _baseStream.Close();
            }
        }

        public override void Flush()
        {
            _baseStream.Flush();
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            throw new NotSupportedException();
        }

        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotSupportedException();
        }

        public override void SetLength(long value)
        {
            throw new NotSupportedException();
        }
    }
}

Module entry in web.config
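
The entry itself is not reproduced here. As a rough sketch, a classic ASP.NET httpModules registration for the module above would look something like the following. The assembly name HTML2XHTML and the name attribute value are assumptions on my part; substitute whatever your project actually compiles to.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- type = "full class name, assembly name"; the assembly name is assumed -->
      <add name="XhtmlFilterModule"
           type="HTML2XHTML.HttpFilters.XhtmlFilterModule, HTML2XHTML" />
    </httpModules>
  </system.web>
</configuration>
```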

6 thoughts on “HTML to XHTML using SGMLReader”

  1. Chris Hynes

    I’m currently implementing a filter of my own. Thanks for the great code.

    One thing you may want to do is pass Response.Encoding to the filter rather than assuming it’s UTF-8.

  2. Phil Morris

    Why not just make sure the original ASPX files are XHTML compliant? SgmlReader will kill ASP.NET controls, so I wrote AspxCleaner as a wrapper around SgmlReader to ‘protect’ the ASP.NET controls and other .NET constructs that would otherwise be destroyed. You can get it at http://www.dotnetwebtools.com, and it’s free! I haven’t gone so far as to write code to replace deprecated HTML tags, but I will make my source code available to anyone interested in doing so.

  3. http://

    Great to have people willing to share their work. I’d love to see the source code for Phil Morris’s AspxCleaner. Could you make the source available from your website?

  4. Skydiver420

    Hi !

    I recently came across the interesting SgmlReader from Chris Lovett, but
    I had 2 major issues!

    GotDotNet.com has been shut down, and the sample is nowhere to be found on MSDN.

    Posting comments on Chris’s web site produces an error (access denied).

    So can anybody put back the code…

    Thanks !

  5. Steve Bjorg

    Quick post to point out the 1.8.2 release of the .NET SgmlReader library on SourceForge.net (http://sourceforge.net/project/showfiles.php?group_id=173074&package_id=246977&release_id=644363) courtesy of MindTouch (http://www.mindtouch.com).

    Since the original release a few things have changed… for the better. GotDotNet might not exist anymore (the link in the article is now broken), but SgmlReader has continued to evolve nicely nonetheless.

    The latest version (v1.8.2) was released on Dec. 1, 2008 on SourceForge.net. Also, check out the community page to catch up on all the improvements (http://wiki.developer.mindtouch.com/Community/SgmlReader).

    Cheers,
    – Steve
