Chris Lovett wrote a wonderful .NET library called SGMLReader which,
when passed a regular HTML file, will spit out XHTML. The library
relies on SGML parsing and uses a DTD (Document Type Definition) file to parse
unformatted HTML.
I’ve been playing with Chris’s library this
week and trying to convert HTML to XHTML on the fly using an ASP.NET
Http Filter. The code below consists of the following:
- An HttpModule, which sets the filter property of the response object for all ASPX page requests.
- An Http filter, which is a custom stream, to manipulate the HTML content for the requested page.
- The entry in the web.config file, required for the module to operate.
So, how does the code work?
To understand my code, you should know a little about how Http modules
work in ASP.NET. I am not going to broach this subject in this post,
but you can find out all you need to know about Http modules on MSDN.
My module code intercepts each incoming web request sent through
IIS, checks to see if an ASPX page has been requested, and if so, sets
the filter property of the response object to a new instance of XhtmlFilter.
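One detail worth noting about the "has an ASPX page been requested" check: the module's EndsWith(".aspx") call is case-sensitive, so a request for Default.ASPX would not get the filter. The following is a minimal sketch of a case-insensitive check, assuming .NET 2.0 or later (for StringComparison). The AspxPaths class and IsAspxRequest method are hypothetical names used for illustration only; they are not part of the module in this post.

```csharp
using System;

// Hypothetical helper, not part of the original module code.
public static class AspxPaths
{
    // Returns true when the URL path ends in ".aspx", ignoring case.
    public static bool IsAspxRequest(string absolutePath)
    {
        if (string.IsNullOrEmpty(absolutePath))
            return false;

        // Ordinal comparison avoids culture-specific casing rules.
        return absolutePath.EndsWith(".aspx", StringComparison.OrdinalIgnoreCase);
    }
}
```

In the module, the condition could then read AspxPaths.IsAspxRequest(request.Url.AbsolutePath), if case-insensitive matching is what you want.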
When ASP.NET is ready to push processed HTML content back to the client
browser, via IIS, it writes that content through the filter property of the
response object. The default filter used by ASP.NET is
System.Web.HttpResponseStreamFilterSink. However, it is possible
to assign a custom filter to the filter property, which will perform some
additional processing before forwarding the content to the default
filter. This is exactly what my filtering code does. My XhtmlFilter
class is a Stream-derived class, which captures incoming HTML data from
ASP.NET, converts it to XHTML using SGMLReader, and then forwards the
changed content to the default filter.
The XhtmlFilter class inherits from a base class – HttpFilterBase, which abstracts away the stream functions not supported by Http
filters. Inside my filter class I make use of a MemoryStream
object to capture all the HTML content pushed by the framework before
processing content with SGMLReader. As SGMLReader parses the HTML
data, the converted XHTML is streamed out to the default filter, and on
to the client browser.
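The buffer-then-transform pattern described above can be tried outside ASP.NET. The sketch below is a simplified stand-in, not the XhtmlFilter itself: UpperCaseFilter and FilterDemo are hypothetical names, the transformation is a plain upper-casing instead of an SGMLReader conversion, and a MemoryStream plays the role of the default filter. It only illustrates why the work happens in Close: ASP.NET may call Write several times with partial chunks, and the complete document is available only once the stream is closed.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for XhtmlFilter: buffers every write, then transforms
// the whole payload once when the stream is closed.
public class UpperCaseFilter : Stream
{
    private readonly Stream _baseStream;
    private readonly MemoryStream _buffer = new MemoryStream();
    private bool _closed;

    public UpperCaseFilter(Stream baseStream) { _baseStream = baseStream; }

    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return !_closed; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }

    // Nothing is forwarded here; chunks only accumulate in memory.
    public override void Write(byte[] buffer, int offset, int count)
    {
        if (_closed) throw new ObjectDisposedException("UpperCaseFilter");
        _buffer.Write(buffer, offset, count);
    }

    public override void Flush() { }

    public override void Close()
    {
        if (_closed) return;
        _closed = true;

        // The complete document is available only now, so transform it here.
        string text = Encoding.UTF8.GetString(_buffer.ToArray());
        byte[] output = Encoding.UTF8.GetBytes(text.ToUpperInvariant());
        _baseStream.Write(output, 0, output.Length);
        _baseStream.Flush();

        // Unlike the real filter, the base stream is left open so the demo
        // can read the result back.
        base.Close();
    }

    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}

public static class FilterDemo
{
    // Writes the content in two chunks, as ASP.NET might, and returns
    // what the downstream stream received.
    public static string Run()
    {
        MemoryStream sink = new MemoryStream();
        UpperCaseFilter filter = new UpperCaseFilter(sink);
        byte[] first = Encoding.UTF8.GetBytes("<p>hello ");
        byte[] second = Encoding.UTF8.GetBytes("world</p>");
        filter.Write(first, 0, first.Length);
        filter.Write(second, 0, second.Length);
        filter.Close();
        return Encoding.UTF8.GetString(sink.ToArray());
    }
}
```

The trade-off of this design is memory: the whole page sits in the buffer until Close, which is fine for ordinary pages but worth keeping in mind for very large responses.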
The code follows.
XhtmlFilterModule.cs
using System;
using System.Web;

namespace HTML2XHTML.HttpFilters
{
    public class XhtmlFilterModule : IHttpModule
    {
        public void Dispose() {}

        public void Init(HttpApplication context)
        {
            context.BeginRequest += new EventHandler(ModuleBeginRequest);
        }

        private void ModuleBeginRequest(object sender, EventArgs e)
        {
            // See if request is for an ASPX page.
            HttpRequest request = HttpContext.Current.Request;
            if (request.Url.AbsolutePath.EndsWith(".aspx"))
            {
                // Install the filter.
                HttpResponse response = HttpContext.Current.Response;
                response.Filter = new XhtmlFilter(response.Filter);
            }
        }
    }
}
XhtmlFilter.cs
using System;
using System.IO;
using System.Text;
using System.Xml;
using Sgml;

namespace HTML2XHTML.HttpFilters
{
    public class XhtmlFilter : HttpFilterBase
    {
        private MemoryStream _memStream = null;
        private BinaryWriter _writer = null;

        public XhtmlFilter(Stream baseStream) : base(baseStream)
        {
            _memStream = new MemoryStream();
            _writer = new BinaryWriter(_memStream);
        }

        public override void Write(byte[] buffer, int offset, int count)
        {
            // Check if stream is open.
            if (Closed)
                throw new ObjectDisposedException("XhtmlFilter");
            // Write to the memory stream.
            _writer.Write(buffer, offset, count);
        }

        public override void Flush()
        {
            _writer.Flush();
        }

        public override void Close()
        {
            // Seek to the beginning of the memory stream.
            _memStream.Seek(0, SeekOrigin.Begin);
            // All output has been written to the stream, so process the HTML.
            SgmlReader sgmlReader = new SgmlReader();
            StreamReader streamReader = new StreamReader(_memStream);
            XmlTextWriter xmlWriter = new XmlTextWriter(BaseStream, Encoding.UTF8);
            sgmlReader.CaseFolding = CaseFolding.None;
            sgmlReader.DocType = "HTML";
            sgmlReader.InputStream = streamReader;
            sgmlReader.Read();
            while (!sgmlReader.EOF)
                xmlWriter.WriteNode(sgmlReader, true);
            // Close the writer.
            xmlWriter.Flush();
            xmlWriter.Close();
            // Close the reader.
            streamReader.Close();
            sgmlReader.Close();
            // Close the base version.
            base.Close();
        }
    }
}
HttpFilterBase.cs
using System;
using System.IO;

namespace HTML2XHTML.HttpFilters
{
    public abstract class HttpFilterBase : Stream
    {
        private Stream _baseStream;
        private bool _closed;

        protected Stream BaseStream
        {
            get { return this._baseStream; }
        }

        public override bool CanRead
        {
            get { return false; }
        }

        public override bool CanWrite
        {
            get { return !_closed; }
        }

        public override bool CanSeek
        {
            get { return false; }
        }

        protected bool Closed
        {
            get { return _closed; }
        }

        public override long Length
        {
            get { throw new NotSupportedException(); }
        }

        public override long Position
        {
            get { throw new NotSupportedException(); }
            set { throw new NotSupportedException(); }
        }

        protected HttpFilterBase(Stream _baseStream)
        {
            this._baseStream = _baseStream;
            this._closed = false;
        }

        public override void Close()
        {
            if (!_closed)
            {
                _closed = true;
                _baseStream.Close();
            }
        }

        public override void Flush()
        {
            _baseStream.Flush();
        }

        public override int Read(byte[] buffer, int offset, int count)
        {
            throw new NotSupportedException();
        }

        public override long Seek(long offset, SeekOrigin origin)
        {
            throw new NotSupportedException();
        }

        public override void SetLength(long value)
        {
            throw new NotSupportedException();
        }
    }
}
Module entry in web.config
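A registration along the following lines is a reasonable sketch of such an entry, assuming the classic httpModules section under system.web. The type name comes from the namespace and class in the module code above; the assembly name after the comma (HTML2XHTML) is an assumption and should be replaced with the name of the assembly that actually contains the module.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- "HTML2XHTML" after the comma is an assumed assembly name. -->
      <add name="XhtmlFilterModule"
           type="HTML2XHTML.HttpFilters.XhtmlFilterModule, HTML2XHTML" />
    </httpModules>
  </system.web>
</configuration>
```

On IIS 7 and later running in integrated pipeline mode, modules are registered under system.webServer/modules instead, so the exact section depends on the hosting setup.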