I-Filters

Today I carved out a chunk of the day to work on I-Filters. I-Filters are COM
dynamic link libraries that convert known file types to text under
Windows XP/2K/2K3. The OS’s indexing service uses I-Filters to convert
PDF and Office file types to text so the indexer can tokenize the words
contained in files.

I wrote a test application that calls an I-Filter
library given a file name and converts it to text. The correct filter is determined by
examining the file extension and querying the registry (I-Filters are
registered with associated file extensions). My code works great with
Office documents but barfs when using Adobe’s 6.0 I-Filter.
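
The registry lookup described above can be sketched roughly as follows. This is a minimal sketch, not the code from my test application: the class name FilterLocator and the readDefault hook are hypothetical, and the key layout (extension, ProgID, CLSID, PersistentHandler, PersistentAddinsRegistered) is the commonly documented chain for filter registration, assumed here rather than taken from this post.

```csharp
using System;

// Sketch only: resolves the IFilter CLSID registered for a file extension.
// readDefault is a hypothetical hook that returns the default value of a key
// under HKEY_CLASSES_ROOT, or null when the key does not exist.
public static class FilterLocator
{
    // Interface ID of IFilter, used as a subkey name in the registry.
    private const string IID_IFilter = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    public static string FindFilterClsid(string extension, Func<string, string> readDefault)
    {
        // 1. Some extensions name their persistent handler directly.
        string handler = readDefault(extension + @"\PersistentHandler");

        if (handler == null)
        {
            // 2. Otherwise walk extension -> ProgID -> CLSID -> persistent handler.
            string progId = readDefault(extension);
            if (progId == null) return null;

            string clsid = readDefault(progId + @"\CLSID");
            if (clsid == null) return null;

            handler = readDefault(@"CLSID\" + clsid + @"\PersistentHandler");
            if (handler == null) return null;
        }

        // 3. The handler lists the filter class under the IFilter interface ID.
        return readDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IID_IFilter);
    }
}
```

On Windows the hook would typically be backed by Microsoft.Win32.Registry.ClassesRoot.OpenSubKey(key), returning GetValue("") as a string. Passing the reader in as a delegate keeps the lookup logic easy to exercise against a fake registry.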

Below is a synopsis of the method that does the work of invoking the
filter (leave a comment if you want the rest of the code). The CLSID is
the class ID of the filter, read from the registry.

(Apologies for no syntax highlighting)

private static string ExecuteFilter(string clsID, string sourceFile)
{
  string result = String.Empty;

  // Some filters are not reentrant, such as the Adobe PDF filter.
  lock (_lock)
  {
    object itfc = null;
    try
    {
      // Get the filter type from the CLSID.
      Type t = Type.GetTypeFromCLSID(new Guid(clsID));
      if (null != t)
      {
        // Get filter instance.
        itfc = Activator.CreateInstance(t);

        // Cast to IFilter, then to IPersistFile.
        IFilter ifilt = (IFilter)(itfc);
        System.Runtime.InteropServices.UCOMIPersistFile ipf =
           (System.Runtime.InteropServices.UCOMIPersistFile)(ifilt);

        // Load source.
        ipf.Load(sourceFile, 0);

        // Initialize.
        uint i = 0;
        int hr = 0;
        STAT_CHUNK chunk = new STAT_CHUNK();
        ifilt.Init(IFILTER_INIT.NONE, 0, null, ref i);

        // Read the text in chunks.
        StringBuilder masterBuffer = new StringBuilder();
        while (0 == hr)
        {
          // Read next chunk structure.
          try
          {
            hr = ifilt.GetChunk(out chunk);
          }
          catch (COMException ex)
          {
            // GetChunk will throw an exception
            // when there are no more chunks to read - tsk.
            if (FILTER_E_END_OF_CHUNKS == ex.ErrorCode)
              hr = ex.ErrorCode;
            else
              throw; // Rethrow without resetting the stack trace.
          }

          // If the chunk is text...
          if (0 == hr && CHUNKSTATE.CHUNK_TEXT == chunk.flags)
          {
            // Read text to buffer for as long as GetText reports success.
            int hr2 = 0;
            while (0 == hr2)
            {
              uint bufferSize = CHUNK_SIZE;
              StringBuilder buffer = new StringBuilder((int)bufferSize);
              hr2 = ifilt.GetText(ref bufferSize, buffer);

              // FILTER_S_LAST_TEXT still carries text; error codes do not.
              if (0 == hr2 || FILTER_S_LAST_TEXT == hr2)
                masterBuffer.Append(buffer.ToString(0, (int)bufferSize));
            }

            // Did we get an error?
            if (FILTER_E_NO_MORE_TEXT != hr2 && FILTER_S_LAST_TEXT != hr2)
              throw new Exception("Failed reading data from chunk!");
          }
        }

        // Assign result.
        result = masterBuffer.ToString();
      }
    }
    catch (Exception ex)
    {
      throw new FileLoadException("Failed to read data from filter!", ex);
    }
    finally
    {
      if (null != itfc)
        Marshal.ReleaseComObject(itfc);
    }
  }

  return result;
}
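
The method refers to several constants that are declared elsewhere in the class. A rough sketch of what they might look like is below. The HRESULT values are the ones I understand to be defined in the Windows SDK header filterr.h; the class name FilterCodes, the two helper methods, and the CHUNK_SIZE value are hypothetical and only there for illustration.

```csharp
using System;

// Sketch of the constants ExecuteFilter relies on. HRESULT values are assumed
// to match the Windows SDK header filterr.h; CHUNK_SIZE is a made-up value.
public static class FilterCodes
{
    public const int S_OK = 0;
    public const int FILTER_E_END_OF_CHUNKS = unchecked((int)0x80041700);
    public const int FILTER_E_NO_MORE_TEXT = unchecked((int)0x80041701);
    public const int FILTER_S_LAST_TEXT = 0x00041709;

    // Hypothetical buffer size, in characters, for each GetText call.
    public const uint CHUNK_SIZE = 4096;

    // True when a GetText call returned text that should be kept.
    public static bool HasText(int hr)
    {
        return hr == S_OK || hr == FILTER_S_LAST_TEXT;
    }

    // True when GetText should be called again for the current chunk.
    public static bool MoreText(int hr)
    {
        return hr == S_OK;
    }
}
```

Keeping "has text" and "call again" as separate checks makes the difference between FILTER_S_LAST_TEXT (text returned, stop) and FILTER_E_NO_MORE_TEXT (nothing returned, stop) explicit.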

29 thoughts on “I-Filters”

  1. Chris Hynes

    Been checking around and I found a couple of hints as to what might be going on. First of all, the Adobe IFilter is apartment threaded. I see a lot of people recommending changing its ThreadingModel registry key to Both. See this page for more: http://www.adobe.com/support/techdocs/327014.html.

    Also, I found this thread (http://sqljunkies.com/WebLog/acencini/articles/716.aspx) which has a bunch of info, as well as a recommendation to check whether the buffer returned was the right size… Can you check for the Adobe IFilter returning a smaller-sized chunk of text the last time and know that it has completed at that point?

  2. Rob Garrett

    Cool, thanks, I’ll check out these links.

    I tried commenting out some of the code to see what is causing the filter to go bad. It seems that just instantiating the object using Activator.CreateInstance(t) is enough to cause the exception after the object goes out of scope.

    I’ll investigate further…

  3. http://

    IFilters have been driving me insane for an embarrassing amount of time, so any code you have would be greatly appreciated. Other examples I’ve found mysteriously resolve to different IFilters for two files of the same type, and C#/Interop is not my strong point.

  4. Chris Anzalone

    Rob

    Thanks for posting the code. I just downloaded the Adobe IFilter 6.0 and plan to dig in. I would appreciate any other code you are willing to share on the subject as I am pretty inexperienced in this topic.

    Just to highlight that point… I’m guessing that the reason this works well with Microsoft Office documents is that you have Office (2003?) installed on the machine and the code is able to find the appropriate CLSID from the registry?

    Or is it that just having indexing service installed is enough to locate the appropriate IFilter for these documents?

    Anyway, thanks again.

  5. http://

    Rob

    Thanks for the post. I am trying to convert a folder with several document types (PDF, DOC…) to HTML, and I think your code could help me to do this.

    Raynald

  6. http://

    I downloaded IFilter 6.0 and am converting PDF files to text. However, when my application exits, it crashes with the following message:

    The instruction at "0X02e161b3" referenced memory at "0x038027b0". The memory could not be "read".

    Click OK to terminate the program. The popup window has the title: Font Capture: ConvertPDF.exe – Application Error

    I am using Visual Studio 2005 and my .NET version is 2.0.50215.

  7. shweta

    I am Shweta, working at Span Services. I am presently working on a project which is basically a search engine, so we are planning to implement IFilter.
    Can you please share with me how efficient IFilters are, what the known and unknown issues regarding IFilters are, and any other problems related to IFilters?

    Regards

    shweta

  8. Rob Garrett

    IMHO IFilters are not the best technology in the world. I find that each IFilter implementation is slightly different to the next, even though they support the same interface, so getting them to work can be a trial and error exercise. It also depends on what programming language you’re planning on using. IFilters are COM based, so there’s the headache of making COM interop calls from your application. C# does a great job, but if the threading model is not correct, some IFilters will blow up. Of course, if you’re concerned with efficiency and performance, COM will be a bottleneck for you.

    If you’re implementing a search engine, I would suggest IFilters for offline indexing: index in batch and store the indexed data in a repository, which can be searched faster online. Using IFilters for dynamic searching would not be efficient.

  9. Sandeep Krishna

    Hi Rob Garrett,

    I’m working on a search engine product. Here I even need to look into .mpp files (Microsoft Project Plan). We found IFilters to be the best way to extract text and are using the samples provided on this site, but could not locate a good IFilter for .mpp files. Very little work has been done on this file type. We found one named "msprofilt_w.dll", but it isn’t worth much and does not serve our purpose.

    Can you please help me out with the best IFilter for .mpp files if you have come across any?

    My email id is mentioned in the URL. It would be a great help from your side if you could suggest one.

  10. Rob Garrett

    Unfortunately, this is where I ran into problems also. I was able to write infrastructure code to make use of IFilters, but had a hard time finding IFilter DLLs for certain file types that existed and worked. I never came across a filter for MPP files. Sorry I could not be any further help.

  11. shweta

    Hi Rob

    Can you please share your experience of working with the IFilter DLLs?
    How efficient are the IFilters, and what issues did you face (if any)?
    Are there any constraints, or any other problems regarding memory, processing speed or a multithreaded environment?

    As I explained earlier, we are working on a search engine which is going to index GBs (500 GB) of data, so lots of threads, memory and processes will come into the picture. So please share your experience of working with IFilters.

  12. Rob Garrett

    Hi Shweta,
    This blog post, its comments, and any software constitute my sole experience with IFilters. Below is a summary with regard to your specific concerns:

    1. IFilters are NOT efficient, period. You have to consider the COM Interop and marshalling bottleneck, and you’re also at the mercy of whoever implemented the IFilter. If performance is key, then the best approach is to extract the text data from the file directly.

    2. Issues that I have experienced include (but are not limited to): bad performance, IFilters crashing (sometimes due to the threading model, sometimes not), some IFilters not adhering to the spec, and others requiring some registry hacking to get them to work. On a general level, not all filters return all data; the Visio filter only returns sparse text data.

    3. Memory and processing have not been a big issue for me (except for the fact that IFilters are not particularly fast). My colleagues and I wrote a thread-based system, which would convert files with IFilters concurrently. We limited the number of concurrent jobs to prevent memory and CPU usage explosion. Once again, memory and CPU usage is at the mercy of the implementation of the IFilter.

    4. Do you know what file types you’re planning on indexing? You’re indexing a lot of data. If the data type is pretty much static, then I would advise that you index the data in the files directly. (Data indexing can take a while, even without the overhead of IFilters and COM.) If you’re trying to index any and every file type on the system, then you may be limited in options, in which case the trade-off between performance and convenience may be better justified with IFilters.

  13. shweta

    Hi Rob

    Thanks for sharing your valuable web log with us.

    Actually we have to index 16 different formats, like:

    1. Text File
    2. HTML File
    3. PDF (Acrobat)
    4. RTF
    5. Word Perfect (Corel)
    6. Microsoft Word
    7. Microsoft Excel
    8. Microsoft Access
    9. Microsoft PowerPoint
    10. Microsoft Visio
    11. Microsoft Project
    12. OutLook PST (Microsoft)
    13. EML (Microsoft)
    15. NSF – Mail Client
    16. ZIP

    For nearly 12 of these formats we have IFilters. As I explained earlier, the database will be in GBs, so we have to write a thread-based system which reads a number of files concurrently. I also feel that IFilter behaviour changes frequently: sometimes it works fine and sometimes it crashes. The system we are designing will be totally automated, so this will become an issue.

    Can you help me out or suggest an alternative way to extract text from all these file types? We can’t read the files directly.

  14. sunny

    Hi Rob,

    I’m looking for an IFilter to fetch data from .mdb files. Please help me with this. If there is any other efficient way to fetch details from an .mdb file apart from IFilters, that would be a great help.

  15. Rob Garrett

    Hi sunny,
    I wish I could help you with extracting data from MDB files, however, my post is only concerned with using IFilters that exist, not with implementing them. I too have had the same problem finding an IFilter to extract data from MDB files.

    Bottom line: Microsoft don’t really want you snooping around in their proprietary application data files because you’d be able to read them from other non-MS applications.

    Good luck.

  16. http://

    Hi Rob,
    I am developing a web-based application in which I give registered users an option to upload files as attachments directly into a BLOB field of the database. Just before the upload occurs, I need to extract text from the files and store it as a column in the database. I searched a lot about IFilters and came up with lots of solutions. I just need to detect the extension of the file being uploaded (mostly all are Microsoft Office documents), hence I have to use OffFilt.dll, which is the IFilter for Microsoft Office files. Now I want to solve this without searching through the registry for the CLSID of OffFilt.dll, creating an instance, getting the chunks, and then getting the text and properties of the Office document. If there is a way to get it directly from the Indexing Server, I would be grateful.

    Thanks in advance
    Kunal Ramesh Lalwani
    Software Developer
    Fahm Softwares
    India

  17. shweta

    Hi kunal,

    If you want to know which IFilter to load for a particular type of file, there is no need, because IFilter itself internally identifies (LoadIFilter) which filter it should load for a particular type of file.

  18. http://

    Hi Rob

    Can you put up some code on how to read the properties of a document?
    I guess the code has to be extended like this:

    else if (0 == hr && CHUNKSTATE.CHUNK_VALUE == chunk.flags)
    {
      // Need complete code to do ifilt.GetValue()
    }

  19. http://

    Hi.
    I am trying to extract text from Excel documents using the Microsoft IFilter (OFFFilt.dll).

    My problem is with the extraction order: the extracted text is not in the order it appears in the spreadsheet (down then over / over then down). I think it is related to the time the cells were created/modified.

    Can I control that extraction order?
    Thanx, Alon.

  20. http://

    I have gone through your article. Can anybody help me by providing sample code for extracting text from a given file? When I try the above code I get an error for files with extensions TIFF, PUB, etc. For PDF files, extraction throws an error after the text has been retrieved.

    I have seen somebody by the name of swetha working on a similar sort of application to the one I am currently working on. So please can you provide me with the code samples which you people have developed.

    Help me my friends. I am in deep, deep trouble completing this task.
    I am very poor at COM Interop and marshalling.

    Please help me.
    Thank you in advance for your help.

    My mail id is gvkraj23@gmail.com

  21. shweta

    Hi rob,

    Have you faced this kind of problem before, where the HTML IFilter is not able to extract the text inside a JavaScript tag, such as whatever is in document.write()?

    Let me know if you have some solution.

    Thanks
    shweta

  22. Swati Sinha

    Hi everyone!

    I am developing a search engine application in ASP.NET by calling MS’s indexing service. This works fine for Office docs and HTML files, but for .pdf and .zip files I guess we’ve got to download the IFilters and integrate them with our documents.

    I am new to this world of IFilters and just followed the instructions given at the following link:
    http://www.developerland.com/dotnet/enterprise/382.aspx

    I added the DLL to the registry and did everything that was given, but my application doesn’t seem to work.

    Can somebody please help?

    cheers,
    swati

  23. http://

    Hi everyone,

    I am working on IFilter to extract properties of files.
    I ran into a problem with IFilter. Using IFilter I can extract properties of an MS 2007 document (e.g. *.docx), but I could not read any property of an MS 2003 document (e.g. *.doc). This may be because of their different file formats. For *.doc, the getChunk() function never returns Chunk_Value. Maybe it cannot read the chunk value of *.doc because it is in OLE compound document format.

    After reading the MSDN site, I think we should use the IPropertySetStorage and IPropertyStorage interfaces to get OLE document properties. Or do you have another solution?

    I have asked this question in many forums and blogs but have not got an exact solution.
    If you have faced the same problem before, please share with us. It would be a very valuable answer.

    If you need any further information, please let me know.

    Thanks
    Prakash

