Pre-Search Facets in MOSS 2007

Search facets offer a powerful entry point into data exploration, especially when data is categorized or tagged effectively.  With modern search and content retrieval mechanisms, such as the search engine in MOSS (and the new FAST search engine for SharePoint), the traditional method of browsing for content in SharePoint through static navigation hierarchies can now give way to a whole new approach.

The new thinking behind content storage is to put all artifacts in one large bucket, tag the artifacts, and then leverage search and facets to surface relevant content.  Think about how Google revolutionized mail by discarding the folder structure approach in favor of a search and label paradigm.

Anyone who has played with faceted searching in SharePoint probably knows that, aside from commercial tools like BA Insight, the only real free option is to use the Codeplex Faceted Additions.

The Codeplex offering assumes “post” search faceting, in that the web parts determine the relevant facet headings and counts based on the currently executed result set.  This approach works well for filtering search results and letting users drill down with restricted queries, but what about pre-facet browsing, similar to the functionality on sites like Best Buy?

Here is the problem: to provide the user with a dynamic tree view of hierarchical data based on facet categorization, the hierarchy generation method needs to know about all potential facet values ahead of time.  Take the following example:

An organization tags all its documents with a document type and a department.  Let’s assume we want to provide a dynamic list of departments from which the user can choose, and then a list of the document types available for the selected department.  After selecting the document type, we’d like the user to see all documents of that type that originated from the selected department.
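
To make the target concrete, the structure this example calls for can be pictured as a two-level map: department, then document type, then a hit count.  The sketch below only illustrates that shape and assumes the (department, document type) pairs have already been retrieved somehow; `FacetTree` and `Build` are hypothetical names of my own, not part of any SharePoint API.

```csharp
using System;
using System.Collections.Generic;

static class FacetTree
{
    // Hypothetical helper: groups (department, documentType) pairs into
    // department -> document type -> number of documents seen.
    public static SortedDictionary<string, SortedDictionary<string, int>> Build(
        IEnumerable<KeyValuePair<string, string>> rows)
    {
        var tree = new SortedDictionary<string, SortedDictionary<string, int>>(StringComparer.Ordinal);

        foreach (KeyValuePair<string, string> row in rows)
        {
            // First level: one entry per department.
            SortedDictionary<string, int> types;
            if (!tree.TryGetValue(row.Key, out types))
            {
                types = new SortedDictionary<string, int>(StringComparer.Ordinal);
                tree.Add(row.Key, types);
            }

            // Second level: count documents per document type.
            int count;
            types.TryGetValue(row.Value, out count); // count stays 0 when the type is new
            types[row.Value] = count + 1;
        }

        return tree;
    }
}
```

The keys of the outer dictionary would feed the department list, and the inner dictionary for the chosen department would feed the document type list.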

Aside from issuing a general search and then filtering the result set by department and document type, the Codeplex faceted search web parts do not appear to offer a mechanism for dynamic, table-of-contents-like behavior.

So I got to thinking: search facets in MOSS are no more than managed properties that exist in the search index.  Surely the object model must give me a way to query the distinct values of a given managed property?  It turns out that you can query the search API for this information, and with a little code magic you can obtain the desired results:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using Microsoft.Office.Server.Search;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Query;

namespace SearchFacets
{
    /// <summary>
    /// Faceted search querying.
    /// </summary>
    class Program
    {
        static readonly string SRCURL = "http://server/";

        /// <summary>
        /// Entry point.
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            using (SPSite srcSite = new SPSite(SRCURL))
            {
                // The inner double quotes around SCOPE must be escaped inside the C# string literal.
                string query = "SELECT DocumentType FROM Scope() WHERE \"SCOPE\" = 'My Scope Documents'";
                string[] values = GetDistinctSearchResults(srcSite, query, 1000);
                foreach (string s in values)
                    Console.WriteLine(s);
            }
        }

        /// <summary>
        /// Get some search results using full text.
        /// </summary>
        /// <param name="context">Site.</param>
        /// <param name="searchQuery">Query.</param>
        /// <param name="searchLimit">Limit results.</param>
        /// <returns>Results.</returns>
        static string[] GetDistinctSearchResults(SPSite context, string searchQuery, int searchLimit)
        {
            using (var fullTextQuery = new FullTextSqlQuery(context))
            {
                fullTextQuery.ResultTypes = ResultType.RelevantResults;
                fullTextQuery.QueryText = searchQuery;
                fullTextQuery.KeywordInclusion = KeywordInclusion.AnyKeyword;
                fullTextQuery.EnableStemming = false;
                fullTextQuery.TrimDuplicates = false;
                fullTextQuery.RowLimit = searchLimit;

                ResultTableCollection resultsCollection = fullTextQuery.Execute();
                ResultTable resultsTable = resultsCollection[ResultType.RelevantResults];
                return ReturnDistinct(resultsTable);
            }
        }

        /// <summary>
        /// Return distinct list.
        /// </summary>
        /// <param name="rtWins">Result set.</param>
        /// <returns>Distinct values.</returns>
        static string[] ReturnDistinct(ResultTable rtWins)
        {
            DataTable dtWins = null;
            Dictionary<String, int> pairs = new Dictionary<string, int>();
            List<String> lstWins = new List<string>();
            dtWins = new DataTable("dtWINS");
            dtWins.Load(rtWins);

            foreach (DataRow drWin in dtWins.Rows)
            {
                string fieldName = drWin[0].ToString();
                if (pairs.ContainsKey(fieldName))
                    pairs[fieldName]++;
                else
                    pairs.Add(fieldName, 1); // first occurrence counts as one hit
            }

            foreach (KeyValuePair<String, int> pair in pairs)
                lstWins.Add(String.Format("{0} ({1})", pair.Key, pair.Value));
            return lstWins.ToArray();
        }
    }
}

You might be thinking “Hey, you’re just executing a search”, and you’d be right.  Since the facet values (managed property values, which map to crawled properties) live in the search index, we have no choice but to perform a search to get at them.

The key in the above code is to limit the number of search results returned (1000 in the above case) and take advantage of relevancy.  In all likelihood, results beyond the first 1000 hits will not produce facet values that map to many results of value to the end user.

Clearly, the above code is just a starting point and has potential for many improvements, such as caching, making use of parent-child relationships, etc., but you get the idea…
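
As a rough sketch of two of those improvements, here is a small time-based cache keyed on query text, plus a helper that builds the child-level query for a selected department.  Both `FacetCache` and `FacetHelpers` are hypothetical names, `Department` is an assumed managed property name, and doubling single quotes to escape them in the literal is my assumption about the search SQL syntax, so treat this as an illustration rather than a finished implementation.

```csharp
using System;
using System.Collections.Generic;

static class FacetHelpers
{
    // Builds the child-level query: document types within one department.
    // "Department" is an assumed managed property name. Single quotes in the
    // values are doubled (assumed escaping rule) to keep the literals well-formed.
    public static string BuildChildQuery(string scope, string department)
    {
        return String.Format(
            "SELECT DocumentType FROM Scope() WHERE \"SCOPE\" = '{0}' AND Department = '{1}'",
            scope.Replace("'", "''"),
            department.Replace("'", "''"));
    }
}

// Caches facet values per query text for a fixed lifetime.
class FacetCache
{
    private class Entry
    {
        public string[] Values;
        public DateTime Expires;
    }

    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();
    private readonly object sync = new object();
    private readonly TimeSpan lifetime;
    private readonly Func<string, string[]> loader;

    // loader is whatever actually runs the search for a given query text.
    public FacetCache(TimeSpan lifetime, Func<string, string[]> loader)
    {
        this.lifetime = lifetime;
        this.loader = loader;
    }

    public string[] Get(string query)
    {
        lock (sync)
        {
            Entry entry;
            if (entries.TryGetValue(query, out entry) && entry.Expires > DateTime.UtcNow)
                return entry.Values; // still fresh, skip the search

            // Missing or expired: run the search once and remember the result.
            entry = new Entry { Values = loader(query), Expires = DateTime.UtcNow.Add(lifetime) };
            entries[query] = entry;
            return entry.Values;
        }
    }
}
```

In the console program above, the loader could simply be `q => GetDistinctSearchResults(srcSite, q, 1000)`, so that repeated requests for the same department only hit the index once per cache lifetime.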