Wednesday, September 16, 2009

Alfresco Integration with GSA

To provide search within the portal, we had to define a strategy for integrating Alfresco with the Google Search Appliance (GSA). We considered two approaches:

1. Utilize the traditional approach and have GSA crawl Alfresco through either a webscript mechanism or via CIFS.

2. Utilize the GSA feed-based approach.

After careful review, we decided on the feed-based approach for the following reasons:

1. Metadata: To support faceted search, we needed a way to attach metadata to each content item. Because our HTML is stored as snippets with no header to carry this information, and because we are also indexing documents, the only reliable way to accomplish this was via the feed.

2. Portal Page: For each of the content portlets, we needed a way to identify the underlying portal page that a given content item lives on. The only reliable way to accomplish this was via the feed.

3. Security: GSA supports a late-binding approach to security, in which each search request triggers a check against the underlying system to see whether the user has access. To support this, we would have had to pass the user's security credentials along to GSA and set up GSA to access Active Directory. We also determined that the processing overhead of this check would slow down the search results. Instead, we add security metadata to the feed and include the matching restriction in the query string.
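As a rough illustration of the query-side half of the security approach, the sketch below restricts results to documents tagged with one of the user's groups. It assumes the security metadata was fed under a hypothetical field named `allowedGroup` and that the restriction is passed in GSA's `requiredfields` search parameter; the post does not say which parameter or field names were actually used, and the host and group names are placeholders.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, user_groups, field="allowedGroup"):
    """Build a search URL limited to documents tagged with one of the user's groups.

    `field` is a hypothetical metadata name; use whatever name the feed emits.
    """
    # requiredfields takes name:value pairs; "|" between pairs means OR.
    restriction = "|".join(f"{field}:{group}" for group in user_groups)
    params = {"q": terms, "requiredfields": restriction}
    return f"{base_url}/search?{urlencode(params)}"


url = build_search_url("http://gsa.example.com", "expense policy", ["staff", "hr"])
print(url)
```

The exact encoding rules for metadata names and values in `requiredfields` should be checked against the GSA search protocol reference for the appliance version in use.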

Please refer to the GSA documentation for details on how to develop a feed:

http://code.google.com/apis/searchappliance/documentation/50/feedsguide.html

http://code.google.com/apis/searchappliance/documentation/50/metadata.html

The following sections describe in some detail how this is accomplished for both the Alfresco DM and WCM content stores.

GSA Metadata

To support faceted search, we needed to add our own custom metadata to each content item. This was accomplished in part by defining a custom content model. Below are some references that we used to create it:

http://wiki.alfresco.com/wiki/Step-By-Step:_Creating_A_Custom_Model

http://ecmarchitect.com/archives/2007/06/09/756

Alfresco DM

Our feed from DM was straightforward: we created an Alfresco webscript that traversed the repository and produced XML in the format required by GSA. In addition to traversing the standard repository, the webscript traversed the archive store to obtain a list of files that should be deleted from the index.
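For readers unfamiliar with the feed format, here is a minimal sketch of the kind of XML such a webscript has to emit. It is written in Python for brevity rather than as an Alfresco webscript, and the item dictionaries, metadata names, and URLs are illustrative assumptions. Only the element and attribute names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `metadata`, `meta`, `content`) come from the GSA feeds guide.

```python
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_feed(datasource, added, deleted_urls, feedtype="incremental"):
    """Return a GSA content feed with add records and delete records."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")

    # Live repository items: one "add" record each, with custom metadata.
    for item in added:
        record = ET.SubElement(group, "record", url=item["url"],
                               mimetype=item["mimetype"], action="add")
        metadata = ET.SubElement(record, "metadata")
        for name, value in item["metadata"].items():
            ET.SubElement(metadata, "meta", name=name, content=value)
        ET.SubElement(record, "content").text = item["content"]

    # Items found in the archive store: one "delete" record each.
    for url in deleted_urls:
        ET.SubElement(group, "record", url=url,
                      mimetype="text/plain", action="delete")

    body = ET.tostring(root, encoding="unicode")
    return f'<?xml version="1.0" encoding="UTF-8"?>\n{DOCTYPE}\n{body}'


feed_xml = build_feed(
    "alfresco_dm",
    [{"url": "http://portal.example.com/docs/1", "mimetype": "text/html",
      "metadata": {"category": "policy"}, "content": "<p>Travel policy</p>"}],
    ["http://portal.example.com/docs/2"],
)
print(feed_xml)
```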

On the portal side, we utilized Quartz to create a scheduled process that executed this webscript, obtained the XML, and passed it along to GSA via the Feed URL.
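The push step can be sketched as follows. Per the GSA feeds guide, a feed is submitted as a `multipart/form-data` POST to port 19900 at `/xmlfeed` with the form fields `feedtype`, `datasource`, and `data`. The function names and the host name here are hypothetical, the scheduling itself (Quartz in our case) is left out, and this is not the portal's actual code.

```python
import urllib.request
import uuid


def encode_multipart(fields):
    """Encode simple text fields as a multipart/form-data request body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def push_feed(appliance_host, datasource, feed_xml, feedtype="incremental"):
    """POST the feed XML to the appliance and return its response text."""
    body, content_type = encode_multipart(
        {"feedtype": feedtype, "datasource": datasource, "data": feed_xml}
    )
    request = urllib.request.Request(
        f"http://{appliance_host}:19900/xmlfeed",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

A scheduled job would fetch the webscript output and then call something like `push_feed("gsa.example.com", "alfresco_dm", feed_xml)`.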

Alfresco WCM

Our feed from WCM was nearly the same. We created an Alfresco webscript that traversed the repository and created the XML in the format required by GSA.

On the portal side, we utilized Quartz to create a scheduled process that executed this webscript. Once the webscript returns its response, we traverse each record in the result set and try to determine whether the given content item has been mapped to a page within the portal. If it has been mapped to one or more portal pages, the custom metadata attribute pageId is set to the Portal Page ID. This allows the search client to generate the appropriate portal URL to display to the user.

Then we post the modified XML to GSA via the Feed URL.
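The record-by-record step described above might look roughly like this. The `page_ids_by_url` lookup is a hypothetical stand-in for however the portal resolves content items to pages, and emitting one `pageId` entry per mapped page is an assumption about how multiple matches are represented.

```python
import xml.etree.ElementTree as ET


def add_page_ids(feed_xml, page_ids_by_url):
    """Add a pageId <meta> to every record whose URL maps to a portal page."""
    root = ET.fromstring(feed_xml)
    for record in root.iter("record"):
        page_ids = page_ids_by_url.get(record.get("url"), [])
        if not page_ids:
            continue  # not placed on any portal page; leave the record as-is
        metadata = record.find("metadata")
        if metadata is None:
            # Assumes the record has no <acl> child, which would have to
            # come before <metadata>.
            metadata = ET.Element("metadata")
            record.insert(0, metadata)
        for page_id in page_ids:
            ET.SubElement(metadata, "meta", name="pageId", content=str(page_id))
    # The parser drops the XML declaration and DOCTYPE, so the caller has to
    # prepend them again before posting the result to GSA.
    return ET.tostring(root, encoding="unicode")


feed = (
    '<gsafeed><group>'
    '<record url="http://portal.example.com/wcm/1" mimetype="text/html">'
    '<metadata/><content>x</content></record>'
    '<record url="http://portal.example.com/wcm/2" mimetype="text/html"/>'
    '</group></gsafeed>'
)
print(add_page_ids(feed, {"http://portal.example.com/wcm/1": ["home", "news"]}))
```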

Please note that one limitation of WCM compared with DM is that it has no mechanism for determining whether a file has been removed. When a file is removed, you have to manually construct a delete feed and push it to GSA.
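One way to take some of the manual work out of this, sketched here as an assumption rather than something we built, is to keep the list of URLs sent in the previous feed and compare it with the current one. Any URL that has disappeared gets a record with `action="delete"`. The function name and URLs are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def build_delete_feed(datasource, previous_urls, current_urls):
    """Build a feed that deletes every URL present last run but missing now."""
    removed = sorted(set(previous_urls) - set(current_urls))
    records = "\n".join(
        f'    <record url={quoteattr(url)} mimetype="text/plain" action="delete"/>'
        for url in removed
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
        "<gsafeed>\n"
        "  <header>\n"
        f"    <datasource>{escape(datasource)}</datasource>\n"
        "    <feedtype>incremental</feedtype>\n"
        "  </header>\n"
        "  <group>\n"
        f"{records}\n"
        "  </group>\n"
        "</gsafeed>\n"
    )


print(build_delete_feed(
    "alfresco_wcm",
    ["http://portal.example.com/wcm/1", "http://portal.example.com/wcm/2"],
    ["http://portal.example.com/wcm/2"],
))
```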

Conclusion

Overall, this approach worked very well for us. The only issue we ran into was the 2048-character query string limit inherent in GSA. Our query was long because we had so many custom fields to query against when searching. To account for this, we send along only those fields that are required. In the end, this solution met the client’s business requirements and provided an effective search experience.
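The workaround for the query string limit can be sketched like this: drop every filter that has no value before building the query, and fail loudly if the result is still too long. The filter names are hypothetical, and passing the filters through `requiredfields` (with `.` meaning AND) is an assumption about how the custom fields were queried.

```python
from urllib.parse import urlencode

MAX_QUERY_LENGTH = 2048  # the limit we ran into


def build_query_string(terms, filters):
    """Build a query string that includes only the filters actually in use."""
    active = {name: value for name, value in filters.items() if value}
    params = {"q": terms}
    if active:
        # "." between name:value pairs means AND.
        params["requiredfields"] = ".".join(
            f"{name}:{value}" for name, value in active.items()
        )
    query = urlencode(params)
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"query string is {len(query)} characters long")
    return query


print(build_query_string("budget", {"department": "finance", "docType": ""}))
```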

2 comments:

  1. Hi Ron

    Interesting article. We have a GSA here at my company and have been evaluating Alfresco. I like the application but I find the built in search to be kind of weak. The accuracy is pretty good but the displayed results are not user friendly - ie - no article previews or highlighted text snippets.

    We plan to use Alfresco as a research library and people have to be able to skim through many docs - a pretty thumbnail wont really help with that!

    Any ideas you might have would be great - or if your company engages in this sort of customization, I would be happy to discuss in greater detail.

    Thanks!

    MD

    markduffield A T GEE MAIL . COM

  2. Nice post! This is a very nice blog that I will definitively come back to more times this year! Thanks for informative post.
