Tuesday, October 20, 2009

Google Search Appliance (GSA) Sorting in Portal

At several of our clients, we have integrated the Google Search Appliance (GSA) into a portal. There are two approaches to accomplishing this integration:

1. Utilize GSA’s built-in ability to format the presentation logic via an XSLT stylesheet.

2. Utilize GSA’s ability to return straight XML.

Both approaches work well and can suit the needs of a portal. Option 1, however, will not work if you need to sort the entire result set before displaying it to users. The reasons are as follows:

1. GSA does not provide the ability to retrieve more than 100 results at a time.

2. GSA’s built-in sorting only sorts the first 100 results.

3. Sorting on anything other than date or relevance (e.g. metadata) requires some XSLT work, and it is still limited to sorting 100 records at a time.

Option 2 still has the limitation of fetching 100 records at a time, but you can sort it client side as requirements dictate. Our approach to accomplishing this typically involves the following:

1. Create client-side code that dynamically fetches the entire result set from GSA in blocks of 100 results, up to the maximum available.

2. Store the resulting composite XML in a cache region for a predetermined amount of time. The cache key and expiry time should be configurable so that they can be adjusted as needed.

3. After fetching and storing the results, sort them based on the client input.
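A minimal sketch of steps 1 and 3 is shown below; caching (step 2) is left out. The class name GsaBlockFetcher and the fetchBlock callback are hypothetical: in a real portal the callback would issue the GSA request for one block (as I understand it, via the start and num request parameters) and parse the returned XML.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;

/** Sketch: pull a full result set in fixed-size blocks, then sort it locally. */
class GsaBlockFetcher {
    // Per-request cap described above.
    static final int BLOCK_SIZE = 100;

    /**
     * fetchBlock is a hypothetical callback that takes (start, num) and
     * returns that block of results, or a shorter list at the end.
     */
    static <T> List<T> fetchAll(BiFunction<Integer, Integer, List<T>> fetchBlock,
                                int maxResults) {
        List<T> all = new ArrayList<>();
        for (int start = 0; start < maxResults; start += BLOCK_SIZE) {
            int num = Math.min(BLOCK_SIZE, maxResults - start);
            List<T> block = fetchBlock.apply(start, num);
            all.addAll(block);
            if (block.size() < num) {
                break; // a short block means there is nothing left to fetch
            }
        }
        return all;
    }

    /** Fetches everything, then sorts the combined list with the caller's ordering. */
    static <T> List<T> fetchAllSorted(BiFunction<Integer, Integer, List<T>> fetchBlock,
                                      int maxResults,
                                      Comparator<? super T> order) {
        List<T> all = fetchAll(fetchBlock, maxResults);
        all.sort(order);
        return all;
    }
}
```

The comparator is where the client's sort choice plugs in, for example a comparison on a metadata field of whatever result type the portal uses.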

Conclusion

Overall, Option 2 worked very well for us when the sorting requirements exceeded what GSA’s built-in mechanisms provide. The two challenges to keep in mind are the memory needed for caching and the time required to fetch the results in chunks. We found that the memory requirements rarely had an adverse impact on our portal, and that the fetch time was incurred only by the first requestor and was rarely noticeable.

Wednesday, September 16, 2009

Alfresco Integration with GSA

In order to provide searching within the portal, we had to define a strategy for integrating Alfresco with GSA. Two approaches were considered:

1. Utilize the traditional approach and have GSA crawl Alfresco through either a webscript mechanism or via CIFS.

2. Utilize the GSA Feed based approach.

After careful review, we decided on the feed-based approach for the following reasons:

1. Metadata: In order to support faceted searching, we needed a way to attach metadata to each content item. Given that our HTML content is just snippets without a header carrying this information, and that we are also indexing documents, the only way to reliably accomplish this was via the feed.

2. Portal Page: For each of the content portlets, we needed a way to identify the underlying portal page that a given content item lives on. The only way to reliably accomplish this was via the feed.

3. Security: GSA does support a late-binding approach to security, whereby on each search request a security check is performed against the underlying system to see whether the user has access. To support this, we would have had to pass the security credentials along to GSA and set up GSA to access Active Directory. Furthermore, we determined that the processing overhead of this check would slow down the search results. Therefore, we added security metadata to the feed and simply include it in the query string.

Please refer to the GSA documentation for details on how to develop a feed:

http://code.google.com/apis/searchappliance/documentation/50/feedsguide.html

http://code.google.com/apis/searchappliance/documentation/50/metadata.html

The following sections describe in some detail how this was accomplished for both the Alfresco DM and WCM content stores.

GSA Metadata

In order to support faceted searching, we needed to add our own custom metadata to a given content item. This was accomplished in part by defining a custom content model. Below are some references that we used to create it:

http://wiki.alfresco.com/wiki/Step-By-Step:_Creating_A_Custom_Model

http://ecmarchitect.com/archives/2007/06/09/756

Alfresco DM

Our feed from DM was straightforward: we created an Alfresco webscript that traversed the repository and produced XML in the format required by GSA. In addition to traversing the standard repository, the webscript traversed the archive store to obtain a list of files that should be deleted from the index.
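To give a feel for the output, here is a rough Java sketch that assembles a small feed with one add record carrying custom metadata and one delete record. The class GsaFeedBuilder and its methods are hypothetical and are not the webscript itself; the element and attribute names follow the feed format as I recall it from the feeds guide linked above, so please check them against that guide. The DOCTYPE line shown in the guide's examples is omitted here.

```java
import java.util.Map;

/** Sketch of assembling a GSA content feed; class and method names are hypothetical. */
class GsaFeedBuilder {
    private final String dataSource;
    private final StringBuilder records = new StringBuilder();

    GsaFeedBuilder(String dataSource) {
        this.dataSource = dataSource;
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Adds or updates one content item together with its custom metadata. */
    GsaFeedBuilder add(String url, String mimeType,
                       Map<String, String> metadata, String content) {
        records.append("<record url=\"").append(escape(url))
               .append("\" mimetype=\"").append(escape(mimeType))
               .append("\" action=\"add\">\n<metadata>\n");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            records.append("<meta name=\"").append(escape(e.getKey()))
                   .append("\" content=\"").append(escape(e.getValue()))
                   .append("\"/>\n");
        }
        records.append("</metadata>\n<content>").append(escape(content))
               .append("</content>\n</record>\n");
        return this;
    }

    /** Marks an item (for example one found in the archive store) for removal. */
    GsaFeedBuilder delete(String url) {
        records.append("<record url=\"").append(escape(url))
               .append("\" mimetype=\"text/plain\" action=\"delete\"/>\n");
        return this;
    }

    /** Wraps the collected records in the feed header and group elements. */
    String build() {
        return "<gsafeed>\n<header>\n<datasource>" + escape(dataSource)
             + "</datasource>\n<feedtype>incremental</feedtype>\n</header>\n<group>\n"
             + records + "</group>\n</gsafeed>\n";
    }
}
```

A caller would chain add and delete calls while walking the repository and the archive store, then hand the string returned by build() to whatever posts the feed.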

On the portal side, we utilized Quartz to create a scheduled process that executed this webscript, obtained the XML, and passed it along to GSA via the Feed URL.

Alfresco WCM

Our feed from WCM was nearly the same. We created an Alfresco webscript that traversed the repository and created the XML in the format required by GSA.

On the portal side, we utilized Quartz to create a scheduled process that executed this webscript. Once the response is returned from the webscript, we traverse each record within the results set and attempt to determine if the given content item has been mapped to a page within the portal. If it has been matched to one or more portal pages, then the custom metadata attribute pageId is set to the Portal Page ID. This allows the search client to generate the appropriate portal URL for display to the client.

Then we post the modified XML to GSA via the Feed URL.
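The pageId step can be sketched with the JDK's DOM API as follows. The class PageIdTagger and the pageIdsByUrl lookup are hypothetical stand-ins for our portal mapping logic, and the sketch assumes a feed without a DOCTYPE that the parser would need to resolve.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: add a pageId meta element to every feed record mapped to a portal page. */
class PageIdTagger {
    /** pageIdsByUrl is a hypothetical lookup from record URL to portal page IDs. */
    static String tag(String feedXml, Map<String, List<String>> pageIdsByUrl) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(feedXml)));
            NodeList records = doc.getElementsByTagName("record");
            for (int i = 0; i < records.getLength(); i++) {
                Element record = (Element) records.item(i);
                List<String> pageIds = pageIdsByUrl.get(record.getAttribute("url"));
                if (pageIds == null || pageIds.isEmpty()) {
                    continue; // not mapped to any portal page, leave untouched
                }
                // Reuse the record's metadata element, or create one up front.
                NodeList existing = record.getElementsByTagName("metadata");
                Element metadata;
                if (existing.getLength() > 0) {
                    metadata = (Element) existing.item(0);
                } else {
                    metadata = doc.createElement("metadata");
                    record.insertBefore(metadata, record.getFirstChild());
                }
                for (String pageId : pageIds) {
                    Element meta = doc.createElement("meta");
                    meta.setAttribute("name", "pageId");
                    meta.setAttribute("content", pageId);
                    metadata.appendChild(meta);
                }
            }
            // Serialize the modified document back to a string for posting.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not tag feed with page IDs", e);
        }
    }
}
```

Records without a mapping pass through unchanged, so the same code path can handle content that is not yet placed on any page.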

Please note, one limitation of WCM versus DM is that it does not have a mechanism to determine if a file has been removed. If this occurs, then you will have to manually construct a delete feed and push it to GSA.

Conclusion

Overall, this approach worked very well for us. The only issue we ran into was the 2048-character query string limit inherent in GSA. Our query was long because we had so many custom fields to query against when searching. To account for this, we send along only those fields that are required. In the end, this solution met the client’s business requirements and provided an effective search experience.

Tuesday, August 18, 2009

Alfresco Impersonation

On my current project, we are using Alfresco and working on an integration with JBoss Portal. In this case we were building a component that allowed for the browsing, uploading, moving, renaming, and deleting of files. We had built all the Alfresco Web Scripts to support these operations. In order to ensure the proper auditing of the changes, we needed to implement a WebScripts component that performed impersonation of the user that was executing the action. After some Google searching, we found the following common solution to the problem:

public String impersonate(String username) {

    String currentUser = AuthenticationUtil.getCurrentUserName();

    if (currentUser == null || !currentUser.equals(username)) {
        AuthenticationUtil.setCurrentUser(username);
    }

    return currentUser;
}

With this code the owner was set correctly, but the creator and modifier fields were not set to the username we were impersonating. The permission checks (via hasPermission) did behave correctly, in that they were evaluated against the impersonated user. After some searching through the Alfresco API, we changed our impersonation code as follows:

public String impersonate(String username) {
    String currentUser = AuthenticationUtil.getFullyAuthenticatedUser();

    if (currentUser == null || !currentUser.equals(username)) {
        // Set both, so that rule processing also runs as the impersonated user
        AuthenticationUtil.setRunAsUser(username);
        AuthenticationUtil.setFullyAuthenticatedUser(username);
    }

    // The previous user is returned so the caller can restore it afterwards
    return currentUser;
}

The original solution did not work for us because the Alfresco engine runs background processes on the content item to apply any rules that may be defined on the workspace. If you do not ensure that those background processes run in the impersonated user’s context, they will run as the system account. Reading the documentation for the new API calls is what made the light bulb go on in my head. setRunAsUser says that it switches to the given user for all authenticated operations, and setFullyAuthenticatedUser guarantees that the given user is set for all operations. The combination of these two API calls therefore guarantees that all operations run in the context of this user.

Friday, February 27, 2009

Alfresco Web Forms Integration - Mock JSF Faces Context

On my current project, we are using Alfresco and working on an integration with JBoss Portal. In particular, we were creating our own version of the Alfresco WebForms editor that is built into their web client. We had built all the Alfresco Web Scripts to fetch the appropriate WebForm for a given content item, and a colleague of mine had built all the portal magic to render and save the form in a fashion similar to Chiba. The last component I needed to build was the Webscript to generate the renditions of the web forms within Alfresco. I found the component within Alfresco that did this in AVMEditBean. In order to use it, though, I had to cut and paste the following lines:

if (services.getAVMService().hasAspect(nodeRef.getVersion(), nodeRef.getPath(), WCMAppModel.ASPECT_FORM_INSTANCE_DATA)) {
    this.regenerateRenditions(nodeRef);
}

private void regenerateRenditions(AVMNode node)
        throws FormNotFoundException {
    final String avmPath = node.getPath();
    final FormInstanceData fid = formsService.getFormInstanceData(-1, avmPath);
    // The element type is needed for the typed for-each loop below to compile
    final List<FormInstanceData.RegenerateResult> result = fid.regenerateRenditions();
    for (FormInstanceData.RegenerateResult rr : result) {
        if (rr.getException() != null) {
            Utils.addErrorMessage(
                "error regenerating rendition using "
                + rr.getRenderingEngineTemplate().getName() + ": "
                + rr.getException().getMessage(), rr.getException());
        }
    }
}

The trouble I ran into was that all the utility classes called by this snippet required a static instance of the current Alfresco FacesContext. My first instinct was to start pulling apart each of these utility classes and removing their dependence on the FacesContext. I had done this effectively in the past when I was mimicking the Submit All functionality, but this time it was not going as well. I seemed to keep digging myself a bigger and bigger hole of code that followed the Cut and Paste Anti-Pattern.

I was talking with my colleague Phil Kedy, and he suggested that we create a mock version of the FacesContext and initialize it with only the things that the Alfresco utility classes needed, mainly the Spring WebContext. He got to work on building the mock class and I figured out how to wire in the Spring context. Wiring in the Spring context was fairly easy: all Alfresco Webscripts are declared in a Spring context file, so I just had my Webscript implement the Spring interface ApplicationContextAware. Mr. Kedy figured out how to create the mock class, and all I had to add was the following:

MockFacesContext faces = new MockFacesContext((WebApplicationContext) ctx);

In conjunction with the MockFacesContext, he had to create two supporting classes. The first was a MockApplication class that implemented the method createValueBinding, which was necessary in order to process the EL syntax that Alfresco uses to locate the Alfresco services. The second was a MockValueBinding class to handle the actual lookup of those services. Finally, we had to load the message bundles to handle the word substitution.

Once all of this was done, we had a working version of the Web Forms upload and rendition generation that worked within the JBoss Portal.

Sunday, February 22, 2009

Posting JSON with Commons HTTPClient and XStream

I recently had an occasion where I had to perform an HTTP POST with JSON data from a Java service class as opposed to JavaScript. No amount of Google searching turned up the answer I was after. Here are the steps I took to do so:

Step 1 - Handle HTTP Post
The project I am working on was already using Commons HttpClient, which has a PostMethod class that performs an HTTP POST. Here is the code to set up the post:

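import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.URI;
import org.apache.commons.httpclient.methods.PostMethod;

HttpClient clientService = new HttpClient();
PostMethod post = new PostMethod();
// The second argument states whether the URL string is already escaped
post.setURI(new URI("http://yoururl", false));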

Step 2 - Find JSON Converter
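The best toolkit I found for handling JSON is a combination of XStream and Jettison. Following the XStream tutorial, I did the following (note that the JsonHierarchicalStreamDriver used here is XStream's own JSON writer; Jettison is only needed if you use JettisonMappedXmlDriver instead):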

// This ensures that we drop the root element
XStream xstream = new XStream(new JsonHierarchicalStreamDriver() {
    public HierarchicalStreamWriter createWriter(Writer writer) {
        return new JsonWriter(writer, JsonWriter.DROP_ROOT_MODE);
    }
});

xstream.setMode(XStream.NO_REFERENCES);

// Alias the class I want converted into JSON
xstream.alias("site", ProjectBean.class);

Step 3 - Post the JSON Stream
Next up is putting the two together and posting the JSON.

// Model bean to stream
ProjectBean site = new ProjectBean();

post.setRequestHeader("Content-Type", "application/json");

// apply content-length here if known (i.e. from proxied req)
// if this is not set, then the content will be buffered in memory
post.setRequestEntity(new StringRequestEntity(xstream.toXML(site), "application/json", null));

// execute the POST
int status = clientService.executeMethod(post);

// Check response code
if (status != HttpStatus.SC_OK) {
    throw new Exception("Received error status " + status);
}

Conclusion
That is all there is to it. I hope this helps.