Social Content Management: Activiti + Alfresco One + Mule

Once upon a time, Social Content Management and Social Business Systems were all the rage. And by once upon a time, I really mean 2010-2011. This pretty much equates to a bajillion years ago on Bay Area timelines. I suppose we needed something to bridge the gap left by famed hype words of the time like Enterprise 2.0. So why bother digging up the past? Well, I don't think any of the major ECM vendors truly executed on Social Content Management.

This is partially due to the maturity of social media providers at the time and the way consumers leveraged their platforms. Early on, social media was used to connect with like-minded people and to chronicle your daily adventures, thoughts, opinions, useless food pics, and the ubiquitous selfie...with or without the pouty duck face. Nowadays, social media is used to inspire entire movements and drive social change. What makes this possible is the absence of confines and the lack of borders or controls. However, there are instances in which we need to create a snapshot in time and capture these social moments for compliance purposes or to drive business processes. To do this, we're going to integrate best-of-breed open platforms such as Activiti, Alfresco One, and Mule to create a supercharged Social Content Management solution.

Best of breed = Jet Propulsion + Olympic Swimmer

Let's talk about the type of social content we want to capture. You remember all of those embarrassing moments when you drunkenly tweeted or posted something on Facebook? And for whatever reason, in your drunken haze the next morning, you thought it was best to expunge all of the evidence? How are you ever going to relive those infamous moments?! Well rest assured, because I'm happy to tell you that we can archive all of those precious moments in a DoD 5015.2 compliant system using Alfresco One and its Records Management module!!!

TRUE RECORDS MANAGEMENT > TRUE BLOOD

To capture all of this embarrassing social content, first we need to create a collaboration site in Alfresco One (or Alfresco Community) and prep it with some folder structure. This is where we're going to dump all of our social content from Twitter and manage it accordingly. 

 

Now that we have a place to manage our social content, we'll leverage the Mule ESB and the Anypoint Studio graphical design environment to create a flow to capture it. Below is a snapshot of our overall Mule flows, which consist of an HTTP connector, a Twitter connector, a For Each scope, a CMIS connector, and a few transformers sprinkled in between.

Going with the flow

Let's begin by breaking the flow down piece by piece. The HTTP connector is responsible for initiating the flow by listening for HTTP requests, and it will also return an HTTP response after the flow has ended. Below is a screenshot of the HTTP connector configuration. The important elements to take notice of are the host and port global properties.

After we define our HTTP connection properties, we'll want to provide a specific URL path so that we can call the flow from an HTTP client and uniquely identify the endpoint from any others that may be running on Mule.
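
To make this concrete, here's a rough sketch of what the listener configuration looks like in the underlying flow XML. This is a hedged example rather than the exact project config: the global property names (http.host, http.port) and the flow name are assumptions, while the /twitter-archive path matches the suffix we'll call from Activiti later on.

<http:listener-config name="HTTP_Listener_Configuration"
                      host="${http.host}" port="${http.port}"/>

<flow name="twitter-archive-flow">
    <!-- Kicks off the flow for requests to /twitter-archive and returns an HTTP response when the flow ends -->
    <http:listener config-ref="HTTP_Listener_Configuration" path="/twitter-archive"/>
    <!-- Twitter search, For Each, and CMIS steps go here -->
</flow>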

Now let's tackle the Twitter step. Before we continue our configuration in Anypoint, we'll need to head over to https://apps.twitter.com/ to create a new app. Once it's created, switch over to the Keys and Access Tokens tab and retrieve the consumer key, consumer secret, access token, and access token secret.

Now we'll take the consumer and access keys and secrets and drop them into our Twitter Connection Configuration. Note that you can put the values in the connection configuration directly OR you can place them in your mule-app.properties file and simply use a placeholder expression to refer to those properties. 

In the Twitter step configuration, we'll need to choose the Search operation and use the #[json:hashtag] expression to populate the Query field. This expression pulls the "hashtag" JSON property from the incoming HTTP request. One thing to note: if you take a closer look at the available fields in the Search operation and compare them to the GET search/tweets Twitter documentation, the Count parameter is completely missing! This means that your search result count will always be limited to the default of 15 tweets. So so so lame. Someone has already created an issue on the Twitter connector GitHub project here, but it was created about a year ago and hasn't received any love. If I'm feeling bored over the holidays, I'll update the Twitter connector to include the Count parameter and bump it to the latest and greatest twitter4j version.
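
For reference, the Twitter configuration and Search operation end up looking roughly like the snippet below, with the credentials pulled from mule-app.properties via placeholders. Treat this as a sketch: the ${twitter.*} property key names are assumptions, while #[json:hashtag] is the expression described above.

<twitter:config name="Twitter"
                consumerKey="${twitter.consumerKey}"
                consumerSecret="${twitter.consumerSecret}"
                accessKey="${twitter.accessKey}"
                accessSecret="${twitter.accessSecret}"/>

<!-- Searches Twitter using the hashtag posted in the incoming JSON request -->
<twitter:search config-ref="Twitter" query="#[json:hashtag]"/>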

Where's the beef? I mean the count variable...

Once we execute a search, we'll want to loop through all of the search results and perform a series of operations. To accomplish this in our Mule flow, we'll need to leverage a For Each scope. In that step, we need to specify a collection to iterate over: the payload from the Twitter search operation, specifically the "tweets" JSON array element. Lastly, inside the For Each scope, we're simply going to call our "create-twitter-content-flow" using a flow-ref step.

In the "create-twitter-content-flow," we'll need a Set Property transformer step which will set outbound properties to be used in subsequent steps in the flow. Specifically, we're going to leverage the twitter handle or screenName property, the unique id for the tweet, and the tweet text. Alternatively, you could use the Transform Message step which leverages the new DataWeave language and template engine.

The next step in the flow that we need to configure is a CMIS connector. In the CMIS Connection configuration, provide the ATOM binding CMIS URL, username, password, and select ATOM for the Endpoint property. Dropping back into the CMIS connector step, we're going to choose the "Get or create folder by path" operation and we're going to use "/Sites/yourSiteShortNameHere/My/Folder/Path/#[message.outboundProperties.screenName]" as the Folder Path value. This will dynamically create a new folder or retrieve an existing folder based on the predefined path and twitter handle.

The last step in our Mule flow will be another CMIS connector step that will create content based on the text from a tweet. To accomplish this, we need to choose the "Create document by path" operation, set the filename to be the tweetId, set the folder path of the folder we created in the previous step, set the Content Reference to the tweetContent property, set the Mime Type to text/plain, set the Object Type to cmis:document, and finally set the Versioning state to MAJOR. 
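
As a rough sketch, the two CMIS steps come out looking something like the snippet below. The element and attribute names are inferred from the operation labels in Anypoint Studio and may not match the connector schema exactly, and the CMIS URL and credential placeholders are assumptions you'd swap for your own environment.

<cmis:config name="CMIS" username="${cmis.username}" password="${cmis.password}"
             baseUrl="http://localhost:8080/alfresco/cmisatom" endpoint="ATOM"/>

<!-- Creates (or fetches) a folder per Twitter handle -->
<cmis:get-or-create-folder-by-path config-ref="CMIS"
    folderPath="/Sites/yourSiteShortNameHere/My/Folder/Path/#[message.outboundProperties.screenName]"/>

<!-- Stores the tweet text as a plain-text cmis:document named after the tweet id -->
<cmis:create-document-by-path config-ref="CMIS"
    filename="#[message.outboundProperties.tweetId]"
    folderPath="/Sites/yourSiteShortNameHere/My/Folder/Path/#[message.outboundProperties.screenName]"
    content-ref="#[message.outboundProperties.tweetContent]"
    mimeType="text/plain"
    objectType="cmis:document"
    versioningState="MAJOR"/>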

For simplicity, I've decided not to build any additional logic to detect whether an existing piece of content with the exact same name is located in the defined folder path. So keep in mind that if you run this twice in a row, it will most certainly fail with an exception stating that content with the same name already exists in that folder. But this is just a demo, so don't even worry about it!

Now that we've completed our Mule flow, let's switch over to our Activiti BPM Suite instance to build a simple process to orchestrate our Mule flow. Before we build the process, we need to define an endpoint to connect to our Mule flow. To do this, navigate to the Identity Management app, switch to the tenants page, and then go to the Endpoints tab. From there, just add a new endpoint with the host and port that was configured in the Mule flow's HTTP connector.


Our endpoint has been created, so now let's go back to our dashboard and jump into the Kickstart application. The process we want to create can be built in either the BPMN editor or the more simplified Step Editor. Let's keep it simple and create a Step process called "Social Media Archive." The first thing we're presented with is a simple Process start step, where we can choose to either start the process by a user without any input or start it by a user filling in a form. We're going to choose the latter, and since we do not have a form for this particular use case in our forms library, we'll want to create a brand new form. Give your form a simple name and click create. In this form, we only need one field to capture the Twitter search text, so drag and drop the Text control from the left onto the canvas on the right-hand side. Next, provide a name by clicking on the pencil icon, and finally click the form's save button.


The first step in our process design is complete, so let's move on to the most important step. Click the plus icon after the start step and choose the REST Call step. In the Details tab, provide a nice name that you're proud of and switch over to the Endpoint tab. Here, we'll choose POST as our HTTP Method, use the endpoint we created earlier as the Base Endpoint, and use "twitter-archive" as the REST URL value. The REST URL suffix must correspond with the path we configured in our Mule flow's HTTP connector step. Switching over to the Request tab, we only need to add one JSON property to send along with our payload. We'll pull the hashtag value in from the form field created in the start form, and we want to make sure we use the "hashtag" JSON property name that corresponds with the Mule flow. Last, but not least, in the Response tab let's map the JSON values for "query", "count", and "completedIn" to process variables.


The final step in our Social Media Archive process will be a simple human step to display the response from Twitter and the Mule flow. Click the plus icon after the REST Call step and provide a simple name in the Details tab. Switch to the Form tab and create a new form. Provide a name for your new form and drag and drop three Display Value form controls onto the canvas. Click the pencil icon next to each of the Display Value controls; instead of displaying a form field, switch to the Variable tab and choose one of the JSON process variables created in the REST Call step. Finally, validate the process and save it.


The last piece of configuration for our solution is to create a process application to encapsulate our Social Media Archive process. After navigating to the Apps page within the Kickstart app, click the Create App button and provide a name for your new process app. On the app definition page, you can configure a specific icon and color for the app. The most important piece of the app definition is to include your process. After you've included the Social Media Archive process, click the save button, choose to publish the application, and then save and close the editor. Finally, back on our dashboard, click the plus icon and add your newly created process application.


Now we can navigate to our Social Content Management app and start our newly created process. Provide a hashtag and click the Start Process button. If everything worked correctly, you should be able to refresh the browser after a few seconds and find a review step assigned to yourself. Navigate to the review step and check out the results.

There's only one thing left to do, so head back to Alfresco One's Share UI and check out the dynamically created folder structure based on Twitter handles, with the tweets stored inside each folder.

To recap, we've created a Social Content Management solution by leveraging Mule to pull content, transform data, and push it into Alfresco One. We also created a process in Activiti to orchestrate our Mule flow. By themselves, Activiti, Alfresco One, and Mule are extremely powerful, but integrating all three together is on a whole other level.

Pelvis pong

Also check out the following YouTube video to see all of this working in real-time.

The source for the Mule project and the Activiti process application can be found below: 

Mule Project

Activiti Process Application


- Keep Calm and Nerd On!!!

Alfresco Version Pruning Behaviour....Mmmm Prunes

This is the first post of a multi-part series for an umbrella project I might be calling "alfresco-content-control." It's a working title. In this particular post, we'll demonstrate how you can limit your version history and prune the unneeded versions. I racked my brain trying to think of a cooler theme for this post, but the best I can come up with is...the elderly and their love for prunes. Mainly for the digestive benefits...and possibly for the taste. Neither of which is related to enterprise content management...but we'll just go with it.

I should say that this problem has already been tackled before by one of my fellow Alfresco colleagues, Jared Ottley, who blogged about it in his Max Version Policy post. His project on GitHub was an amazing foundation and actually provided the bulk of the logic needed for a prospective customer of mine. I just needed to turn the hearing aid up to 11 to cover some of the following requirements that were missing....

  • Expose the max version property in Alfresco Share through content model properties.
  • Provide the ability to dynamically apply the version pruning behavior to individual pieces of content.
  • Provide the ability to keep the root version if needed. This, however, can be a slippery slope, since you may want to mark specific versions as ones to keep permanently. We're not going to tackle that. At least not for now.

So with our list of enhanced requirements, I had enough information to take the project to the next level. Boom! Crack open a can of PBR because we're halfway there!

The first thing we had to do was create a super simple content model with an aspect, or property group, called "Version Prunable." Essentially, our goal was to take the properties that were initially stored in a file on the classpath and elevate them into a content model. That included the existing property for the maximum version count and an additional boolean property to provide the flexibility to keep the root version or not. By moving these into a content model, we can easily expose the properties and their associated values in Share.

version-pruning-model.xml:

<?xml version="1.0" encoding="UTF-8"?>
<model name="prune:versionPruningModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
    <description>Version Pruning Content Model</description>
    <author>Kyle Adams</author>
    <version>1.0</version>

    <imports>
        <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d" />
        <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm" />
    </imports>

    <namespaces>
        <namespace uri="http://www.alfresco.org/model/extension/version-pruning/1.0" prefix="prune" />
    </namespaces>

    <aspects>
        <!-- Version Prunable Aspect -->
        <aspect name="prune:versionPrunable">
            <title>Version Prunable</title>
            <properties>
                <property name="prune:maxVersionCount">
                    <title>Max Version Count</title>
                    <description>Max Version Count</description>
                    <type>d:int</type>
                    <default>-1</default>
                </property>
                <property name="prune:keepRootVersion">
                    <title>Keep Root Version</title>
                    <description>Keep Root Version</description>
                    <type>d:boolean</type>
                </property>
            </properties>
        </aspect>
    </aspects>
</model>

Then we had to make some small modifications to the behavior (or behaviour if you think that's more proper). I feel like I should be drinking my PBR with my pinky turned up when saying behaviouuuuur.

Anyways, this included adding logic to pull in property values from a given content node and an if statement to evaluate whether the Version Prunable aspect has been applied to the node. Another minor addition was the if/then/else block to evaluate whether we need to delete the root version or the successor of the root version within the VersionHistory.

Here's the VersionPruningBehaviour.java implementation (BTW Squarespace has horrible syntax highlighting support for Java code blocks...sorry?)

package org.alfresco.extension.version.pruning.behaviour;

import org.alfresco.extension.version.pruning.model.VersionPruningContentModel;
import org.alfresco.model.ContentModel;
import org.alfresco.repo.policy.Behaviour;
import org.alfresco.repo.policy.JavaBehaviour;
import org.alfresco.repo.policy.PolicyComponent;
import org.alfresco.repo.version.VersionServicePolicies;
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.alfresco.service.cmr.version.Version;
import org.alfresco.service.cmr.version.VersionHistory;
import org.alfresco.service.cmr.version.VersionService;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import java.util.Collection;

/**
 * Created by kadams on 7/14/15.
 */
public class VersionPruningBehaviour implements VersionServicePolicies.AfterCreateVersionPolicy {
    private static final Log logger = LogFactory.getLog(VersionPruningBehaviour.class);

    private ServiceRegistry serviceRegistry;
    private PolicyComponent policyComponent;
    private NodeService nodeService;
    private VersionService versionService;

    public void init(){
        this.policyComponent.bindClassBehaviour(
                VersionServicePolicies.AfterCreateVersionPolicy.QNAME,
                ContentModel.TYPE_CONTENT,
                new JavaBehaviour(this, "afterCreateVersion", Behaviour.NotificationFrequency.TRANSACTION_COMMIT));

        this.nodeService = this.serviceRegistry.getNodeService();
        this.versionService = this.serviceRegistry.getVersionService();

    }

    @Override
    public void afterCreateVersion(NodeRef versionedNodeRef, Version version) {

        try {
            if(this.nodeService.hasAspect(versionedNodeRef, VersionPruningContentModel.ASPECT_VERSION_PRUNABLE)) {
                VersionHistory versionHistory = this.versionService.getVersionHistory(versionedNodeRef);
                if(versionHistory != null){
                    // Read the pruning settings from the node itself (kept as locals; the behaviour is a shared singleton)
                    Boolean keepRootVersion = (Boolean) this.nodeService.getProperty(versionedNodeRef, VersionPruningContentModel.PROP_KEEP_ROOT_VERSION);
                    Integer maxVersionCount = (Integer) this.nodeService.getProperty(versionedNodeRef, VersionPruningContentModel.PROP_MAX_VERSION_COUNT);
                    // Guard against unset properties to avoid auto-unboxing NPEs
                    if (keepRootVersion == null) {
                        keepRootVersion = Boolean.FALSE;
                    }
                    if (maxVersionCount == null) {
                        maxVersionCount = -1;
                    }

                    if(maxVersionCount > 0){
                        while(versionHistory.getAllVersions().size() > maxVersionCount){
                            Version versionToBeDeleted = null;
                            if(keepRootVersion) {
                                 versionToBeDeleted = versionHistory.getSuccessors(versionHistory.getRootVersion()).iterator().next();
                            }
                            else{
                                versionToBeDeleted = versionHistory.getRootVersion();
                            }

                            if(logger.isDebugEnabled()){
                                logger.debug("Max Version Count: " + maxVersionCount);
                                logger.debug("Keep Root Version? " + keepRootVersion);
                                logger.debug("Current version history collection size: " + versionHistory.getAllVersions().size());
                                logger.debug("Preparing to remove version: " + versionToBeDeleted.getVersionLabel() + " type: " + versionToBeDeleted.getVersionType());
                            }
                            this.versionService.deleteVersion(versionedNodeRef, versionToBeDeleted);
                            versionHistory = this.versionService.getVersionHistory(versionedNodeRef);
                        }
                    }
                }
                else{
                    if(logger.isDebugEnabled()){
                        logger.debug("No version history found!");
                    }
                }
            }
        } catch (Exception e) {
            // Log the failure instead of dumping the stack trace to stderr
            logger.error("Version pruning failed for node: " + versionedNodeRef, e);
        }
    }

    public void setServiceRegistry(ServiceRegistry serviceRegistry) {
        this.serviceRegistry = serviceRegistry;
    }
    public void setPolicyComponent(PolicyComponent policyComponent) {
        this.policyComponent = policyComponent;
    }
}
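
If you're wiring the behaviour up yourself rather than grabbing the project, the bean definition is minimal: inject the service registry and the policy component and let Spring call init. A rough sketch of the wiring, assuming the package above (the bean id and the context file it lives in are up to you):

<bean id="versionPruningBehaviour"
      class="org.alfresco.extension.version.pruning.behaviour.VersionPruningBehaviour"
      init-method="init">
    <!-- Public ServiceRegistry facade plus the policy component used to bind afterCreateVersion -->
    <property name="serviceRegistry" ref="ServiceRegistry"/>
    <property name="policyComponent" ref="policyComponent"/>
</bean>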

ROCK! Now we have everything in place to keep our VersionHistory from getting out of hand. I should, however, provide an alternative. The latest version of Alfresco's DoD 5015.2-certified Records Management module supports the ability to retain and subsequently destroy individual versions of a work-in-progress document. That module is a far better approach for content that is regulatory in nature. If you're not managing regulatory content in Alfresco, though, the RM module might be more horsepower than you need. So let's take a quick tour of the version pruning behaviour functionality. Demo........ENGAGE!

So after leveraging a good base from Jared Ottley's Max Version Policy project, we were able to expose version pruning behaviour configuration pretty easily in Share. You can access the source for the alfresco-content-control umbrella project here on GitHub.

- Keep Calm and Grandpa On

Oldie but a Goldie: XPath Metadata Extraction

It's been quite some time since I've written an Alfresco blog post, but I finally decided to commit to being more active in the Alfresco community through blogs and other activities. Since I'm dusting off the old blog and starting anew, I thought it would be fitting to dust off an old Alfresco feature that can still prove to be useful. 

When thinking about a theme for the post, the first thing that came to mind was...well...hipsters...

 

Hipsters often revitalize old trends and from the picture above we find that some trends are better than others. As a Principal Solutions Engineer, I've found one vintage Alfresco feature that has proven to be useful in more than one of my pre-sales opportunities with prospective customers. If we harken back to the Alfresco 2.x days, the now defunct WCM AVM product had a cool feature to perform XPath Metadata Extraction. Essentially this was used to map XML content to AVM Web Forms that would later be used in a presentation layer of your choice. While AVM has been laid to rest, the XPathMetadataExtracter class lives on as a core repository capability. 

Now does this vintage and possibly hipster Alfresco feature deserve to be brought back into the mainstream? I say, absolutely! This is actually a very common pattern being used by lots of organizations for processing anything from financial statements to technical publications (DITA).

So let's actually dive in with an example in which we'll manage hipster or indie rock artist XML content within Alfresco. DISCLAIMER: I actually have very hipster taste in music...so don't judge too harshly ;)

The first thing we'll need is a hipster content model to manage our artists. In the model below, you'll see that it's a pretty simple model with text and date properties, a few of which are multi-value properties.

hipster-model.xml:

<?xml version="1.0" encoding="UTF-8"?>
<model name="hip:hipsterModel" xmlns="http://www.alfresco.org/model/dictionary/1.0">
  
  <description>Hipster Content Model</description>
  <author>Kyle Adams</author>
  <version>1.0</version>
  
  <imports>
    <import uri="http://www.alfresco.org/model/dictionary/1.0" prefix="d" />
    <import uri="http://www.alfresco.org/model/content/1.0" prefix="cm" />
  </imports>
  
  <namespaces>
    <namespace uri="http://www.massnerder.io/model/1.0" prefix="hip" />
  </namespaces>
  <constraints>
    <!-- Indie Genre Constant -->
    <constraint name="hip:genreConst" type="LIST">
      <parameter name="allowedValues">
        <list>
          <value>Emo</value>
          <value>Garage Rock</value>
          <value>Hardcore</value>
          <value>Indie Americana</value>
          <value>Indie Doo-Wop</value>
          <value>Indie Folk</value>
          <value>Indie Pop</value>
          <value>Indietronica</value>
          <value>Lo-fi</value>
          <value>Nu-hula</value>
          <value>Pop Punk</value>
          <value>Post Hardcore</value>
          <value>Surf Rock</value>
        </list>
      </parameter>
    </constraint>
  </constraints>
  
  <!-- Content Types -->
  <types>
    <!-- Artist Type -->
    <type name="hip:artist">
      <title>Artist</title>
      <parent>cm:content</parent>
      <properties>
        <property name="hip:artistName">
          <title>Artist Name</title>
          <type>d:text</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
          </index>
        </property>
        <property name="hip:label">
          <title>Label</title>
          <type>d:text</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
          </index>
        </property>
        <property name="hip:origin">
          <title>Origin</title>
          <type>d:text</type>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
          </index>
        </property>
        <property name="hip:genres">
          <title>Genres</title>
          <type>d:text</type>
          <multiple>true</multiple>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
          </index>
          <constraints>
            <constraint ref="hip:genreConst"/>
          </constraints>
        </property>
        <property name="hip:members">
          <title>Members</title>
          <type>d:text</type>
          <multiple>true</multiple>
          <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>false</tokenised>
          </index>
        </property>
        <property name="hip:formed">
          <title>Date Formed</title>
          <type>d:date</type>
        </property>
        <property name="hip:disbanded">
          <title>Date Disbanded</title>
          <type>d:date</type>
        </property>
      </properties>
    </type>
  </types>
</model>
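
If you haven't bootstrapped a custom model before, here's a minimal sketch of the Spring bean that registers the model above with the data dictionary. The bean id and classpath location below are placeholders; adjust them to match your module layout.

<bean id="massnerder.hipsterModelBootstrap"
      parent="dictionaryModelBootstrap"
      depends-on="dictionaryBootstrap">
    <property name="models">
        <list>
            <!-- Placeholder path; point this at wherever hipster-model.xml lives in your module -->
            <value>alfresco/module/massnerder-blog-xpath-metadata-extraction/model/hipster-model.xml</value>
        </list>
    </property>
</bean>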

For now, we'll assume that you know you have to bootstrap the content model XML using a Spring context file (as sketched above), so we'll stick to the important files. Next up, we need a Spring context file to bootstrap our XPath metadata extraction configuration. The most important part of this Spring context file is where we bootstrap the hipster-model-mappings.properties file and the hipster-model-xpath-mappings.properties file in the extracter.xml.HipsterModelMetadataExtracter bean definition.

hipster-xml-metadata-extraction-context.xml:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<!-- Configurations for XmlMetadataExtracters -->
<beans>
   <!-- An extractor that operates on Alfresco models -->
   <bean id="extracter.xml.HipsterModelMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"
         parent="baseMetadataExtracter"
         init-method="init" >
      <property name="mappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
               <value>classpath:alfresco/module/massnerder-blog-xpath-metadata-extraction/metadata/extraction/hipster-model-mappings.properties</value>
            </property>
         </bean>
      </property>
      <property name="xpathMappingProperties">
         <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
               <value>classpath:alfresco/module/massnerder-blog-xpath-metadata-extraction/metadata/extraction/hipster-model-xpath-mappings.properties</value>
            </property>
         </bean>
      </property>
   </bean>

   <!-- A selector that executes XPath statements -->
   <bean
         id="extracter.xml.selector.HipsterXPathSelector"
         class="org.alfresco.repo.content.selector.XPathContentWorkerSelector"
         init-method="init">
      <property name="workers">
         <map>
            <entry key="/*">
               <ref bean="extracter.xml.HipsterModelMetadataExtracter" />
            </entry>
         </map>
      </property>
   </bean>

   <!-- The wrapper XML metadata extracter -->
   <bean
         id="extracter.xml.HipsterXMLMetadataExtracter"
         class="org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter"
         parent="baseMetadataExtracter">
      <property name="overwritePolicy">
         <value>EAGER</value>
      </property>
      <property name="selectors">
         <list>
            <ref bean="extracter.xml.selector.HipsterXPathSelector" />
         </list>
      </property>
   </bean>
</beans>

Let's have a closer look at the hipster-model-mappings.properties. The main purpose of this file is to tell the XPathMetadataExtracter class which content model and which metadata properties we will be using during the extraction.

 hipster-model-mappings.properties:

# Namespaces
namespace.prefix.hip=http://www.massnerder.io/model/1.0

# Mappings
artistName=hip:artistName
label=hip:label
origin=hip:origin
genres=hip:genres
members=hip:members
formed=hip:formed
disbanded=hip:disbanded

The hipster-model-xpath-mappings.properties file is where all the magic happens. It takes the property names defined in hipster-model-mappings.properties and maps each one to an XPath expression used to extract values from the XML files we'll upload afterwards.

hipster-model-xpath-mappings.properties

# Hipster Property XPath Mappings
artistName=/artist/@name
label=/artist/label
origin=/artist/origin
genres=/artist/genres/genre/text()
members=/artist/members/member/text()
formed=/artist/formed
disbanded=/artist/disbanded

And lastly we have a sample artist XML file with the following content that we'll upload to a Share collaboration site.

shakey-graves.xml

<?xml version="1.0" encoding="UTF-8"?>
<artist name="Shakey Graves">
    <label>Independent</label>
    <origin>Austin, Texas, USA</origin>
    <genres>
        <genre>Indie Americana</genre>
    </genres>
    <members>
        <member>Alejandro Rose-Garcia</member>
    </members>
    <formed>2007</formed>
    <disbanded></disbanded>
</artist>

When we upload the shakey-graves.xml file to our Discography collaboration site and specialize it to the Artist content type, we get the following results:

METADATA EXTRACTION RESULTS IN ALFRESCO SHARE

 

Check out this video to see XPath Metadata Extraction working in real-time...

 

In true hipster fashion, we've taken an old feature and brought it back to life. Hipsters and hipster trends get a bad rap, but XPath Metadata Extraction has proven to be pretty useful in my experience. So get out there and grow an ironic mustache, make your own clothing, and pickle things that should never be pickled! When you've got all your hipster gear in check, grab the source for this post on GitHub here.

Also come join us at Alfresco Day in San Francisco on August 4th, 2015. The registration page and event details can be found here!

Keep Calm and Hipster On!!!

Alfresco + Apache ManifoldCF + ElasticSearch: Part 2

This post will build on Alfresco + Apache ManifoldCF + ElasticSearch: Part 1. Specifically, we're going to add a Google Drive connector and a file system connector. The main goal for this post is to better demonstrate how we can aggregate indices from multiple repositories and how we can possibly satisfy federated search requirements. 

Before we can integrate Google Drive with ManifoldCF, we'll need to complete a few pre-requisites:

  1. Create a new Google API project
  2. Enable the Google Drive API and SDK services
  3. Create a new Client Id
  4. Generate a long-living Refresh Token  

 

Create a new Google API project

Start by navigating to the Google APIs Console. If you haven't created an API project before, you'll be presented with the following screen:

Enable the Google Drive API and SDK Services

After creating your project, navigate to the Services section in the left-hand menu. In the services section, enable the Drive API and Drive SDK options. 

Create a new Client Id

Switch to the API Access menu, then click the Create an OAuth 2.0 client ID button. Name your project and click the next button:

Choose "Web application" for the application type, change the https dropdown to http, change the hostname to localhost, and click the Create client ID button:

In the following screen, click the Edit Settings... link and add https://developers.google.com/oauthplayground to the redirect URIs list. Record the Client ID and Client Secret values for later use:

Generate a long-living Refresh Token

If we've done everything correctly, we shouldn't need to go back to the Google APIs Console. The next stop is the Google Developers OAuth 2.0 Playground.

Once you're in the OAuth Playground, click the settings/gear/cog button on the right-hand side. Check the "Use your own OAuth credentials" checkbox and enter the Client ID and Client Secret values we recorded from the Google APIs Console.

In the "Step 1 Select & authorize APIs" menu, scroll down to Drive API v2, expand that menu option, select "https://www.googleapis.com/auth/drive.readonly," and click the Authorize APIs button. When you are prompted to allow access for the Apache ManifoldCF application to your Google Drive account, choose Accept.     

At this point, an Authorization code should have been automatically generated. Now we can generate a long-living Refresh Token by clicking the "Exchange authorization code for tokens" button. Record the Refresh Token value for use in the ManifoldCF configuration. 

Now that the pre-requisites are sorted, we can move onto configuring our Google Drive repository connection in ManifoldCF. In the Repository Connections List, click the "Add new connection" link. Provide a name and description for your Google Drive connection.  

On the Type tab, select GoogleDrive as connection type, leave the authority type as the default value and click continue.   

Keep the default values on the Throttling tab, then continue to the Server tab. Enter the values for the Refresh Token, Client ID, and Client Secret that we previously generated, then click the save button.

Let's go back to the Repository Connections List and click the "Add new connection" link again. Provide a name and description for our File System connection.

On the Type tab, select File as connection type, leave the authority type as the default value and click continue. 

Keep the default values on the Throttling tab, then click the save button and our File System connection should be ready to go.

Before we create jobs to crawl Google Drive and the file system, let's prepare some sample invoice documents.

Alright. Now let's create a new job to crawl Google Drive and export to ElasticSearch.  On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection, "Google Drive" as the repository connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now. 

Keep the defaults for every other tab and move to the Seed Query tab. Let's enter a query that will search for documents where the title contains "Invoice" and the mimetype is equal to "image/tiff", and that excludes documents in the trash; in Drive's query syntax, that's something like title contains 'Invoice' and mimeType = 'image/tiff' and trashed = false. For more on Google Drive's query syntax, refer to the following documentation:

 https://developers.google.com/drive/search-parameters

Finally, click save.

Our Google Drive job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.

We've got one more job to configure. So let's create a new job to crawl the file system and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection, "File System" as the repository connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.

Switch over to the Repository Paths tab, add the path to your invoices directory on the file system. "/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices" in my case. Remove the option to include folders since our invoice directory hierarchy is flat. Finally, click the save button.   

Now let's switch over to the "Status and Job Management" page for the last time.

Our File System - ElasticSearch Job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment.  

Time to query ElasticSearch again. This time, we're going to add an extra "title" column for our Google Drive document. 

curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
{
  "fields" : ["cmis:name", "title"],
  "query" : {
    "query_string" : {
      "query" : "Invoice*"
    }
  }
}
'

Voilà!!!

{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "https://doc-14-18-docs.googleusercontent.com/docs/securesc/amkuuenev7losasm7hb5lnkpvd88ufog/vd21rehnm9e62dbi10b4eiefjkhjlhvt/1377482400000/06874029851172517131/06874029851172517131/0BxcKfzVaCdtBemdpRDVuQlM4V0k?h=10387061918782038444&e=download&gd=true",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "title": "Invoice02.tif"
        }
      },
      {
        "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "cmis:name": "Invoice01.tif"
        }
      },
      {
        "_id": "file:/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices/Invoice03.tif",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices"
      }
    ],
    "max_score": 1.0,
    "total": 3
  },
  "timed_out": false,
  "took": 2
}

Just to recap...we crawled Alfresco, Google Drive, and a file system directory for invoices and exported them to a single ElasticSearch instance. Now that was extremely easy and pretty darn awesome.

- Keep Calm and Nerd On

Alfresco + Apache ManifoldCF + ElasticSearch: Part 1

In this post, I'll try and cover the following items:

  • What is Apache ManifoldCF?
  • What benefits can ManifoldCF bring to Alfresco?
  • How to configure Alfresco and ElasticSearch connectors in ManifoldCF.

 

What is Apache ManifoldCF?

     Apache ManifoldCF is a framework that can be used to synchronize documents and their associated metadata from multiple repositories, such as Alfresco, with search engines or other output targets. In addition, ManifoldCF has the ability to maintain ACLs from the source repository.

     

    What benefits can ManifoldCF bring to Alfresco?

    One of the first benefits that comes to mind is the ability to integrate with a centralized search server that is leveraged across your enterprise. Do you see that? Yes, that's a light at the end of the Federated Search tunnel. This could lead you down a few paths....

    1. Indexes are duplicated between the Alfresco implementation of Solr and your centralized search server of choice.
    2. You completely decouple the search and indexing tier from Alfresco and integrate the Alfresco SearchService with the centralized search server.

    Another potential application for ManifoldCF is migrations. The ManifoldCF APIs coupled with a way to map and massage metadata could prove to be a very powerful combination.

     

    How to configure Alfresco and ElasticSearch connectors

      Download  Apache ManifoldCF:

      wget http://mirror.tcpdiag.net/apache/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz

      Extract the Apache ManifoldCF tarball :

      tar -zxvf apache-manifoldcf-1.3-bin.tar.gz

      Change dir to the "apache-manifoldcf-1.3/example" directory:

      cd apache-manifoldcf-1.3/example

      Start ManifoldCF:

      java -jar start.jar

      Confirm that you can reach the ManifoldCF Crawler UI (http://localhost:8345/mcf-crawler-ui/login.jsp) and log in with username=admin, password=admin:

      Download ElasticSearch:

      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.tar.gz

      Extract the ElasticSearch tarball:

      tar -zxvf elasticsearch-0.90.3.tar.gz

      Change directory to elasticsearch-0.90.3/bin:

      cd elasticsearch-0.90.3/bin

      Start ElasticSearch:

      ./elasticsearch -f
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] version[0.90.3], pid[91606], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] initializing ...
      [2013-08-19 16:36:19,891][INFO ][plugins ] [Ogre] loaded [], sites []
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] initialized
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] starting ...
      [2013-08-19 16:36:21,825][INFO ][transport ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/X.X.X.X:9300]}
      [2013-08-19 16:36:24,850][INFO ][cluster.service ] [Ogre] new_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], reason: zen-disco-join (elected_as_master)
      [2013-08-19 16:36:24,869][INFO ][discovery ] [Ogre] elasticsearch/InMryYlLS_2MW9HCtu8i7Q
      [2013-08-19 16:36:24,878][INFO ][http ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/X.X.X.X:9200]}
      [2013-08-19 16:36:24,878][INFO ][node ] [Ogre] started
      [2013-08-19 16:36:24,892][INFO ][gateway ] [Ogre] recovered [0] indices into cluster_state

      Confirm that ElasticSearch is up and running: 

      curl -XGET localhost:9200
      {
        "ok" : true,
        "status" : 200,
        "name" : "Carrion",
        "version" : {
          "number" : "0.90.3",
          "build_hash" : "5c38d6076448b899d758f29443329571e2522410",
          "build_timestamp" : "2013-08-06T13:18:31Z",
          "build_snapshot" : false,
          "lucene_version" : "4.4"
        },
        "tagline" : "You Know, for Search"
      }

      Let's configure an Alfresco/CMIS Repository Connector in ManifoldCF. In the Repository Connections List, click the "Add a new connection" link. Then provide a name and description for your Alfresco/CMIS connection.  

      Select CMIS as the connection type, leave the authority as the default value and click continue.

       

      Keep the default values on the Throttling tab, then continue to Server tab configuration. Enter the AtomPub server values for Alfresco, then click save.

      Confirm that the Alfresco/CMIS connector is working by checking the Connection Status value. 

      Now let's configure an ElasticSearch output connector. In the Output Connections List, click the "Add a new output connection" link. Then provide a name and description for your ElasticSearch output connection. 

      Select ElasticSearch as the connection type, leave the authority as the default value and click continue.

      Keep the default values on the Throttling tab, then continue to Parameters tab. Populate the Server Location (URL) for ElasticSearch. By default the port number will be 9200. For each additional instance of ElasticSearch that you spin up, the port number will increment by 1 (Example: 9201). Next add an index name and index type of your choice. Remember these values because we will need them for executing queries against ElasticSearch. Finally, click save.

       

      Confirm that the ElasticSearch output connector is working by checking the Connection Status value. 

      Before we configure a job to crawl the Alfresco repository, let's add a sample document and specialize its type to "sample:invoice". Note that I've created a sample invoice content type in advance.

      Because we're good Alfresco enthusiasts (right?) and we like to be thorough, let's test the CMIS search in the Admin Console's Node Browser. 

      We're in the home stretch now. Let's create a job to crawl Alfresco based on our CMIS query above and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.  

      On the Connection tab, choose "ElasticSearch" as the Output connection, "Alfresco CMIS" as the repository connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

      Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.  

      Let's skip over to the CMIS Query tab and enter our query: "SELECT * FROM sample:invoice". Finally, let's click save.

      Now let's switch over to the "Status and Job Management" page.

      We see that our job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.  

      Now it's time to query ElasticSearch to see if our document was indexed as expected. With the following syntax, we're going to query ElasticSearch with the index name massnerder and the index type invoices, limit the columns returned to cmis:name, and filter the search to only return results where cmis:name starts with the word Invoice.

      curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      Huzzah! We see that the query above returns one hit in the following JSON response: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 4
      }

      For kicks and giggles, let's spin up another instance of ElasticSearch and query the second node. Below is the command line output from starting our second node:

      ./elasticsearch -f
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] version[0.90.3], pid[94436], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] initializing ...
      [2013-08-19 17:30:16,418][INFO ][plugins ] [Jerry Jaxon] loaded [], sites []
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] initialized
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] starting ...
      [2013-08-19 17:30:18,339][INFO ][transport ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/X.X.X.X:9301]}
      [2013-08-19 17:30:21,398][INFO ][cluster.service ] [Jerry Jaxon] detected_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], added {[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]],}, reason: zen-disco-receive(from master [[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]]])
      [2013-08-19 17:30:21,417][INFO ][discovery ] [Jerry Jaxon] elasticsearch/JP8HydAVSWmMUbpfaTEZ4Q
      [2013-08-19 17:30:21,421][INFO ][http ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/X.X.X.X:9201]}
      [2013-08-19 17:30:21,421][INFO ][node ] [Jerry Jaxon] started

      In the logging output from the first/master ElasticSearch node, we can see that our new node has joined the cluster:  

      [2013-08-19 17:30:21,389][INFO ][cluster.service          ] [Ogre] added {[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]],}, reason: zen-disco-receive(join from node[[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]]])

      Now let's execute a query on the second node to see if the indices have replicated properly:

      curl -XGET http://localhost:9201/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      And just as expected, we received the one hit from the second ElasticSearch node: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 56
      }

      Tadaaaa! With absolutely zero coding and a little configuration, we used Apache ManifoldCF to crawl Alfresco and export the results to ElasticSearch. That wasn't so hard, was it? At this point we don't have Alfresco's SearchService pointing to ElasticSearch, so we can't leverage ElasticSearch within the Alfresco Share UI or the public services. Other than that, I think this was pretty awesome nonetheless.

       

      I want to give a few shout outs to my friends at Simflofy. They've built their product on top of ManifoldCF and have created some additional features that provide a much more polished end-to-end enterprise product, so take a gander when you get a chance.

      In addition, you can find another blog post on Alfresco, ManifoldCF, and ElasticSearch from one of our top Alfresco partners (Zaizi) here: 

      http://www.zaizi.com/blog/the-next-search-generation-in-alfresco-is-here 

      Finally, I want to give a shout out to my colleague Maurizio Pillitu and the SourceSense team that is working on the Alfresco Web Script repository connector for ManifoldCF. If you would like to contribute, you can find the GitHub project here:

      https://github.com/maoo/alfresco-webscript-manifold-connector 

      Stay tuned for Part 2 where I will build on top of what we accomplished in this article and add Google Drive and file system connectors into the mix. 

      - Keep Calm and Nerd On

      Mass Nerder

      This is my inaugural blog post! Woo hoo! Alright!

      I suppose that Mass Nerder requires a bit of an explanation. I promise it's not murder-y. It's actually a song title from one of my favorite punk rock bands, The Descendents. If you already know who they are, you've earned extra cool points in my book. If not, dive into the wonderful world of endless links known as Wikipedia....

      Most of the entries in this section will be technology and Alfresco related, but just this once I'll break the rules. 

      - Keep Calm and Nerd On