Alfresco + Apache ManifoldCF + ElasticSearch: Part 1

In this post, I'll try to cover the following items:

  • What is Apache ManifoldCF?
  • What benefits can ManifoldCF bring to Alfresco?
  • How to configure Alfresco and ElasticSearch connectors in ManifoldCF.

 

What is Apache ManifoldCF?

     Apache ManifoldCF is a framework that can be used to synchronize documents and their associated metadata from multiple repositories, such as Alfresco, with search engines or other output targets. In addition, ManifoldCF can maintain the ACLs from the source repository.

     

    What benefits can ManifoldCF bring to Alfresco?

    One of the first benefits that comes to mind is the ability to integrate with a centralized search server that is leveraged across your enterprise. Do you see that? Yes, that's a light at the end of the Federated Search tunnel. This could lead you down one of two paths:

    1. Indexes are duplicated between the Alfresco implementation of Solr and your centralized search server of choice.
    2. You completely decouple the search and indexing tier from Alfresco and integrate the Alfresco SearchService with the centralized search server.

    Another potential application for ManifoldCF is migrations. The ManifoldCF APIs, coupled with a way to map and massage metadata, could prove to be a very powerful combination.

     

    How to configure Alfresco and ElasticSearch connectors

      Download Apache ManifoldCF:

      wget http://mirror.tcpdiag.net/apache/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz

      Extract the Apache ManifoldCF tarball:

      tar -zxvf apache-manifoldcf-1.3-bin.tar.gz

      Change to the "apache-manifoldcf-1.3/example" directory (relative to where you extracted the tarball):

      cd apache-manifoldcf-1.3/example

      Start ManifoldCF:

      java -jar start.jar

      Confirm that you can reach the ManifoldCF Crawler UI (http://localhost:8345/mcf-crawler-ui/login.jsp) and log in with username=admin, password=admin.
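
      If you'd rather check from the terminal before opening a browser, here is a minimal sketch. The function names and the MCF_BASE variable are my own inventions for illustration; only the host, port and login path come from the step above.

```shell
# Base URL of the ManifoldCF example server (host and port taken from the post).
MCF_BASE="${MCF_BASE:-http://localhost:8345}"

# Print the URL of the Crawler UI login page.
mcf_login_url() {
  printf '%s/mcf-crawler-ui/login.jsp\n' "$MCF_BASE"
}

# Print the HTTP status code returned by the login page.
# curl prints 000 when nothing answers on that port.
mcf_ui_status() {
  curl -s -o /dev/null -w '%{http_code}' "$(mcf_login_url)"
}
```

      Run mcf_ui_status a few seconds after starting Jetty. A 000 means nothing is listening yet, so give it a moment and try again.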

      Download ElasticSearch:

      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.tar.gz

      Extract the ElasticSearch tarball:

      tar -zxvf elasticsearch-0.90.3.tar.gz

      Change directory to elasticsearch-0.90.3/bin:

      cd elasticsearch-0.90.3/bin

      Start ElasticSearch:

      ./elasticsearch -f
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] version[0.90.3], pid[91606], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] initializing ...
      [2013-08-19 16:36:19,891][INFO ][plugins ] [Ogre] loaded [], sites []
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] initialized
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] starting ...
      [2013-08-19 16:36:21,825][INFO ][transport ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/X.X.X.X:9300]}
      [2013-08-19 16:36:24,850][INFO ][cluster.service ] [Ogre] new_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], reason: zen-disco-join (elected_as_master)
      [2013-08-19 16:36:24,869][INFO ][discovery ] [Ogre] elasticsearch/InMryYlLS_2MW9HCtu8i7Q
      [2013-08-19 16:36:24,878][INFO ][http ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/X.X.X.X:9200]}
      [2013-08-19 16:36:24,878][INFO ][node ] [Ogre] started
      [2013-08-19 16:36:24,892][INFO ][gateway ] [Ogre] recovered [0] indices into cluster_state

      Confirm that ElasticSearch is up and running: 

      curl -XGET localhost:9200
      {
        "ok" : true,
        "status" : 200,
        "name" : "Carrion",
        "version" : {
          "number" : "0.90.3",
          "build_hash" : "5c38d6076448b899d758f29443329571e2522410",
          "build_timestamp" : "2013-08-06T13:18:31Z",
          "build_snapshot" : false,
          "lucene_version" : "4.4"
        },
        "tagline" : "You Know, for Search"
      }
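
      The root endpoint only tells us that the node answers. As a small optional extra, the cluster health endpoint also reports the overall status and the number of nodes. The sketch below is an illustration only: the function names and the ES_HOST variable are hypothetical, and it assumes ElasticSearch is listening on the default HTTP port.

```shell
# Host running ElasticSearch (assumption: same machine as in the post).
ES_HOST="${ES_HOST:-localhost}"

# Print the cluster health URL for the node on the given HTTP port (default 9200).
es_health_url() {
  printf 'http://%s:%s/_cluster/health?pretty=true\n' "$ES_HOST" "${1:-9200}"
}

# Fetch cluster health; the "status" field is green, yellow or red.
es_health() {
  curl -s -XGET "$(es_health_url "${1:-9200}")"
}
```

      Calling es_health with no argument asks the node on port 9200; pass another port number to ask a different node on the same host.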

      Let's configure an Alfresco/CMIS Repository Connector in ManifoldCF. In the Repository Connections List, click the "Add a new connection" link. Then provide a name and description for your Alfresco/CMIS connection.  

      Select CMIS as the connection type, leave the authority as the default value and click continue.

       

      Keep the default values on the Throttling tab, then continue to the Server tab. Enter the AtomPub server values for Alfresco, then click save.

      Confirm that the Alfresco/CMIS connector is working by checking the Connection Status value. 

      Now let's configure an ElasticSearch output connector. In the Output Connections List, click the "Add a new output connection" link. Then provide a name and description for your ElasticSearch output connection. 

      Select ElasticSearch as the connection type, leave the authority as the default value and click continue.

      Keep the default values on the Throttling tab, then continue to the Parameters tab. Populate the Server Location (URL) for ElasticSearch. By default the HTTP port is 9200. Each additional ElasticSearch instance that you spin up on the same host takes the next free port (for example, 9201). Next, add an index name and index type of your choice. Remember these values, because we will need them when we query ElasticSearch. Finally, click save.

       

      Confirm that the ElasticSearch output connector is working by checking the Connection Status value. 

      Before we configure a job to crawl the Alfresco repository, let's add a sample document and specialize the type to "sample:invoice". Note that I've created a sample invoice content type in advance.

      Because we're good Alfresco enthusiasts (right?) and we like to be thorough, let's test the CMIS search in the Admin Console's Node Browser. 
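
      If you prefer the command line to the Node Browser, the same CMIS query can probably be sent straight to the AtomPub binding. Treat this as a sketch under assumptions: the /query resource with a "q" parameter is how I understand the OpenCMIS AtomPub binding to expose queries, the repository id is the one that shows up in the indexed document ids later in this post (yours will differ), admin/admin are placeholder credentials, and all function and variable names are made up for illustration.

```shell
# AtomPub base URL and repository id. The id below is copied from the
# document "_id" values shown later in this post; replace it with your own.
CMIS_BASE="${CMIS_BASE:-http://localhost:8080/alfresco/cmisatom}"
CMIS_REPO="${CMIS_REPO:-1699d3d2-890b-4b0b-9780-09fa8a487562}"

# Print the URL of the query resource (assumed path, see the note above).
cmis_query_url() {
  printf '%s/%s/query\n' "$CMIS_BASE" "$CMIS_REPO"
}

# Run a CMIS query statement. -G together with --data-urlencode makes curl
# append the URL-encoded statement as the "q" query-string parameter.
# The admin:admin credentials are placeholders, not a recommendation.
cmis_query() {
  curl -s -G -u "${CMIS_USER:-admin}:${CMIS_PASS:-admin}" \
    --data-urlencode "q=$1" "$(cmis_query_url)"
}
```

      For example, cmis_query 'SELECT * FROM sample:invoice' should come back with an Atom feed containing one entry per matching document, if the endpoint behaves as assumed.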

      We're in the home stretch now. Let's create a job to crawl Alfresco based on our CMIS query above and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.  

      On the Connection tab, choose "ElasticSearch" as the output connection and "Alfresco CMIS" as the repository connection, change the priority to "1 (Highest)", and change the start method to "Start even inside a schedule window".

      Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.  

      Let's skip over to the CMIS Query tab and enter our query: "SELECT * FROM sample:invoice". Finally, let's click save.

      Now let's switch to the "Status and Job Management" page.

      We see that our job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.  
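
      The same start-and-watch routine can also be scripted through the ManifoldCF REST API instead of the UI. The sketch below assumes the API service is deployed at /mcf-api-service on the same Jetty instance and needs no login, which is my understanding of the example setup rather than something shown in this post. The function names are hypothetical, and the job id has to be looked up first (for instance from the jobstatuses listing).

```shell
# Base URL of the ManifoldCF JSON API (assumed location in the example setup).
MCF_API="${MCF_API:-http://localhost:8345/mcf-api-service/json}"

# Print the API URL for a resource such as "jobs" or "jobstatuses".
mcf_api_url() {
  printf '%s/%s\n' "$MCF_API" "$1"
}

# List the status of every job, including its document counts.
mcf_job_statuses() {
  curl -s -XGET "$(mcf_api_url jobstatuses)"
}

# Manually start the job with the given job id.
mcf_start_job() {
  curl -s -XPUT "$(mcf_api_url "start/$1")"
}
```

      A typical session would be mcf_job_statuses to find the job id, mcf_start_job with that id, then mcf_job_statuses again after a few moments to watch the counts change.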

      Now it's time to query ElasticSearch to see if our document was indexed as expected. With the following syntax we're going to query ElasticSearch in the index name = massnerder and the index type = invoices, limit the fields returned to cmis:name, and filter the search to only return results where a term starts with Invoice.

      curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      Huzzah! We see that the query above returns one hit in the following JSON response: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 4
      }
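
      For quick checks, the same search can be written without a JSON body by using the URI search form, where "q" is parsed as a query string query and "fields" limits what comes back for each hit. This is just a convenience sketch: the function names and the ES_URL variable are hypothetical, and the index and type names are whatever you entered in the output connection.

```shell
# Base URL of the ElasticSearch node to query.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a URI-search URL from: index, type, query string, field list.
es_uri_search_url() {
  printf '%s/%s/%s/_search?q=%s&fields=%s&pretty=true\n' \
    "$ES_URL" "$1" "$2" "$3" "$4"
}

# Run the URI search and print the JSON response.
es_uri_search() {
  curl -s -XGET "$(es_uri_search_url "$1" "$2" "$3" "$4")"
}
```

      With the values from this post, that is es_uri_search massnerder invoices 'Invoice*' 'cmis:name'. Keep the single quotes so the shell does not try to expand the asterisk.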

      For kicks and giggles, let's spin up another instance of ElasticSearch and query the second node. Below is the command line output from starting our second node:

      ./elasticsearch -f
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] version[0.90.3], pid[94436], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] initializing ...
      [2013-08-19 17:30:16,418][INFO ][plugins ] [Jerry Jaxon] loaded [], sites []
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] initialized
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] starting ...
      [2013-08-19 17:30:18,339][INFO ][transport ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/X.X.X.X:9301]}
      [2013-08-19 17:30:21,398][INFO ][cluster.service ] [Jerry Jaxon] detected_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], added {[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]],}, reason: zen-disco-receive(from master [[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]]])
      [2013-08-19 17:30:21,417][INFO ][discovery ] [Jerry Jaxon] elasticsearch/JP8HydAVSWmMUbpfaTEZ4Q
      [2013-08-19 17:30:21,421][INFO ][http ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/X.X.X.X:9201]}
      [2013-08-19 17:30:21,421][INFO ][node ] [Jerry Jaxon] started

      In the logging output from the first/master ElasticSearch node, we can see that our new node has joined the cluster:  

      [2013-08-19 17:30:21,389][INFO ][cluster.service          ] [Ogre] added {[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]],}, reason: zen-disco-receive(join from node[[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]]])

      Now let's execute the same query on the second node to see if the indices have replicated properly:

      curl -XGET http://localhost:9201/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      And just as expected, we received the one hit from the second ElasticSearch node: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 56
      }
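
      To compare the two nodes side by side without reading two full responses, a count query against each port is enough. One caveat: any node in the cluster can coordinate a search, so matching counts mainly show that both nodes can serve the query; the cluster health status is the better indicator of whether replica shards are actually assigned. The sketch below uses the ports, index and type from this post, and its function names are made up for illustration.

```shell
# Build a _count URL from: HTTP port, index, type, query string.
es_count_url() {
  printf 'http://localhost:%s/%s/%s/_count?q=%s\n' "$1" "$2" "$3" "$4"
}

# Print the _count response from each node so the numbers can be compared.
es_compare_counts() {
  for port in 9200 9201; do
    printf 'port %s: ' "$port"
    curl -s -XGET "$(es_count_url "$port" massnerder invoices 'Invoice*')"
    printf '\n'
  done
}
```

      Running es_compare_counts prints one line per node, each with the count that node reports for the Invoice* query.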

      Tadaaaa! With absolutely zero coding and a little configuration, we used Apache ManifoldCF to crawl Alfresco and export the results to ElasticSearch. That wasn't so hard, was it? At this point we don't have Alfresco's SearchService pointing to ElasticSearch, so we can't leverage ElasticSearch within the Alfresco Share UI or the public services. Even so, I think this was pretty awesome.

       

      I want to give a few shout-outs, starting with my friends at Simflofy. They've built their product on top of ManifoldCF and have created some additional features that provide a much more polished end-to-end enterprise product, so take a gander when you get a chance.

      In addition, you can find another blog post on Alfresco, ManifoldCF, and ElasticSearch from one of our top Alfresco partners (Zaizi) here: 

      http://www.zaizi.com/blog/the-next-search-generation-in-alfresco-is-here 

      Finally, I want to give a shout-out to my colleague Maurizio Pillitu and the SourceSense team, who are working on the Alfresco Web Script repository connection for ManifoldCF. If you would like to contribute, you can find the GitHub project here:

      https://github.com/maoo/alfresco-webscript-manifold-connector 

      Stay tuned for Part 2 where I will build on top of what we accomplished in this article and add Google Drive and file system connectors into the mix. 

      - Keep Calm and Nerd On