Alfresco + Apache ManifoldCF + ElasticSearch: Part 2

This post will build on Alfresco + Apache ManifoldCF + ElasticSearch: Part 1. Specifically, we're going to add a Google Drive connector and a file system connector. The main goal for this post is to better demonstrate how we can aggregate indices from multiple repositories and how we can possibly satisfy federated search requirements. 

Before we can integrate Google Drive with ManifoldCF, we'll need to complete a few pre-requisites:

  1. Create a new Google API project
  2. Enable the Google Drive API and SDK services
  3. Create a new Client Id
  4. Generate a long-lived Refresh Token

 

Create a new Google API project

Start by navigating to the Google APIs Console. If you haven't created an API project before, then you'll be presented with the following screen:

Enable the Google Drive API and SDK Services

After creating your project, navigate to the Services section in the left-hand menu. In the services section, enable the Drive API and Drive SDK options. 

Create a new Client Id

Switch to the API Access menu, then click the Create an OAuth 2.0 client ID button. Name your project and click the next button:

Choose "Web application" for the application type, change the https dropdown to http, change the hostname to localhost, and click the Create client ID button:

In the following screen, click the Edit Settings... link and add https://developers.google.com/oauthplayground to the redirect URIs list. Record the Client ID and Client Secret values for later use:

Generate a long-lived Refresh Token

If we've done everything correctly, we shouldn't need to go back to the Google APIs Console. The next stop is the Google Developers OAuth 2.0 Playground.

Once you're in the OAuth Playground, click the settings (gear) button on the right-hand side. Check the "Use your own OAuth credentials" checkbox and enter the Client ID and Client Secret values we recorded from the Google APIs Console.

In the "Step 1 Select & authorize APIs" menu, scroll down to Drive API v2, expand that menu option, select "https://www.googleapis.com/auth/drive.readonly", and click the Authorize APIs button. When you are prompted to allow access for the Apache ManifoldCF application to your Google Drive account, choose Accept.

At this point, an Authorization code should have been automatically generated. Now we can generate a long-lived Refresh Token by clicking the "Exchange authorization code for tokens" button. Record the Refresh Token value for use in the ManifoldCF configuration.
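ManifoldCF uses this Refresh Token behind the scenes to obtain short-lived access tokens from Google's OAuth token endpoint. As a rough sketch of what that exchange looks like (all credential values below are hypothetical placeholders), the request body is assembled like this:

```shell
# Hypothetical placeholder credentials -- substitute the values you recorded.
CLIENT_ID="123456789.apps.googleusercontent.com"
CLIENT_SECRET="your-client-secret"
REFRESH_TOKEN="1/your-refresh-token"

# Body of the POST that trades a Refresh Token for a short-lived access token
# (sent to https://accounts.google.com/o/oauth2/token). We only print it here
# so the snippet runs without network access or real credentials.
BODY="client_id=${CLIENT_ID}&client_secret=${CLIENT_SECRET}&refresh_token=${REFRESH_TOKEN}&grant_type=refresh_token"
echo "${BODY}"
```

You never have to perform this exchange yourself; it just explains why the Refresh Token is the only credential the connector needs long-term.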

Now that the pre-requisites are sorted, we can move onto configuring our Google Drive repository connection in ManifoldCF. In the Repository Connections List, click the "Add new connection" link. Provide a name and description for your Google Drive connection.  

On the Type tab, select GoogleDrive as the connection type, leave the authority type at its default value, and click continue.

Keep the default values on the Throttling tab, then continue to the Server tab. Enter the Refresh Token, Client ID, and Client Secret values that we previously generated, then click the save button.

Let's go back to the Repository Connections List and click the "Add new connection" link again. Provide a name and description for our File System connection.

On the Type tab, select File as the connection type, leave the authority type at its default value, and click continue.

Keep the default values on the Throttling tab, then click the save button and our File System connection should be ready to go.

Before we create jobs to crawl Google Drive and the file system, let's prepare some sample invoice documents.
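If you need placeholder files to crawl, a quick way to stub out a flat invoices directory looks like this (the path and file names are just examples; real .tif content works better for testing text extraction):

```shell
# Create a flat directory of empty placeholder invoice files.
# The path is hypothetical -- point it wherever your file system job will crawl.
INVOICE_DIR="/tmp/Invoices"
mkdir -p "${INVOICE_DIR}"
for i in 01 02 03; do
  touch "${INVOICE_DIR}/Invoice${i}.tif"
done
ls "${INVOICE_DIR}"
```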

Alright. Now let's create a new job to crawl Google Drive and export to ElasticSearch.  On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection, "Google Drive" as the repo connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now. 

Keep the defaults for every other tab and move to the Seed Query tab. Let's enter a query that finds documents whose title contains "Invoice" and whose MIME type is "image/tiff", and that excludes documents in the trash. For more on Google Drive's query syntax, refer to the following documentation:

 https://developers.google.com/drive/search-parameters
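Putting those criteria together, the seed query I used looks roughly like this (field names follow the Drive v2 search-parameter docs linked above; adapt it to your own documents):

```
title contains 'Invoice' and mimeType = 'image/tiff' and trashed = false
```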

Finally, click save.

Our Google Drive job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.

We've got one more job to configure. So let's create a new job to crawl the file system and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection, "File System" as the repo connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.

Switch over to the Repository Paths tab and add the path to your invoices directory on the file system ("/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices" in my case). Remove the option to include folders, since our invoice directory hierarchy is flat. Finally, click the save button.

Now let's switch to the "Status and Job Management" page one last time.

Our File System - ElasticSearch Job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment.  

Time to query ElasticSearch again. This time, we're going to add an extra "title" column for our Google Drive document. 

curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
{
  "fields" : ["cmis:name", "title"],
  "query" : {
    "query_string" : {
      "query" : "Invoice*"
    }
  }
}
'

Voilà!

{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "https://doc-14-18-docs.googleusercontent.com/docs/securesc/amkuuenev7losasm7hb5lnkpvd88ufog/vd21rehnm9e62dbi10b4eiefjkhjlhvt/1377482400000/06874029851172517131/06874029851172517131/0BxcKfzVaCdtBemdpRDVuQlM4V0k?h=10387061918782038444&e=download&gd=true",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "title": "Invoice02.tif"
        }
      },
      {
        "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "cmis:name": "Invoice01.tif"
        }
      },
      {
        "_id": "file:/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices/Invoice03.tif",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices"
      }
    ],
    "max_score": 1.0,
    "total": 3
  },
  "timed_out": false,
  "took": 2
}

Just to recap...we crawled Alfresco, Google Drive, and a file system directory for invoices and exported them to a single ElasticSearch instance. Now that was extremely easy and pretty darn awesome.

- Keep Calm and Nerd On

Alfresco + Apache ManifoldCF + ElasticSearch: Part 1

In this post, I'll try and cover the following items:

  • What is Apache ManifoldCF?
  • What benefits can ManifoldCF bring to Alfresco?
  • How to configure Alfresco and ElasticSearch connectors in ManifoldCF.

 

What is Apache ManifoldCF?

     Apache ManifoldCF is a framework that can be used to synchronize documents and their associated metadata from multiple repositories, such as Alfresco, with search engines or other output targets. In addition, ManifoldCF can maintain ACLs from the source repository.

     

    What benefits can ManifoldCF bring to Alfresco?

    One of the first benefits that comes to mind is the ability to integrate with a centralized search server that is leveraged across your enterprise. Do you see that? Yes, that's a light at the end of the Federated Search tunnel. This could lead you down a few paths:

    1. Indexes are duplicated between the Alfresco implementation of Solr and your centralized search server of choice.
    2. You completely decouple the search and indexing tier from Alfresco and integrate the Alfresco SearchService with a centralized search server.

    Another potential application for ManifoldCF is migrations. The ManifoldCF APIs, coupled with a way to map and massage metadata, could prove to be a very powerful combination.

     

    How to configure Alfresco and ElasticSearch connectors

      Download Apache ManifoldCF:

      wget http://mirror.tcpdiag.net/apache/manifoldcf/apache-manifoldcf-1.3-bin.tar.gz

      Extract the Apache ManifoldCF tarball:

      tar -zxvf apache-manifoldcf-1.3-bin.tar.gz

      Change to the "apache-manifoldcf-1.3/example" directory:

      cd apache-manifoldcf-1.3/example

      Start ManifoldCF:

      java -jar start.jar

      Confirm that you can reach the ManifoldCF Crawler UI (http://localhost:8345/mcf-crawler-ui/login.jsp) and log in with username=admin, password=admin:

      Download ElasticSearch:

      wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.tar.gz

      Extract the ElasticSearch tarball:

      tar -zxvf elasticsearch-0.90.3.tar.gz

      Change directory to elasticsearch-0.90.3/bin:

      cd elasticsearch-0.90.3/bin

      Start ElasticSearch:

      ./elasticsearch -f
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] version[0.90.3], pid[91606], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 16:36:19,885][INFO ][node ] [Ogre] initializing ...
      [2013-08-19 16:36:19,891][INFO ][plugins ] [Ogre] loaded [], sites []
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] initialized
      [2013-08-19 16:36:21,736][INFO ][node ] [Ogre] starting ...
      [2013-08-19 16:36:21,825][INFO ][transport ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/X.X.X.X:9300]}
      [2013-08-19 16:36:24,850][INFO ][cluster.service ] [Ogre] new_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], reason: zen-disco-join (elected_as_master)
      [2013-08-19 16:36:24,869][INFO ][discovery ] [Ogre] elasticsearch/InMryYlLS_2MW9HCtu8i7Q
      [2013-08-19 16:36:24,878][INFO ][http ] [Ogre] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/X.X.X.X:9200]}
      [2013-08-19 16:36:24,878][INFO ][node ] [Ogre] started
      [2013-08-19 16:36:24,892][INFO ][gateway ] [Ogre] recovered [0] indices into cluster_state

      Confirm that ElasticSearch is up and running: 

      curl -XGET localhost:9200
      {
        "ok" : true,
        "status" : 200,
        "name" : "Carrion",
        "version" : {
          "number" : "0.90.3",
          "build_hash" : "5c38d6076448b899d758f29443329571e2522410",
          "build_timestamp" : "2013-08-06T13:18:31Z",
          "build_snapshot" : false,
          "lucene_version" : "4.4"
        },
        "tagline" : "You Know, for Search"
      }
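A small aside: curl prints ElasticSearch responses as one long unformatted line by default. ElasticSearch also accepts ?pretty=true on the request URL, or you can pipe the output through Python's built-in json.tool module (assuming Python 3 is on your PATH):

```shell
# Pretty-print JSON from stdin. Normally you'd pipe curl output into it:
#   curl -s -XGET localhost:9200 | python3 -m json.tool
# Here we feed it a literal string so the example runs without a live server.
echo '{"ok": true, "status": 200}' | python3 -m json.tool
```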

      Let's configure an Alfresco/CMIS Repository Connector in ManifoldCF. In the Repository Connections List, click the "Add a new connection" link. Then provide a name and description for your Alfresco/CMIS connection.  

      Select CMIS as the connection type, leave the authority as the default value and click continue.

       

      Keep the default values on the Throttling tab, then continue to the Server tab. Enter the AtomPub server values for Alfresco, then click save.

      Confirm that the Alfresco/CMIS connector is working by checking the Connection Status value. 

      Now let's configure an ElasticSearch output connector. In the Output Connections List, click the "Add a new output connection" link. Then provide a name and description for your ElasticSearch output connection. 

      Select ElasticSearch as the connection type, leave the authority as the default value and click continue.

      Keep the default values on the Throttling tab, then continue to the Parameters tab. Populate the Server Location (URL) for ElasticSearch; by default the port number is 9200, and each additional local instance of ElasticSearch you spin up binds the next port (for example, 9201). Next, add an index name and index type of your choice. Remember these values, because we will need them when executing queries against ElasticSearch. Finally, click save.
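Those two values become path segments in every query URL. With the index name and type used later in this post (massnerder and invoices), searches are issued against:

```
http://localhost:9200/massnerder/invoices/_search
```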

       

      Confirm that the ElasticSearch output connector is working by checking the Connection Status value. 

      Before we configure a job to crawl the Alfresco repository, let's add a sample document and specialize its type to "sample:invoice". Note that I've created a sample invoice content type in advance.

      Because we're good Alfresco enthusiasts (right?) and we like to be thorough, let's test the CMIS search in the Admin Console's Node Browser. 

      We're in the home stretch now. Let's create a job to crawl Alfresco based on our CMIS query above and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.  

      On the Connection tab, choose "ElasticSearch" as the Output connection, "Alfresco CMIS" as the repo connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

      Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.  

      Let's skip over to the CMIS Query tab and enter our query: "SELECT * FROM sample:invoice". Finally, let's click save.

      Now let's switch to the "Status and Job Management" page.

      We see that our job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.  

      Now it's time to query ElasticSearch to see if our document was indexed as expected. With the following syntax we're going to query the massnerder index and the invoices type, limit the columns returned to cmis:name, and filter the search to only return results where cmis:name starts with the word Invoice.

      curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      Huzzah! We see that the query above returns one hit in the following JSON response: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 4
      }

      For kicks and giggles, let's spin up another instance of ElasticSearch and query the second node. Below is the command line output from starting our second node:

      ./elasticsearch -f
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] version[0.90.3], pid[94436], build[5c38d60/2013-08-06T13:18:31Z]
      [2013-08-19 17:30:16,413][INFO ][node ] [Jerry Jaxon] initializing ...
      [2013-08-19 17:30:16,418][INFO ][plugins ] [Jerry Jaxon] loaded [], sites []
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] initialized
      [2013-08-19 17:30:18,255][INFO ][node ] [Jerry Jaxon] starting ...
      [2013-08-19 17:30:18,339][INFO ][transport ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/X.X.X.X:9301]}
      [2013-08-19 17:30:21,398][INFO ][cluster.service ] [Jerry Jaxon] detected_master [Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]], added {[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]],}, reason: zen-disco-receive(from master [[Ogre][InMryYlLS_2MW9HCtu8i7Q][inet[/X.X.X.X:9300]]])
      [2013-08-19 17:30:21,417][INFO ][discovery ] [Jerry Jaxon] elasticsearch/JP8HydAVSWmMUbpfaTEZ4Q
      [2013-08-19 17:30:21,421][INFO ][http ] [Jerry Jaxon] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/X.X.X.X:9201]}
      [2013-08-19 17:30:21,421][INFO ][node ] [Jerry Jaxon] started

      In the logging output from the first/master ElasticSearch node, we can see that our new node has joined the cluster:  

      [2013-08-19 17:30:21,389][INFO ][cluster.service          ] [Ogre] added {[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]],}, reason: zen-disco-receive(join from node[[Jerry Jaxon][JP8HydAVSWmMUbpfaTEZ4Q][inet[/X.X.X.X:9301]]])

      Now let's execute a query on the second node to see if the indices have replicated properly:

      curl -XGET http://localhost:9201/massnerder/invoices/_search -d '
      {
        "fields" : ["cmis:name"],
        "query" : {
          "query_string" : {
            "query" : "Invoice*"
          }
        }
      }
      '

      And just as expected, we received the one hit from the second ElasticSearch node: 

      {
        "_shards": {
          "failed": 0,
          "successful": 5,
          "total": 5
        },
        "hits": {
          "hits": [
            {
              "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
              "_index": "massnerder",
              "_score": 1.0,
              "_type": "invoices",
              "fields": {
                "cmis:name": "Invoice01.tif"
              }
            }
          ],
          "max_score": 1.0,
          "total": 1
        },
        "timed_out": false,
        "took": 56
      }

      Tadaaaa! With absolutely zero coding and a little configuration, we used Apache ManifoldCF to crawl Alfresco and export the results to ElasticSearch. That wasn't so hard, was it? At this point we don't have Alfresco's SearchService pointing to ElasticSearch, so we can't leverage ElasticSearch within the Alfresco Share UI or the public services. Other than that, I think this was pretty awesome nonetheless.

       

      I want to give a few shout outs to my friends at Simflofy. They've built their product on top of ManifoldCF and created additional features that provide a much more polished end-to-end enterprise product, so take a gander when you get a chance.

      In addition, you can find another blog post on Alfresco, ManifoldCF, and ElasticSearch from one of our top Alfresco partners (Zaizi) here: 

      http://www.zaizi.com/blog/the-next-search-generation-in-alfresco-is-here 

      Finally, I want to give a shout out to my colleague Maurizio Pillitu and the SourceSense team that is working on the Alfresco Web Script repository connection for ManifoldCF. If you would like to contribute, you can find the GitHub project here:

      https://github.com/maoo/alfresco-webscript-manifold-connector 

      Stay tuned for Part 2 where I will build on top of what we accomplished in this article and add Google Drive and file system connectors into the mix. 

      - Keep Calm and Nerd On