Alfresco + Apache ManifoldCF + ElasticSearch: Part 2

This post will build on Alfresco + Apache ManifoldCF + ElasticSearch: Part 1. Specifically, we're going to add a Google Drive connector and a file system connector. The main goal for this post is to better demonstrate how we can aggregate indices from multiple repositories and how we can possibly satisfy federated search requirements. 

Before we can integrate Google Drive with ManifoldCF, we'll need to complete a few prerequisites:

  1. Create a new Google API project
  2. Enable the Google Drive API and SDK services
  3. Create a new Client ID
  4. Generate a long-lived Refresh Token

Create a new Google API project

Start by navigating to the Google APIs Console. If you haven't created an API project before, you'll be prompted to create one.

Enable the Google Drive API and SDK Services

After creating your project, open the Services section in the left-hand menu and enable the Drive API and Drive SDK options.

Create a new Client ID

Switch to the API Access menu, then click the Create an OAuth 2.0 client ID button. Name your project and click the next button.

Choose "Web application" for the application type, change the https dropdown to http, change the hostname to localhost, and click the Create client ID button:

On the following screen, click the Edit Settings... link and add https://developers.google.com/oauthplayground to the redirect URIs list. Record the Client ID and Client Secret values for later use.

Generate a long-lived Refresh Token

If we've done everything correctly, we shouldn't need to go back to the Google APIs Console. The next stop is the Google Developers OAuth 2.0 Playground.

Once you're in the OAuth Playground, click the settings/gear/cog button on the right-hand side. Check the "Use your own OAuth credentials" checkbox and enter the Client ID and Client Secret values we recorded from the Google APIs Console.

In the "Step 1 Select & authorize APIs" menu, scroll down to Drive API v2, expand that menu option, select "https://www.googleapis.com/auth/drive.readonly," and click the Authorize APIs button. When you are prompted to allow access for the Apache ManifoldCF application to your Google Drive account, choose Accept.     

At this point, an authorization code should have been generated automatically. Now we can generate a long-lived Refresh Token by clicking the "Exchange authorization code for tokens" button. Record the Refresh Token value for use in the ManifoldCF configuration.
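
By the way, the playground isn't doing anything magical here. That button is roughly equivalent to the following token request against Google's standard OAuth 2.0 token endpoint; this is just an illustrative sketch, and the placeholder values are your own:

# Exchange the authorization code for an access token and a refresh token
curl -X POST https://accounts.google.com/o/oauth2/token \
  -d "code=YOUR_AUTHORIZATION_CODE" \
  -d "client_id=YOUR_CLIENT_ID" \
  -d "client_secret=YOUR_CLIENT_SECRET" \
  -d "redirect_uri=https://developers.google.com/oauthplayground" \
  -d "grant_type=authorization_code"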

Now that the prerequisites are sorted, we can move on to configuring our Google Drive repository connection in ManifoldCF. In the Repository Connections List, click the "Add new connection" link. Provide a name and description for your Google Drive connection.

On the Type tab, select GoogleDrive as the connection type, leave the authority type at its default value, and click continue.

Keep the default values on the Throttling tab, then continue to the Server tab. Enter the Refresh Token, Client ID, and Client Secret values that we previously generated, then click the save button.
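
If you'd like to sanity-check those credentials outside of ManifoldCF, you can trade the Refresh Token for a short-lived access token yourself, which is more or less what the GoogleDrive connector has to do under the hood on every crawl. A rough sketch against the same token endpoint as before (placeholders are yours):

# Exchange the long-lived refresh token for a short-lived access token
curl -X POST https://accounts.google.com/o/oauth2/token \
  -d "client_id=YOUR_CLIENT_ID" \
  -d "client_secret=YOUR_CLIENT_SECRET" \
  -d "refresh_token=YOUR_REFRESH_TOKEN" \
  -d "grant_type=refresh_token"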

Let's go back to the Repository Connections List and click the "Add new connection" link again. Provide a name and description for our File System connection.

On the Type tab, select File as the connection type, leave the authority type at its default value, and click continue.

Keep the default values on the Throttling tab, then click the save button and our File System connection should be ready to go.

Before we create jobs to crawl Google Drive and the file system, let's prepare some sample invoice documents.
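
For the File System side of things, that just means creating a local directory and dropping a sample TIFF into it. Here's a minimal sketch, assuming your sample files are sitting in ~/Downloads; the directory path matches the one the File System job will crawl later on:

# Create the directory that the File System connector will crawl
mkdir -p /Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices

# Copy a sample invoice into it (adjust the source path to wherever your samples live)
cp ~/Downloads/Invoice03.tif /Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices/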

Alright. Now let's create a new job to crawl Google Drive and export to ElasticSearch.  On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection and "Google Drive" as the repository connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now. 

Keep the defaults for every other tab and move to the Seed Query tab. Let's enter a query that will search for documents where the title contains "Invoice" and the MIME type is equal to "image/tiff", and that excludes documents in the trash. For more on Google Drive's query syntax, refer to the following documentation:

 https://developers.google.com/drive/search-parameters
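
For reference, the criteria above translate into something like the following in Drive API v2 query syntax (treat the exact string as a sketch and adjust it to taste):

title contains 'Invoice' and mimeType = 'image/tiff' and trashed = false

If you want to test the query outside of ManifoldCF, you can hit the Drive API v2 files endpoint directly with an access token grabbed from the OAuth Playground:

# List the files matching the seed query (YOUR_ACCESS_TOKEN is a placeholder)
curl -G "https://www.googleapis.com/drive/v2/files" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  --data-urlencode "q=title contains 'Invoice' and mimeType = 'image/tiff' and trashed = false"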

Finally, click save.

Now let's switch over to the "Status and Job Management" page. Our Google Drive job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment by 1.
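
If you'd rather verify from the command line than from the ManifoldCF UI, a quick count against the index works too. This assumes the same massnerder index and invoices type from Part 1; the exact total depends on what you've crawled so far:

# How many documents have landed in the index so far?
curl -XGET http://localhost:9200/massnerder/invoices/_count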

We've got one more job to configure. So let's create a new job to crawl the file system and export to ElasticSearch. On the Job List page, click the "Add a new job" link and give your new job a name.

On the Connection tab, choose "ElasticSearch" as the Output connection and "File System" as the repository connection, change the priority to "1 (Highest)", and change the Start method to "Start even inside a schedule window."

Click the continue button and go to the Scheduling tab. Change the Schedule type to "Rescan documents dynamically" and leave the rest of the values as the defaults for now.

Switch over to the Repository Paths tab and add the path to your invoices directory on the file system ("/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices" in my case). Remove the option to include folders, since our invoice directory hierarchy is flat. Finally, click the save button.

Now let's switch to the "Status and Job Management" page for the last time.

Our File System - ElasticSearch Job hasn't started yet. So let's click the "Start" link to force the job to start outside of the regular schedule window. After a few moments, we should see the Documents and Processed counts increment.  

Time to query ElasticSearch again. This time, we're going to request an extra "title" field for our Google Drive document.

curl -XGET http://localhost:9200/massnerder/invoices/_search -d '
{
  "fields" : ["cmis:name", "title"],
  "query" : {
    "query_string" : {
      "query" : "Invoice*"
    }
  }
}
'

Voilà!!!

{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "https://doc-14-18-docs.googleusercontent.com/docs/securesc/amkuuenev7losasm7hb5lnkpvd88ufog/vd21rehnm9e62dbi10b4eiefjkhjlhvt/1377482400000/06874029851172517131/06874029851172517131/0BxcKfzVaCdtBemdpRDVuQlM4V0k?h=10387061918782038444&e=download&gd=true",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "title": "Invoice02.tif"
        }
      },
      {
        "_id": "http://localhost:8080/alfresco/cmisatom/1699d3d2-890b-4b0b-9780-09fa8a487562/content/Invoice01.tif?id=workspace%3A%2F%2FSpacesStore%2F0400fdcd-78fd-45bc-8f49-24a53c26182c%3B1.0",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices",
        "fields": {
          "cmis:name": "Invoice01.tif"
        }
      },
      {
        "_id": "file:/Developer/Alfresco/alfresco-manifoldcf-elasticsearch/Invoices/Invoice03.tif",
        "_index": "massnerder",
        "_score": 1.0,
        "_type": "invoices"
      }
    ],
    "max_score": 1.0,
    "total": 3
  },
  "timed_out": false,
  "took": 2
}

Just to recap... we crawled Alfresco, Google Drive, and a File System directory for invoices and exported them to a single ElasticSearch instance. Now that was extremely easy and pretty darn awesome.

- Keep Calm and Nerd On