Create a content connector

A content connector is a software program that traverses the data in an enterprise's repository and populates a data source. Google provides the following options for developing content connectors:

  • The Content Connector SDK. This is a good option if you are programming in Java. The Content Connector SDK is a wrapper around the REST API that allows you to quickly create connectors. To create a content connector using the SDK, refer to Create a content connector using the Content Connector SDK.

  • A low-level REST API or API libraries. Use these options if you're not programming in Java, or if your codebase better accommodates a REST API or a library. To create a content connector using the REST API, refer to Create a content connector using the REST API.

A typical content connector performs the following tasks:

  1. Reads and processes configuration parameters.
  2. Pulls discrete chunks of indexable data, called "items," from the third-party content repository.
  3. Combines ACLs, metadata, and content data into indexable items.
  4. Indexes items to the Cloud Search data source.
  5. (optional) Listens to change notifications from the third-party content repository. Change notifications are converted into indexing requests to keep the Cloud Search data source in sync with the third-party repository. The connector only performs this task if the repository supports change detection.

Create a content connector using the Content Connector SDK

The following sections explain how to create a content connector using the Content Connector SDK.

Set up dependencies

You must include certain dependencies in your build file to use the SDK. The dependencies for each build environment are listed below:

Maven

<dependency>
  <groupId>com.google.enterprise.cloudsearch</groupId>
  <artifactId>google-cloudsearch-indexing-connector-sdk</artifactId>
  <version>v1-0.0.3</version>
</dependency>

Gradle

compile group: 'com.google.enterprise.cloudsearch',
        name: 'google-cloudsearch-indexing-connector-sdk',
        version: 'v1-0.0.3'

Create your connector configuration

Every connector has a configuration file containing parameters used by the connector, such as the ID for your repository. Parameters are defined as key-value pairs, such as api.sourceId=1234567890abcdef.

The Google Cloud Search SDK contains several Google-supplied configuration parameters used by all connectors. You must declare the following Google-supplied parameters in your configuration file:

  • For a content connector, you must declare api.sourceId and api.serviceAccountPrivateKeyFile, as these parameters identify the location of your repository and the private key needed to access the repository.
    Note: If your private key file is not a JSON key, you must also override api.serviceAccountId.
  • For an identity connector, you must declare api.identitySourceId, as this parameter identifies the location of your external identity source. If you are syncing users, you must also declare api.customerId as the unique ID for your enterprise's Google Workspace account.

Unless you want to override the default values of other Google-supplied parameters, you do not need to declare them in your configuration file. For additional information on the Google-supplied configuration parameters, such as how to generate certain IDs and keys, refer to Google-supplied configuration parameters.

You can also define your own repository-specific parameters for use in your configuration file.

Note: There is no strict naming requirement for the connector properties file, but we recommend saving the file using a .properties or .config extension.
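
Because the configuration file is a set of key-value pairs, a quick pre-flight check can catch a missing required parameter before the connector starts. The following is a minimal sketch that uses only the JDK's Properties class; it is not part of the SDK. The class name ConnectorConfigSketch and its methods are hypothetical, and the check covers only the two parameters that this section lists as required for a content connector.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative helper (hypothetical); not part of the Content Connector SDK. */
class ConnectorConfigSketch {
  // Parameters this section says a content connector must declare.
  static final List<String> REQUIRED_KEYS =
      List.of("api.sourceId", "api.serviceAccountPrivateKeyFile");

  /** Parses properties-format text: one key=value pair per line. */
  static Properties parse(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props;
  }

  /** Returns the required keys that are absent or blank. */
  static List<String> missingKeys(Properties props) {
    List<String> missing = new ArrayList<>();
    for (String key : REQUIRED_KEYS) {
      String value = props.getProperty(key);
      if (value == null || value.trim().isEmpty()) {
        missing.add(key);
      }
    }
    return missing;
  }
}
```

For example, a file that declares api.sourceId, api.serviceAccountPrivateKeyFile, and a repository-specific parameter such as sample.documentCount yields an empty list from missingKeys(), while a file that omits the private key entry yields a list containing that key.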

Pass the configuration file to the connector

Set the system property config to pass the configuration file to your connector. You can set the property using the -D argument when starting the connector. For example, the following command starts the connector with the MyConfig.properties configuration file:

java -classpath myconnector.jar;... -Dconfig=MyConfig.properties MyConnector

If this argument is missing, the SDK attempts to access a default configuration file named connector-config.properties.
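
The lookup described above, a system property with a fallback file name, can be illustrated with plain JDK calls. This is only a sketch of the behavior as this section describes it, not the SDK's implementation; ConfigLocationSketch and resolveConfigPath() are hypothetical names.

```java
/** Illustrative only (hypothetical class); mirrors the lookup order described in this section. */
class ConfigLocationSketch {
  // Default file name stated in this section.
  static final String DEFAULT_CONFIG = "connector-config.properties";

  /** Returns the value of -Dconfig=... if set, otherwise the default file name. */
  static String resolveConfigPath() {
    return System.getProperty("config", DEFAULT_CONFIG);
  }
}
```

When the JVM is started with -Dconfig=MyConfig.properties, resolveConfigPath() returns MyConfig.properties; without the argument it returns connector-config.properties.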

Determine your traversal strategy

The primary function of a content connector is to traverse a repository and index its data. You must implement a traversal strategy based on the size and layout of data in your repository. You can design your own strategy or choose from the following strategies implemented in the SDK:

Full traversal strategy

A full traversal strategy scans the entire repository and blindly indexes every item. This strategy is commonly used when you have a small repository and can afford the overhead of doing a full traversal every time you index.

This traversal strategy is suitable for small repositories with mostly static, non-hierarchical data. You might also use this traversal strategy when change detection is difficult or not supported by the repository.

List traversal strategy

A list traversal strategy scans the entire repository, including all child nodes, determining the status of each item. Then, the connector takes a second pass and only indexes items that are new or have been updated since the last indexing. This strategy is commonly used to perform incremental updates to an existing index (instead of having to do a full traversal every time you update the index).

This traversal strategy is suitable when change detection is difficult or not supported by the repository, you have non-hierarchical data, and you are working with very large data sets.
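
The two-pass idea can be sketched without the SDK: first collect a status marker for every item, then select only the items whose marker is new or different. The sketch below assumes that status is tracked as a per-item hash string held in in-memory maps; ListTraversalSketch and idsToIndex() are hypothetical names, and a real connector delegates this bookkeeping to the Cloud Search Indexing Queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative only (hypothetical class); models the second pass of a list traversal. */
class ListTraversalSketch {
  /**
   * Returns, in sorted order, the IDs whose hash is absent from or different
   * from the hash recorded at the previous indexing.
   */
  static List<String> idsToIndex(
      Map<String, String> repositoryHashes, Map<String, String> indexedHashes) {
    List<String> changed = new ArrayList<>();
    for (Map.Entry<String, String> entry : repositoryHashes.entrySet()) {
      String previous = indexedHashes.get(entry.getKey());
      if (!entry.getValue().equals(previous)) {
        changed.add(entry.getKey()); // new item or modified item
      }
    }
    Collections.sort(changed);
    return changed;
  }
}
```

Unchanged items never reach the indexing step, which is what makes this approach cheaper than a full traversal on large repositories.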

Graph traversal

A graph traversal strategy scans the entire parent node, determining the status of each item. Then, the connector takes a second pass and indexes only the items in the root node that are new or have been updated since the last indexing. Finally, the connector passes any child IDs, then indexes the items in the child nodes that are new or have been updated. The connector continues recursively through all child nodes until all items have been addressed. This type of traversal is typically used for hierarchical repositories where listing all IDs isn't practical.

This strategy is suitable if you have hierarchical data that needs to be crawled, such as a series of directories or web pages.
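
The parent-then-children order can be sketched with a work queue: handle an item, enqueue its child IDs, and repeat until the queue is empty. The example below assumes the hierarchy is available as an in-memory map from a parent ID to its child IDs; GraphTraversalSketch and traverse() are hypothetical names, and in a real connector the Cloud Search Indexing Queue plays the role of the local queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only (hypothetical class); visits a hierarchy parent-first. */
class GraphTraversalSketch {
  /** Returns item IDs in the order they would be handled, starting at rootId. */
  static List<String> traverse(String rootId, Map<String, List<String>> children) {
    List<String> visited = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    queue.add(rootId);
    seen.add(rootId);
    while (!queue.isEmpty()) {
      String id = queue.remove();
      visited.add(id); // a connector would index the item here
      for (String child : children.getOrDefault(id, List.of())) {
        if (seen.add(child)) { // skip IDs that were already queued
          queue.add(child);
        }
      }
    }
    return visited;
  }
}
```

The seen set guards against cycles, which can occur when the "hierarchy" is really a graph of linked pages.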

Note: The terms “item” and “document” are synonymous in this document and sample code.

Each of these traversal strategies is implemented by a template connector class in the SDK. While you can implement your own traversal strategy, these templates greatly speed up the development of your connector. To create a connector using a template, proceed to the section that corresponds to your traversal strategy.

Create a full traversal connector using a template class

This section of the docs refers to code snippets from the FullTraversalSample example.

Implement the connector’s entry point

The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.

Before calling application.start(), use the IndexingApplication.Builder class to instantiate the FullTraversalConnector template. The FullTraversalConnector accepts a Repository object whose methods you implement. The following code snippet shows how to implement the main() method:

FullTraversalSample.java
/**
 * This sample connector uses the Cloud Search SDK template class for a full
 * traversal connector.
 *
 * @param args program command line arguments
 * @throws InterruptedException thrown if an abort is issued during initialization
 */
public static void main(String[] args) throws InterruptedException {
  Repository repository = new SampleRepository();
  IndexingConnector connector = new FullTraversalConnector(repository);
  IndexingApplication application = new IndexingApplication.Builder(connector, args).build();
  application.start();
}

Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method performs the following tasks:

  1. Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
  2. Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.

Implement the Repository interface

The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the FullTraversalConnector, override the following methods:

  • The init() method. To perform any data repository set-up and initialization, override the init() method.

  • The getAllDocs() method. To traverse and index all items in the data repository, override the getAllDocs() method. This method is called once for each scheduled traversal (as defined by your configuration).

  • (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.

  • (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.

Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.

Get custom configuration parameters

As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.

The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The following snippet, from FullTraversalSample, shows how to retrieve a single custom integer value from a Configuration object:

FullTraversalSample.java
@Override
public void init(RepositoryContext context) {
  log.info("Initializing repository");
  numberOfDocuments = Configuration.getInteger("sample.documentCount", 10).get();
}

To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The following snippet, from the tutorial connector, uses the getMultiValue method to get a list of GitHub repository names:

GithubRepository.java
ConfigValue<List<String>> repos = Configuration.getMultiValue(
    "github.repos",
    Collections.emptyList(),
    Configuration.STRING_PARSER);
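
To make "parse the data into discrete chunks" concrete, the sketch below splits one raw parameter value into a list of strings using only the JDK. It assumes the values are separated by commas, which is an assumption for illustration and should be checked against the SDK's documentation; MultiValueSketch and parseList() are hypothetical names and do not reproduce the SDK's parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only (hypothetical class); splits a multi-value parameter. */
class MultiValueSketch {
  /** Splits on commas (assumed delimiter), trims whitespace, and drops empty entries. */
  static List<String> parseList(String raw) {
    List<String> values = new ArrayList<>();
    if (raw == null) {
      return values; // an undeclared parameter yields an empty list
    }
    for (String part : raw.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        values.add(trimmed);
      }
    }
    return values;
  }
}
```

Returning an empty list for a missing parameter mirrors the Collections.emptyList() default passed to getMultiValue in the snippet above.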

Perform a full traversal

Override getAllDocs() to perform a full traversal and index your repository. The getAllDocs() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getAllDocs() method:

  1. Set permissions.
  2. Set the metadata for the item that you are indexing.
  3. Combine the metadata and item into one indexable RepositoryDoc.
  4. Package each indexable item into an iterator returned by the getAllDocs() method. Note that getAllDocs() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as indexing it.

If the set of items is too large to process in a single call, include a checkpoint and set hasMore(true) to indicate more items are available for indexing.
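
The checkpoint and hasMore mechanism amounts to paging: each call handles a slice of the repository and records where the next call should resume. The sketch below shows that pattern with plain JDK types. It assumes the checkpoint is an opaque byte array that the connector encodes itself (here, the next offset as UTF-8 text); CheckpointSketch, Page, and nextPage() are hypothetical names rather than SDK classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only (hypothetical class); pages through IDs using a resumable checkpoint. */
class CheckpointSketch {
  /** One batch of IDs plus the state needed to fetch the next batch. */
  static final class Page {
    final List<Integer> ids;
    final byte[] checkpoint;
    final boolean hasMore;

    Page(List<Integer> ids, byte[] checkpoint, boolean hasMore) {
      this.ids = ids;
      this.checkpoint = checkpoint;
      this.hasMore = hasMore;
    }
  }

  /** Returns the batch that starts at the offset stored in the checkpoint (0 if none). */
  static Page nextPage(List<Integer> allIds, byte[] checkpoint, int pageSize) {
    int start = (checkpoint == null || checkpoint.length == 0)
        ? 0
        : Integer.parseInt(new String(checkpoint, StandardCharsets.UTF_8));
    int end = Math.min(start + pageSize, allIds.size());
    List<Integer> ids = new ArrayList<>(allIds.subList(start, end));
    byte[] next = Integer.toString(end).getBytes(StandardCharsets.UTF_8);
    return new Page(ids, next, end < allIds.size());
  }
}
```

Calling nextPage() repeatedly with the checkpoint from the previous page walks the whole list; if the process stops midway, restarting with the last saved checkpoint resumes at the same offset instead of at the beginning.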

Set the permissions for an item

Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.

You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.

The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.

Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.

Use the Acl.Builder class to set access to each item using an ACL. The following code snippet, taken from the full traversal sample, allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.

FullTraversalSample.java
// Make the document publicly readable within the domain
Acl acl = new Acl.Builder()
    .setReaders(Collections.singletonList(Acl.getCustomerPrincipal()))
    .build();

You need to understand ACLs to properly model ACLs for the repository. For example, you might be indexing files within a file system that uses some sort of inheritance model whereby child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.
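
As an illustration of why inheritance complicates modeling, the sketch below computes the effective readers of an item in a deliberately simplified model where a child inherits every reader of its ancestors. This is not how Cloud Search evaluates ACLs and it does not use the SDK's Acl classes; AclInheritanceSketch, effectiveReaders(), and the in-memory maps are hypothetical.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only (hypothetical class); a simplified "children inherit readers" model. */
class AclInheritanceSketch {
  /**
   * Walks from the item up through its ancestors and collects every reader
   * granted along the way.
   */
  static Set<String> effectiveReaders(
      String itemId, Map<String, String> parentOf, Map<String, Set<String>> ownReaders) {
    Set<String> readers = new TreeSet<>();
    Set<String> seen = new HashSet<>();
    String current = itemId;
    while (current != null && seen.add(current)) { // stop at the root or on a cycle
      readers.addAll(ownReaders.getOrDefault(current, Set.of()));
      current = parentOf.get(current);
    }
    return readers;
  }
}
```

A connector that flattens permissions this way must re-index every descendant whenever a folder's ACL changes, which is one reason the dedicated ACL documentation is worth reading before choosing a model.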

Set the metadata for an item

Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. The following code snippet shows how to build an Item using the IndexingItemBuilder helper class.

FullTraversalSample.java
// Url is required. Use google.com as a placeholder for this sample.
String viewUrl = "https://www.google.com";

// Version is required, set to current timestamp.
byte[] version = Longs.toByteArray(System.currentTimeMillis());

// Using the SDK item builder class to create the document with appropriate attributes
// (this can be expanded to include metadata fields etc.)
Item item = IndexingItemBuilder.fromConfiguration(Integer.toString(id))
    .setItemType(IndexingItemBuilder.ItemType.CONTENT_ITEM)
    .setAcl(acl)
    .setSourceRepositoryUrl(IndexingItemBuilder.FieldOrValue.withValue(viewUrl))
    .setVersion(version)
    .build();

Create the indexable item

Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder class. The following example shows how to create a single indexable item.

FullTraversalSample.java
// For this sample, content is just plain text
String content = String.format("Hello world from sample doc %d", id);
ByteArrayContent byteContent = ByteArrayContent.fromString("text/plain", content);

// Create the fully formed document
RepositoryDoc doc = new RepositoryDoc.Builder()
    .setItem(item)
    .setContent(byteContent, IndexingService.ContentFormat.TEXT)
    .build();

A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.

You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:

ASYNCHRONOUS
Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.

Package each indexable item in an iterator

The getAllDocs() method returns an Iterator, specifically a CheckpointCloseableIterable, of RepositoryDoc objects. You can use the CheckpointCloseableIterableImpl.Builder class to construct and return an iterator. The following code snippet shows how to construct and return an iterator.

FullTraversalSample.java
CheckpointCloseableIterable<ApiOperation> iterator =
    new CheckpointCloseableIterableImpl.Builder<>(allDocs).build();

The SDK executes each indexing call enclosed within the iterator.

Note: Cloud Search indexes the metadata for all files. Depending on the file type of an item, Cloud Search also indexes the text content, including using Optical Character Recognition (OCR) on image and other files. For a list of file types supporting text extraction, refer to Supported file types for text extraction.

Create a list traversal connector using a template class

The Cloud Search Indexing Queue is used to hold IDs and optional hash values for each item in the repository. A list traversal connector pushes item IDs to the Google Cloud Search Indexing Queue and retrieves them one at a time for indexing. Google Cloud Search maintains queues and compares queue contents to determine item status, such as whether an item has been deleted from the repository. For further information on the Cloud Search Indexing Queue, refer to The Cloud Search Indexing Queue.

This section of the docs refers to code snippets from the ListTraversalSample example.

Implement the connector’s entry point

The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.

Before calling application.start(), use the IndexingApplication.Builder class to instantiate the ListingConnector template. The ListingConnector accepts a Repository object whose methods you implement. The following snippet shows how to instantiate the ListingConnector and its associated Repository:

ListTraversalSample.java
/**
 * This sample connector uses the Cloud Search SDK template class for a
 * list traversal connector.
 *
 * @param args program command line arguments
 * @throws InterruptedException thrown if an abort is issued during initialization
 */
public static void main(String[] args) throws InterruptedException {
  Repository repository = new SampleRepository();
  IndexingConnector connector = new ListingConnector(repository);
  IndexingApplication application = new IndexingApplication.Builder(connector, args).build();
  application.start();
}

Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method:

  1. Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
  2. Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.

Implement the Repository interface

The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the ListingConnector, override the following methods:

  • The init() method. To perform any data repository set-up and initialization, override the init() method.

  • The getIds() method. To retrieve IDs and hash values for all records in the repository, override the getIds() method.

  • The getDoc() method. To add, update, or delete items in the index, override the getDoc() method.

  • (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.

  • (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.

Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.

Get custom configuration parameters

As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.

The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The following snippet, from FullTraversalSample, shows how to retrieve a single custom integer value from a Configuration object:

FullTraversalSample.java
@Override
public void init(RepositoryContext context) {
  log.info("Initializing repository");
  numberOfDocuments = Configuration.getInteger("sample.documentCount", 10).get();
}

To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The following snippet, from the tutorial connector, uses the getMultiValue method to get a list of GitHub repository names:

GithubRepository.java
ConfigValue<List<String>> repos = Configuration.getMultiValue(
    "github.repos",
    Collections.emptyList(),
    Configuration.STRING_PARSER);

Perform the list traversal

Override the getIds() method to retrieve IDs and hash values for all records in the repository. The getIds() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted.

Next, override the getDoc() method to handle each item in the Cloud Search Indexing Queue.

Push item IDs and hash values

Override getIds() to fetch the item IDs and their associated content hash values from the repository. ID and hash value pairs are then packaged into a push operation request to the Cloud Search Indexing Queue. Root or parent IDs are typically pushed first, followed by child IDs, until the entire hierarchy of items has been processed.

The getIds() method accepts a checkpoint representing the last item to be indexed. The checkpoint can be used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getIds() method:

  • Get each item ID and associated hash value from the repository.
  • Package each ID and hash value pair into a PushItems.
  • Combine each PushItems into an iterator returned by the getIds() method. Note that getIds() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as pushing the items to the queue.

The following code snippet shows how to get each item ID and hash value and insert them into a PushItems. A PushItems is an ApiOperation request to push an item to the Cloud Search Indexing Queue.

ListTraversalSample.java
PushItems.Builder allIds = new PushItems.Builder();
for (Map.Entry<Integer, Long> entry : this.documents.entrySet()) {
  String documentId = Integer.toString(entry.getKey());
  String hash = this.calculateMetadataHash(entry.getKey());
  PushItem item = new PushItem().setMetadataHash(hash);
  log.info("Pushing " + documentId);
  allIds.addPushItem(documentId, item);
}

The following code snippet shows how to use the PushItems.Builder class to package the IDs and hash values into a single push ApiOperation.

ListTraversalSample.java
ApiOperation pushOperation = allIds.build();
CheckpointCloseableIterable<ApiOperation> iterator =
    new CheckpointCloseableIterableImpl.Builder<>(
        Collections.singletonList(pushOperation))
    .build();
return iterator;

Items are pushed to the Cloud Search Indexing Queue for further processing.

Retrieve and handle each item

Override getDoc() to handle each item in the Cloud Search Indexing Queue. An item can be new, modified, unchanged, or can no longer exist in the source repository. Retrieve and index each item that is new or modified. Remove items from the index that no longer exist in the source repository.

The getDoc() method accepts an Item from the Google Cloud Search Indexing Queue. For each item in the queue, perform these steps in the getDoc() method:

  1. Check if the item’s ID, within the Cloud Search Indexing Queue, exists in the repository. If not, delete the item from the index.

  2. Poll the index for item status and, if an item is unchanged (ACCEPTED), don’t do anything.

  3. Index changed or new items:

    1. Set the permissions.
    2. Set the metadata for the item that you are indexing.
    3. Combine the metadata and item into one indexable RepositoryDoc.
    4. Return the RepositoryDoc.

Note: The ListingConnector template doesn't support returning null from the getDoc() method. Returning null results in a NullPointerException.
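
The three outcomes listed above (delete, leave unchanged, index) can be expressed as one small decision function. The sketch below uses plain JDK types so that it runs without the SDK; GetDocDecisionSketch, Action, and decide() are hypothetical names, and the repository is assumed to be an in-memory map from item ID to its current metadata hash.

```java
import java.util.Map;

/** Illustrative only (hypothetical class); chooses what to do with one queued item. */
class GetDocDecisionSketch {
  enum Action { DELETE, ACKNOWLEDGE_UNCHANGED, INDEX }

  /**
   * @param itemId           ID of the queued item
   * @param repositoryHashes current metadata hash of every item in the repository
   * @param queuedStatus     status code reported for the queued item, for example "ACCEPTED"
   * @param queuedHash       metadata hash stored with the queued item, or null
   */
  static Action decide(
      String itemId,
      Map<String, String> repositoryHashes,
      String queuedStatus,
      String queuedHash) {
    String currentHash = repositoryHashes.get(itemId);
    if (currentHash == null) {
      return Action.DELETE; // the item no longer exists in the repository
    }
    if ("ACCEPTED".equals(queuedStatus) && currentHash.equals(queuedHash)) {
      return Action.ACKNOWLEDGE_UNCHANGED; // already indexed with the same metadata
    }
    return Action.INDEX; // new or modified
  }
}
```

Every branch returns a value, which matches the note above: the method always hands back an operation rather than null.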

Handle deleted items

The following code snippet shows how to determine if an item exists in the repository and, if not, delete it.

ListTraversalSample.java
String resourceName = item.getName();
int documentId = Integer.parseInt(resourceName);

if (!documents.containsKey(documentId)) {
  // Document no longer exists -- delete it
  log.info(() -> String.format("Deleting document %s", item.getName()));
  return ApiOperations.deleteItem(resourceName);
}

Note that documents is a data structure representing the repository. If documentId is not found in documents, return ApiOperations.deleteItem(resourceName) to delete the item from the index.

Handle unchanged items

The following code snippet shows how to poll item status in the Cloud Search Indexing Queue and handle an unchanged item.

ListTraversalSample.java
String currentHash = this.calculateMetadataHash(documentId);
if (this.canSkipIndexing(item, currentHash)) {
  // Document neither modified nor deleted, ack the push
  log.info(() -> String.format("Document %s not modified", item.getName()));
  PushItem pushItem = new PushItem().setType("NOT_MODIFIED");
  return new PushItems.Builder().addPushItem(resourceName, pushItem).build();
}

To determine if the item is unmodified, check the status of the item as well as other metadata that may indicate a change. In the example, the metadata hash is used to determine if the item has been changed.

ListTraversalSample.java
/**
 * Checks to see if an item is already up to date
 *
 * @param previousItem Polled item
 * @param currentHash  Metadata hash of the current github object
 * @return true if the item is unchanged and indexing can be skipped
 */
private boolean canSkipIndexing(Item previousItem, String currentHash) {
  if (previousItem.getStatus() == null || previousItem.getMetadata() == null) {
    return false;
  }
  String status = previousItem.getStatus().getCode();
  String previousHash = previousItem.getMetadata().getHash();
  return "ACCEPTED".equals(status)
      && previousHash != null
      && previousHash.equals(currentHash);
}
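
The snippets in this section call calculateMetadataHash(), but its body is not shown here. One plausible way to derive such a hash, offered as a sketch and not as the sample's actual implementation, is to digest the metadata fields that should trigger re-indexing when they change. MetadataHashSketch, metadataHash(), and the choice of title plus last-modified time as inputs are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only (hypothetical class); derives a change-detection hash from metadata. */
class MetadataHashSketch {
  /** Returns the SHA-256 digest of the given fields as 64 lowercase hex characters. */
  static String metadataHash(String title, long lastModifiedMillis) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      // Join the fields with a separator so that field boundaries stay unambiguous.
      String input = title + "\n" + lastModifiedMillis;
      byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : bytes) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }
}
```

The same inputs always produce the same hash, so comparing the stored hash with a freshly computed one is enough to tell whether any of the chosen fields changed.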

Set the permissions for an item

Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.

You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.

The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.

Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.

Use the Acl.Builder class to set access to each item using an ACL. The following code snippet, taken from the full traversal sample, allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.

FullTraversalSample.java
// Make the document publicly readable within the domain
Acl acl = new Acl.Builder()
    .setReaders(Collections.singletonList(Acl.getCustomerPrincipal()))
    .build();

You need to understand ACLs to properly model ACLs for the repository. For example, you might be indexing files within a file system that uses some sort of inheritance model whereby child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.

Set the metadata for an item

Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. The following code snippet shows how to build an Item using the IndexingItemBuilder helper class.

ListTraversalSample.java
// Url is required. Use google.com as a placeholder for this sample.
String viewUrl = "https://www.google.com";

// Version is required, set to current timestamp.
byte[] version = Longs.toByteArray(System.currentTimeMillis());

// Set metadata hash so queue can detect changes
String metadataHash = this.calculateMetadataHash(documentId);

// Using the SDK item builder class to create the document with
// appropriate attributes. This can be expanded to include metadata
// fields etc.
Item item = IndexingItemBuilder.fromConfiguration(Integer.toString(documentId))
    .setItemType(IndexingItemBuilder.ItemType.CONTENT_ITEM)
    .setAcl(acl)
    .setSourceRepositoryUrl(IndexingItemBuilder.FieldOrValue.withValue(viewUrl))
    .setVersion(version)
    .setHash(metadataHash)
    .build();

Create an indexable item

Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder. The following example shows how to create a single indexable item.

ListTraversalSample.java
// For this sample, content is just plain text
String content = String.format("Hello world from sample doc %d", documentId);
ByteArrayContent byteContent = ByteArrayContent.fromString("text/plain", content);

// Create the fully formed document
RepositoryDoc doc = new RepositoryDoc.Builder()
    .setItem(item)
    .setContent(byteContent, IndexingService.ContentFormat.TEXT)
    .build();

A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.

You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:

ASYNCHRONOUS
Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.

Note: Cloud Search indexes the metadata for all files. Depending on the file type of an item, Cloud Search also indexes the text content, including using Optical Character Recognition (OCR) on hand-written content. For a list of file types supporting text extraction, refer to Supported file types for text extraction.

Create a graph traversal connector using a template class

The Cloud Search Indexing Queue is used to hold IDs and optional hash values for each item in the repository. A graph traversal connector pushes item IDs to the Google Cloud Search Indexing Queue and retrieves them one at a time for indexing. Google Cloud Search maintains queues and compares queue contents to determine item status, such as whether an item has been deleted from the repository. For further information on the Cloud Search Indexing Queue, refer to The Google Cloud Search Indexing Queue.

During indexing, the item content is fetched from the data repository and any child item IDs are pushed to the queue. The connector proceeds recursively, processing parent and child IDs until all items are handled.

This section of the docs refers to code snippets from the GraphTraversalSample example.

Implement the connector’s entry point

The entry point to a connector is the main() method. This method’s primary task is to create an instance of the Application class and invoke its start() method to run the connector.

Before calling application.start(), use the IndexingApplication.Builder class to instantiate the ListingConnector template. The ListingConnector accepts a Repository object whose methods you implement.

Note: The graph traversal strategy uses the same template class, ListingConnector, as the list traversal strategy.

The following snippet shows how to instantiate the ListingConnector and its associated Repository:

GraphTraversalSample.java
/**
 * This sample connector uses the Cloud Search SDK template class for a graph
 * traversal connector.
 *
 * @param args program command line arguments
 * @throws InterruptedException thrown if an abort is issued during initialization
 */
public static void main(String[] args) throws InterruptedException {
  Repository repository = new SampleRepository();
  IndexingConnector connector = new ListingConnector(repository);
  IndexingApplication application = new IndexingApplication.Builder(connector, args).build();
  application.start();
}

Behind the scenes, the SDK calls the initConfig() method after your connector’s main() method calls Application.build. The initConfig() method:

  1. Calls the Configuration.isInitialized() method to ensure that the Configuration hasn’t been initialized.
  2. Initializes a Configuration object with the Google-supplied key-value pairs. Each key-value pair is stored in a ConfigValue object within the Configuration object.

Implement the Repository interface

The sole purpose of the Repository object is to perform the traversal and indexing of repository items. When using a template, you need only override certain methods within the Repository interface to create a content connector. The methods you override depend on the template and traversal strategy you use. For the ListingConnector, you override the following methods:

  • The init() method. To perform any data repository set-up and initialization, override the init() method.

  • The getIds() method. To retrieve IDs and hash values for all records in the repository, override the getIds() method.

  • The getDoc() method. To add, update, or delete items in the index, override the getDoc() method.

  • (optional) The getChanges() method. If your repository supports change detection, override the getChanges() method. This method is called once for each scheduled incremental traversal (as defined by your configuration) to retrieve modified items and index them.

  • (optional) The close() method. If you need to perform repository cleanup, override the close() method. This method is called once during shutdown of the connector.

Each of the methods of the Repository object returns some type of ApiOperation object. An ApiOperation object performs an action in the form of a single, or perhaps multiple, IndexingService.indexItem() calls to perform the actual indexing of your repository.

Get custom configuration parameters

As part of handling your connector’s configuration, you will need to get any custom parameters from the Configuration object. This task is usually performed in a Repository class's init() method.

The Configuration class has several methods for getting different data types from a configuration. Each method returns a ConfigValue object. You will then use the ConfigValue object’s get() method to retrieve the actual value. The following snippet, from FullTraversalSample, shows how to retrieve a single custom integer value from a Configuration object:

FullTraversalSample.java
@Override
public void init(RepositoryContext context) {
  log.info("Initializing repository");
  numberOfDocuments = Configuration.getInteger("sample.documentCount", 10).get();
}

To get and parse a parameter containing several values, use one of the Configuration class's type parsers to parse the data into discrete chunks. The following snippet, from the tutorial connector, uses the getMultiValue method to get a list of GitHub repository names:

GithubRepository.java
ConfigValue<List<String>> repos = Configuration.getMultiValue(
    "github.repos",
    Collections.emptyList(),
    Configuration.STRING_PARSER);
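To make the two lookups above concrete, here is a minimal plain-Java sketch of the same idea using java.util.Properties. It does not use the SDK's Configuration class; the `ConfigSketch` helper and the assumption that multi-value parameters are comma-separated are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

/** Hypothetical stand-in for configuration lookups; not the SDK's Configuration class. */
final class ConfigSketch {

  /** Returns the integer stored under key, or defaultValue when the key is absent. */
  static int getInteger(Properties props, String key, int defaultValue) {
    String raw = props.getProperty(key);
    return raw == null ? defaultValue : Integer.parseInt(raw.trim());
  }

  /** Splits a comma-separated value (an assumption) into trimmed, non-empty entries. */
  static List<String> getMultiValue(Properties props, String key) {
    String raw = props.getProperty(key, "");
    return Arrays.stream(raw.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("github.repos", "google/guava, google/gson");
    System.out.println(getInteger(props, "sample.documentCount", 10)); // prints 10
    System.out.println(getMultiValue(props, "github.repos")); // prints [google/guava, google/gson]
  }
}
```

As in the SDK snippets, a missing key falls back to a default (10 for the integer, an empty list for the multi-value parameter).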

Perform the graph traversal

Override the getIds() method to retrieve IDs and hash values for all records in the repository. The getIds() method accepts a checkpoint. The checkpoint is used to resume indexing at a specific item should the process be interrupted.

Next, override the getDoc() method to handle each item in the Cloud Search Indexing Queue.

Push item IDs and hash values

Override getIds() to fetch the item IDs and their associated content hash values from the repository. ID and hash value pairs are then packaged into a push operation request to the Cloud Search Indexing Queue. Root or parent IDs are typically pushed first, followed by child IDs, until the entire hierarchy of items has been processed.

The getIds() method accepts a checkpoint representing the last item to be indexed. The checkpoint can be used to resume indexing at a specific item should the process be interrupted. For each item in your repository, perform these steps in the getIds() method:

  • Get each item ID and associated hash value from the repository.
  • Package each ID and hash value pair into a PushItems.
  • Combine each PushItems into an iterator returned by the getIds() method. Note that getIds() actually returns a CheckpointCloseableIterable, which is an iteration of ApiOperation objects, each object representing an API request performed on a RepositoryDoc, such as pushing the items to the queue.
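The first two steps above, pairing each ID with a content hash, can be sketched without the SDK. The `PushEntriesSketch` class, the choice of SHA-256, and the in-memory `contentById` map are assumptions made for illustration; a real connector would compute hashes from its own repository data and package the pairs into PushItems.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that pairs item IDs with content hashes. */
final class PushEntriesSketch {

  /** Returns a lowercase hex SHA-256 digest of the content. */
  static String contentHash(String content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] bytes = digest.digest(content.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : bytes) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** Maps each item ID to the hash of its content, in the input map's iteration order. */
  static Map<String, String> toPushEntries(Map<String, String> contentById) {
    Map<String, String> entries = new LinkedHashMap<>();
    contentById.forEach((id, content) -> entries.put(id, contentHash(content)));
    return entries;
  }
}
```

Because the hash changes whenever the content changes, comparing a stored hash with a freshly computed one is enough to tell whether an item needs to be reindexed.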

The following code snippet shows how to get each item ID and hash value and insert them into a PushItems. A PushItems is an ApiOperation request to push an item to the Cloud Search Indexing Queue.

GraphTraversalSample.java
PushItems.Builder allIds = new PushItems.Builder();
PushItem item = new PushItem();
allIds.addPushItem("root", item);

The following code snippet shows how to use the PushItems.Builder class to package the IDs and hash values into a single push ApiOperation.

GraphTraversalSample.java
ApiOperation pushOperation = allIds.build();
CheckpointCloseableIterable<ApiOperation> iterator =
    new CheckpointCloseableIterableImpl.Builder<>(
        Collections.singletonList(pushOperation))
    .build();

Items are pushed to the Cloud Search Indexing Queue for further processing.

Retrieve and handle each item

Override getDoc() to handle each item in the Cloud Search Indexing Queue. An item can be new, modified, unchanged, or can no longer exist in the source repository. Retrieve and index each item that is new or modified. Remove items from the index that no longer exist in the source repository.

The getDoc() method accepts an Item from the Cloud Search Indexing Queue. For each item in the queue, perform these steps in the getDoc() method:

  1. Check if the item’s ID, within the Cloud Search Indexing Queue, exists in the repository. If not, delete the item from the index. If the item does exist, continue with the next step.

  2. Index changed or new items:

    1. Set the permissions.
    2. Set the metadata for the item that you are indexing.
    3. Combine the metadata and item into one indexable RepositoryDoc.
    4. Place the child IDs in the Cloud Search Indexing Queue for further processing.
    5. Return the RepositoryDoc.
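The decision made at the top of getDoc() can be modeled as a small function. The sketch below is plain Java with hypothetical names (`ItemStatusSketch`, `Action`, `repositoryHashes`); it assumes the connector can look up a current content hash per ID and compare it with the hash recorded when the item was last queued.

```java
import java.util.Map;

/** Hypothetical model of the per-item decision; not an SDK class. */
final class ItemStatusSketch {

  enum Action { DELETE, INDEX, SKIP }

  /**
   * Decides what to do with a queued item.
   *
   * @param id the queued item ID
   * @param queuedHash hash recorded for the item in the queue, or null if none
   * @param repositoryHashes current hashes of the items that exist in the repository
   */
  static Action decide(String id, String queuedHash, Map<String, String> repositoryHashes) {
    String currentHash = repositoryHashes.get(id);
    if (currentHash == null) {
      return Action.DELETE; // the item no longer exists in the repository
    }
    if (currentHash.equals(queuedHash)) {
      return Action.SKIP; // unchanged since it was last queued
    }
    return Action.INDEX; // new or modified
  }
}
```

For example, with a repository containing only `doc1` with hash `h2`, `decide("doc1", "h1", ...)` returns INDEX and `decide("doc2", "h1", ...)` returns DELETE.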

Handle deleted items

The following code snippet shows how to determine if an item exists in the repository and, if not, delete it from the index.

GraphTraversalSample.java
String resourceName = item.getName();
if (documentExists(resourceName)) {
  return buildDocumentAndChildren(resourceName);
}
// Document doesn't exist, delete it
log.info(() -> String.format("Deleting document %s", resourceName));
return ApiOperations.deleteItem(resourceName);

Set the permissions for an item

Your repository uses an Access Control List (ACL) to identify the users or groups that have access to an item. An ACL is a list of IDs for groups or users who can access the item.

You must duplicate the ACL used by your repository to ensure only those users with access to an item can see that item within a search result. The ACL for an item must be included when indexing an item so that Google Cloud Search has the information it needs to provide the correct level of access to the item.

The Content Connector SDK provides a rich set of ACL classes and methods to model the ACLs of most repositories. You must analyze the ACL for each item in your repository and create a corresponding ACL for Google Cloud Search when you index an item. If your repository’s ACL employs concepts such as ACL inheritance, modeling that ACL can be tricky. For further information on Google Cloud Search ACLs, refer to Google Cloud Search ACLs.

Note: The Cloud Search Indexing API supports single-domain ACLs. It does not support cross-domain ACLs.

Use the Acl.Builder class to set access to each item using an ACL. The following code snippet, taken from the full traversal sample, allows all users or “principals” (getCustomerPrincipal()) to be “readers” of all items (.setReaders()) when performing a search.

FullTraversalSample.java
// Make the document publicly readable within the domain
Acl acl = new Acl.Builder()
    .setReaders(Collections.singletonList(Acl.getCustomerPrincipal()))
    .build();

You need to understand ACLs to properly model ACLs for the repository. For example, you might be indexing files within a file system that uses some sort of inheritance model whereby child folders inherit permissions from parent folders. Modeling ACL inheritance requires additional information covered in Google Cloud Search ACLs.
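To illustrate why inheritance complicates ACL modeling, here is a minimal sketch of one simple model in which an item's readers are the union of its own readers and those of every ancestor folder. The `AclInheritanceSketch` class and its maps are hypothetical, and this union rule is only an assumption about a source repository; it is not a description of how Cloud Search evaluates inherited ACLs.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of union-style ACL inheritance in a source repository. */
final class AclInheritanceSketch {

  /**
   * Collects readers by walking from the item up through its parents.
   *
   * @param parentById maps each item ID to its parent ID (roots are absent)
   * @param readersById maps each item ID to the readers granted directly on it
   */
  static Set<String> effectiveReaders(
      String itemId,
      Map<String, String> parentById,
      Map<String, Set<String>> readersById) {
    Set<String> readers = new LinkedHashSet<>();
    Set<String> visited = new HashSet<>();
    String current = itemId;
    // visited.add() returns false on a repeat, which guards against cycles.
    while (current != null && visited.add(current)) {
      readers.addAll(readersById.getOrDefault(current, Set.of()));
      current = parentById.get(current);
    }
    return readers;
  }
}
```

A repository where child permissions override rather than extend parent permissions would need a different rule, which is why the ACL for each item has to be analyzed before it is mapped to a Cloud Search ACL.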

Set the metadata for an item

Metadata is stored in an Item object. To create an Item, you need a minimum of a unique string ID, item type, ACL, URL, and version for the item. The following code snippet shows how to build an Item using the IndexingItemBuilder helper class.

GraphTraversalSample.java
// Url is required. Use google.com as a placeholder for this sample.
String viewUrl = "https://www.google.com";

// Version is required, set to current timestamp.
byte[] version = Longs.toByteArray(System.currentTimeMillis());

// Using the SDK item builder class to create the document with
// appropriate attributes. This can be expanded to include metadata
// fields etc.
Item item = IndexingItemBuilder.fromConfiguration(documentId)
    .setItemType(IndexingItemBuilder.ItemType.CONTENT_ITEM)
    .setAcl(acl)
    .setSourceRepositoryUrl(IndexingItemBuilder.FieldOrValue.withValue(viewUrl))
    .setVersion(version)
    .build();
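The sample above derives the version from the current timestamp. The following sketch shows, using only the Java standard library, how a timestamp can be encoded as bytes so that a later timestamp sorts after an earlier one. It assumes that versions are compared as byte strings in lexical order; the `VersionSketch` class and its methods are hypothetical and use `ByteBuffer` in place of Guava's `Longs.toByteArray`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical helper for timestamp-based versions; not an SDK class. */
final class VersionSketch {

  /** Encodes a non-negative timestamp as 8 big-endian bytes. */
  static byte[] toVersion(long timestampMillis) {
    // ByteBuffer writes in big-endian order by default.
    return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
  }

  /** Returns true when candidate sorts strictly after current, comparing bytes as unsigned. */
  static boolean isNewer(byte[] candidate, byte[] current) {
    return Arrays.compareUnsigned(candidate, current) > 0;
  }
}
```

Big-endian encoding matters here: it puts the most significant byte first, so byte-wise comparison of two non-negative timestamps agrees with their numeric order.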

Create the indexable item

Once you have set the metadata for the item, you can create the actual indexable item using the RepositoryDoc.Builder. The following example shows how to create a single indexable item.

GraphTraversalSample.java
// For this sample, content is just plain text
String content = String.format("Hello world from sample doc %s", documentId);
ByteArrayContent byteContent = ByteArrayContent.fromString("text/plain", content);

RepositoryDoc.Builder docBuilder = new RepositoryDoc.Builder()
    .setItem(item)
    .setContent(byteContent, IndexingService.ContentFormat.TEXT);

A RepositoryDoc is a type of ApiOperation that performs the actual IndexingService.indexItem() request.

You can also use the setRequestMode() method of the RepositoryDoc.Builder class to identify the indexing request as ASYNCHRONOUS or SYNCHRONOUS:

ASYNCHRONOUS
Asynchronous mode results in longer indexing-to-serving latency and accommodates large throughput quota for indexing requests. Asynchronous mode is recommended for initial indexing (backfill) of the entire repository.
SYNCHRONOUS
Synchronous mode results in shorter indexing-to-serving latency and accommodates limited throughput quota. Synchronous mode is recommended for indexing of updates and changes to the repository. If unspecified, the request mode defaults to SYNCHRONOUS.
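The recommendation above reduces to a simple rule, sketched here in plain Java. The `Mode` enum is a local stand-in for the two request modes and `chooseMode` is a hypothetical helper; in a connector, the chosen value would correspond to the mode passed to setRequestMode().

```java
/** Hypothetical helper that applies the request mode recommendation. */
final class RequestModeSketch {

  /** Local stand-in for the two request modes described above. */
  enum Mode { SYNCHRONOUS, ASYNCHRONOUS }

  /**
   * Picks ASYNCHRONOUS for an initial backfill of the whole repository and
   * SYNCHRONOUS for indexing updates and changes.
   */
  static Mode chooseMode(boolean initialBackfill) {
    return initialBackfill ? Mode.ASYNCHRONOUS : Mode.SYNCHRONOUS;
  }
}
```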

Place the child IDs in the Cloud Search Indexing Queue

The following code snippet shows how to add the child IDs of the parent item currently being processed to the queue. These IDs are processed after the parent item is indexed.

GraphTraversalSample.java
// Queue the child nodes to visit after indexing this document
Set<String> childIds = getChildItemNames(documentId);
for (String id : childIds) {
  log.info(() -> String.format("Pushing child node %s", id));
  PushItem pushItem = new PushItem();
  docBuilder.addChildId(id, pushItem);
}

RepositoryDoc doc = docBuilder.build();

Note: Cloud Search indexes the metadata for all files. Depending on the file type of an item, Cloud Search also indexes the text content, including using Optical Character Recognition (OCR) on hand-written content. For a list of file types supporting text extraction, refer to Supported file types for text extraction.


Create a content connector using the REST API

The following sections explain how to create a content connector using the REST API.

Determine your traversal strategy

The primary function of a content connector is to traverse a repository and index its data. You must implement a traversal strategy based on the size and layout of data in your repository. Following are three common traversal strategies:

Full traversal strategy

A full traversal strategy scans the entire repository and blindly indexes every item. This strategy is commonly used when you have a small repository and can afford the overhead of doing a full traversal every time you index.

This traversal strategy is suitable for small repositories with mostly static, non-hierarchical data. You might also use this traversal strategy when change detection is difficult or not supported by the repository.

List traversal strategy

A list traversal strategy scans the entire repository, including all child nodes, determining the status of each item. Then, the connector takes a second pass and only indexes items that are new or have been updated since the last indexing. This strategy is commonly used to perform incremental updates to an existing index (instead of having to do a full traversal every time you update the index).

This traversal strategy is suitable when change detection is difficult or not supported by the repository, you have non-hierarchical data, and you are working with very large data sets.

Graph traversal strategy

A graph traversal strategy scans the entire parent node, determining the status of each item. Then, the connector takes a second pass and only indexes items in the root node that are new or have been updated since the last indexing. Finally, the connector passes on any child IDs, then indexes items in the child nodes that are new or have been updated. The connector continues recursively through all child nodes until all items have been addressed. Such traversal is typically used for hierarchical repositories where listing all IDs isn't practical.

This strategy is suitable if you have hierarchical data that needs to be crawled, such as a series of directories or web pages.

Note: The terms “item” and “document” are synonymous in this document and sample code.

Implement your traversal strategy and index items

Every indexable element for Cloud Search is referred to as an item in the Cloud Search API. An item might be a file, folder, a line in a CSV file, or a database record.

Once your schema is registered, you can populate the index by:

  1. (optional) Using items.upload to upload files larger than 100KiB for indexing. For smaller files, embed the content as inlineContent using items.index.

  2. (optional) Using media.upload to upload media files for indexing.

  3. Using items.index to index the item. For example, if your schema uses the object definition in the movie schema, an indexing request for a single item would look like this:

    {
      "name": "datasource/<data_source_id>/items/titanic",
      "acl": {
        "readers": [
          {
            "gsuitePrincipal": {
              "gsuiteDomain": true
            }
          }
        ]
      },
      "metadata": {
        "title": "Titanic",
        "viewUrl": "http://www.imdb.com/title/tt2234155/?ref_=nv_sr_1",
        "objectType": "movie"
      },
      "structuredData": {
        "object": {
          "properties": [
            {"name": "movieTitle", "textValues": {"values": ["Titanic"]}},
            {"name": "releaseDate", "dateValues": {"values": [{"year": 1997, "month": 12, "day": 19}]}},
            {"name": "actorName", "textValues": {"values": ["Leonardo DiCaprio", "Kate Winslet", "Billy Zane"]}},
            {"name": "genre", "enumValues": {"values": ["Drama", "Action"]}},
            {"name": "userRating", "integerValues": {"values": [8]}},
            {"name": "mpaaRating", "textValues": {"values": ["PG-13"]}},
            {"name": "duration", "textValues": {"values": ["3 h 14 min"]}}
          ]
        }
      },
      "content": {
        "inlineContent": "A seventeen-year-old aristocrat falls in love with a kind but poor artist aboard the luxurious, ill-fated R.M.S. Titanic.",
        "contentFormat": "TEXT"
      },
      "version": "01",
      "itemType": "CONTENT_ITEM"
    }

  4. (optional) Using items.get calls to verify an item has been indexed.
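The choice in step 1 between uploading content and embedding it inline can be expressed as a size check. The sketch below is plain Java with hypothetical names (`ContentDeliverySketch`, `Delivery`); it only applies the 100 KiB threshold stated above and makes no API calls.

```java
/** Hypothetical helper that applies the 100 KiB guidance from step 1. */
final class ContentDeliverySketch {

  enum Delivery { INLINE_CONTENT, UPLOAD }

  /** Threshold from the steps above: content larger than 100 KiB is uploaded. */
  static final int INLINE_LIMIT_BYTES = 100 * 1024;

  /** Returns UPLOAD for content over the limit, INLINE_CONTENT otherwise. */
  static Delivery choose(byte[] content) {
    return content.length > INLINE_LIMIT_BYTES ? Delivery.UPLOAD : Delivery.INLINE_CONTENT;
  }
}
```

A connector would call items.upload before items.index when the result is UPLOAD, and otherwise place the content in the inlineContent field of the items.index request.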

To perform a full traversal, you would periodically reindex the entire repository. To perform a list or graph traversal, you need to implement code to handle repository changes.

Note: The objectType field in ItemMetadata should match the object definition name in the schema. In this way, Cloud Search identifies the expected structure of the object in the index request. If, in an indexing request, you send structured data that contains properties not registered with the current schema, or you give an object type that does not correspond to the name of one of the object definitions in the schema, the object or property is ignored. Additionally, if you send properties with a type that is different from the type registered in the schema, Cloud Search returns an error response at indexing time.

Note: Each HTTP connection that your client makes results in overhead. To reduce overhead, you can batch multiple API calls together into a single HTTP request. To learn how to reduce the overhead of your HTTP connections, refer to Request batching. To learn how to handle large media uploads, refer to Direct and Resumable Media Uploads.

Handle repository changes

You can periodically gather and index each item from a repository to perform a full indexing. While effective at ensuring your index is up-to-date, a full indexing can be costly when dealing with larger or hierarchical repositories.

Instead of using index calls to index an entire repository every so often, you can also use the Google Cloud Indexing Queue as a mechanism for tracking changes and only indexing those items that have changed. You can use items.push requests to push items into the queue for later polling and updating. For more information on the Google Cloud Indexing Queue, refer to Google Cloud Indexing Queue.

For further information on the Google Cloud Search API, refer to Cloud Search API.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-11 UTC.