Overview
In this article, I want to share some basics I learned recently about Elasticsearch: how to
index new data using the Elasticsearch Java client. More precisely, I will talk
about how to send a single index request or multiple index requests in bulk.
This article is written for Elasticsearch 7.5 with the Java client
(org.elasticsearch.client.Client) and Java 11.
Create Client
Before sending index requests, you need to create an Elasticsearch Java client. For
example, you can create a new TransportClient, which connects remotely to an Elasticsearch cluster
using the transport module. According to the Elasticsearch Java API documentation
[7.5],
you can do it as follows:
// on startup
TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
    .addTransportAddress(new TransportAddress(InetAddress.getByName("host1"), 9300))
    .addTransportAddress(new TransportAddress(InetAddress.getByName("host2"), 9300));
// on shutdown
client.close();
If, like me, you are using the Elasticsearch Testing Framework, the client is provided
by ESSingleNodeTestCase or ESIntegTestCase, and you can retrieve it
from the base class:
public class MyTest extends ESSingleNodeTestCase {
  @Test
  public void testIndex() throws Exception {
    // retrieve the client
    Client client = client();
    // index or do something else
    ...
  }
}
Note: although this article is mainly written with the legacy client
(org.elasticsearch.client.Client), you should consider using the Java High Level
REST Client instead. More details can be found in the official Migration
Guide.
The Java High Level REST Client depends on the Elasticsearch core project. It
accepts the same request arguments as the TransportClient and returns the
same response objects. I chose the legacy Transport Client because I
rely heavily on the Elasticsearch Testing Framework to write Elasticsearch blog
posts, and the testing framework only provides built-in support for the legacy client,
not for the Java High Level REST Client.
Index Request
Once you have established a connection with the Elasticsearch server, you can use the
Elasticsearch client (org.elasticsearch.client.Client) to prepare a new index
request. An index request must carry the information that makes
it valid: the target index name and the source to index. All other
parameters are optional. Take the document id, for example: if you don’t provide
one, Elasticsearch will generate it for you. More advanced options are also
available, such as routing, pipeline, and type of operation. The simplest form of
an index request looks like this:
IndexRequest idxRequest = new IndexRequest("my_index").source("{\"msg\":\"Hello world!\"}", XContentType.JSON);
IndexResponse idxResponse = client.index(idxRequest).actionGet();
If the specified index does not already exist, by default, it will be automatically created by the index operation. If no mapping exists, the index operation creates a dynamic mapping. By default, new fields and objects are automatically added to the mapping if needed.
Index Response
Here is the cURL version of the request above:
curl -X POST http://localhost:9200/my_index/_doc/?pretty \
  -H 'Content-Type: application/json' \
  -d '{"msg":"hello world!"}'
And the response looks like this:
{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "lkOn8W4BnUQvY76RSQ7m",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
The reply contains the index, the type, the document id, the current version of the
document, and the change that was applied to the document (here, "created"). It also provides
information about the replication process of the index operation. For example,
_shards.total indicates the number of shard copies the index operation should be
executed on, while _shards.successful and _shards.failed give the number of shard copies the index
operation succeeded on and failed on, respectively.
All these pieces of information have equivalents in the Java response object, but I’m not going into
more detail right now.
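The cURL request shown earlier in this section can also be expressed with the HTTP client that ships with Java 11, without any Elasticsearch dependency. The following is only a minimal sketch: the class name IndexHttpRequestSketch and the helper buildIndexRequest are hypothetical names made up for illustration, and the code only builds the request, it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper class, not part of Elasticsearch or of the original article.
class IndexHttpRequestSketch {

  /** Builds (but does not send) a POST request similar to the cURL command. */
  static HttpRequest buildIndexRequest(String baseUrl, String index, String jsonSource) {
    return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_doc"))
        // Same header as the -H option in the cURL command
        .header("Content-Type", "application/json")
        // Same payload as the -d option in the cURL command
        .POST(HttpRequest.BodyPublishers.ofString(jsonSource))
        .build();
  }

  public static void main(String[] args) {
    HttpRequest request =
        buildIndexRequest("http://localhost:9200", "my_index", "{\"msg\":\"Hello world!\"}");
    // Only inspects the request; no network call is made here.
    System.out.println(request.method() + " " + request.uri());
  }
}
```

To actually send it, you would pass the request to a java.net.http.HttpClient, assuming a node is listening on the given address.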
Prepared Index Request
You can also create an index request in a similar way using the “prepare-*” methods of the client. They return a request builder, which lets you chain method calls:
IndexResponse idxResponse =
    client
        .prepareIndex()
        .setIndex("msg")
        .setSource("{\"msg\":\"Hello world!\"}", XContentType.JSON)
        .execute()
        .actionGet();
Personally, I prefer method chaining: on one hand, it makes the code easy to format and read; on the other hand, it explicitly states the name of each parameter being set, which makes the code easy to understand. It makes up for the lack of named parameters in Java.
Bulk Index Request
A bulk request allows you to send multiple index requests at the same time. You
just need to create a BulkRequest or BulkRequestBuilder, then add the index
requests to it.
BulkResponse response =
    client
        .prepareBulk()
        .add(indexRequest1)
        .add(indexRequest2)
        ...
        .execute()
        .actionGet();
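However, after sending a bulk request, you will receive two levels of response: the bulk response itself and one bulk item response per index request. The bulk request as a whole can complete while some of its items have failed, so check both levels: BulkResponse#hasFailures() tells you whether any item failed, and BulkItemResponse#isFailed() tells you which ones.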
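To get a feeling for what a bulk request carries, it can help to look at the REST form of the bulk API, which takes newline-delimited JSON: one action line followed by one source line per document, each terminated by a newline. Below is a minimal sketch using only the JDK. The class BulkBodySketch and the method buildBulkBody are hypothetical names, and the sketch assumes that every source is already a valid single-line JSON document and that the index name needs no escaping.

```java
import java.util.List;

// Hypothetical helper class, not part of the Elasticsearch client.
class BulkBodySketch {

  /** Builds a newline-delimited body for POST /_bulk with one "index" action per source. */
  static String buildBulkBody(String index, List<String> jsonSources) {
    StringBuilder body = new StringBuilder();
    for (String source : jsonSources) {
      // Action line: which operation to run and on which index
      body.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
      // Source line: the document itself, assumed to be single-line JSON
      body.append(source).append('\n');
    }
    return body.toString();
  }

  public static void main(String[] args) {
    System.out.print(buildBulkBody("my_index", List.of("{\"msg\":\"a\"}", "{\"msg\":\"b\"}")));
  }
}
```

The Java client builds the equivalent structure for you when you add index requests to a BulkRequest, so this sketch is only meant to illustrate the shape of the data.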
Content-Type
An index request supports several content types for its source, through XContent. XContent is a generic abstraction on top of handling content, inspired by JSON and pull parsing. You specify the content type when setting the source:
new IndexRequest("msg").source("{\"key\":\"value\"}", XContentType.JSON);
Here is the list of XContent types supported by Elasticsearch 7:
| Name | Media Type | Enum | 
|---|---|---|
| JSON | application/json | XContentType.JSON | 
| SMILE | application/smile | XContentType.SMILE | 
| YAML | application/yaml | XContentType.YAML | 
| CBOR | application/cbor | XContentType.CBOR | 
SMILE is a computer data interchange format based on JSON. It can also be considered a binary serialization of the generic JSON data model, which means tools that operate on JSON may be used with Smile as well, as long as a proper encoder/decoder exists for the tool. The name comes from the first 2 bytes of the 4-byte header, which consist of a smiley “:)”. They are followed by a linefeed, a choice made to make Smile-encoded data files easier to recognize with textual command-line tools. See Wikipedia for more detail. There is also a very nice article on Medium, Understanding Smile — A data format based on JSON, written by Ayush Gupta.
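The header described above is easy to check in code. Here is a minimal sketch, assuming you have the raw bytes at hand; the class SmileHeaderSketch and the method looksLikeSmile are hypothetical names, and the check only looks at the first three bytes of the 4-byte header without interpreting the fourth one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Elasticsearch or of a Smile library.
class SmileHeaderSketch {

  /** Returns true if the data starts with ':' ')' and a linefeed, and is at least 4 bytes long. */
  static boolean looksLikeSmile(byte[] data) {
    return data != null
        && data.length >= 4
        && data[0] == ':'
        && data[1] == ')'
        && data[2] == '\n';
  }

  public static void main(String[] args) {
    byte[] json = "{\"key\":\"value\"}".getBytes(StandardCharsets.UTF_8);
    // Plain JSON text starts with '{', so it is not recognized as Smile
    System.out.println(looksLikeSmile(json));
  }
}
```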
Refresh Policy
In each index request or bulk request, it is possible to define a refresh policy. This setting controls when the changes made by the request become visible to search. It can be defined by the enum “RefreshPolicy” or by the string value shown in parentheses:
- NONE (false). This is the default policy, which means do not refresh after the request.
- IMMEDIATE (true). Force a refresh as part of this request. This refresh policy does not scale for high indexing or search throughput but is useful to present a consistent view for indices with very low traffic. And it is wonderful for tests!
- WAIT_UNTIL (wait_for). Leave this request open until a refresh has made the contents of this request visible to search. This refresh policy is compatible with high indexing and search throughput but it causes the request to wait to reply until a refresh occurs.
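The mapping between the enum names and the string values listed above can be sketched with a small stand-alone enum. This is only an illustration: RefreshParamSketch, its nested Refresh enum and the indexUrl helper are hypothetical names that mirror the list, not the Elasticsearch classes themselves. The sketch assumes the string value is passed as the refresh query parameter of the REST request.

```java
// Hypothetical helper class mirroring the list above, not the Elasticsearch enum.
class RefreshParamSketch {

  enum Refresh {
    NONE("false"),
    IMMEDIATE("true"),
    WAIT_UNTIL("wait_for");

    final String value;

    Refresh(String value) {
      this.value = value;
    }
  }

  /** Builds the URL of an index request carrying the given refresh policy. */
  static String indexUrl(String baseUrl, String index, Refresh refresh) {
    return baseUrl + "/" + index + "/_doc?refresh=" + refresh.value;
  }

  public static void main(String[] args) {
    System.out.println(indexUrl("http://localhost:9200", "my_index", Refresh.WAIT_UNTIL));
  }
}
```

With the Java client, you would normally not build this URL yourself but set the policy on the request object instead.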
Conclusion
In this article, we saw the basics of sending a single index request or a bulk request to Elasticsearch. We saw the structure of the index response. We also took a quick look at the supported content types and the refresh policy. The source code is available on GitHub as mincong-h/learning-elasticsearch; see the classes IndexTest and HttpIndexIT. Interested in knowing more? You can subscribe to the feed of my blog, or follow me on Twitter or GitHub. I hope you enjoyed this article. See you next time!
References
- Elastic, “Index API”, Elastic, 2019. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
- Elastic, “Migration Guide | Java REST Client [7.5]”, Elastic, 2019. https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.5/java-rest-high-level-migration.html
- Elastic, “?refresh | REST APIs”, Elastic, 2019. https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
- Ayush Gupta, “Understanding Smile — A data format based on JSON”, Medium, 2019. https://medium.com/code-with-ayush/understanding-smile-a-data-format-based-on-json-29972a37d376