Question

I am looking to run the optimize API (ES 1.x), which is now known as the forcemerge API in the latest ES versions. After reading some articles like this and this, it seems we should run it only on read-only indices. Quoting the official ES docs:

Force merge should only be called against read-only indices. Running force merge against a read-write index can cause very large segments to be produced (>5GB per segment)

But I don't understand:

  1. The reason behind putting the index into read-only mode before running the forcemerge or optimize API.
  2. As explained in the ES doc above, it could cause very large segments. That shouldn't be the case: as I understand it, new updates are first written in memory and only get written to segments when a refresh happens. So why can writes during a forcemerge produce very large segments?

Also, is there any workaround if we don't want to put the index into read-only mode but still want to run a force merge to expunge deletes?
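For reference, a minimal sketch of the kind of call I mean, assuming the official elasticsearch Python client (8.x-style arguments), a local cluster, and `my-index` as a placeholder index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Rewrite only segments that contain deleted documents, reclaiming
# their space, without specifying a target segment count.
es.indices.forcemerge(index="my-index", only_expunge_deletes=True)
```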

Let me know if I need to provide any additional information.

Answers

forcemerge can significantly improve the performance of your queries, as it allows you to merge the existing segments into a smaller number of segments, which is more efficient for querying because segments get searched sequentially. While merging, all documents marked for deletion also get cleaned up.
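To make that concrete, a minimal sketch of triggering a force merge, assuming the official elasticsearch Python client (8.x-style arguments), a local cluster, and `my-index` as a placeholder index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Trigger a force merge: Elasticsearch rewrites the index's segments
# into fewer segments and drops documents marked as deleted.
es.indices.forcemerge(index="my-index")
```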

Merging happens regularly and automatically in the background as part of Elasticsearch's housekeeping, based on a merge policy.
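If you want to see which merge policy settings your index is actually running with, you can ask for the index settings including defaults; a sketch under the same assumptions as above (the response path below assumes the default nested settings format):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the index settings, including defaults you never set
# explicitly, such as the merge policy (index.merge.policy.*).
settings = es.indices.get_settings(index="my-index", include_defaults=True)
print(settings["my-index"]["defaults"]["index"]["merge"]["policy"])
```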

The tricky thing: only segments up to 5GB are considered by the merge policy. Using the forcemerge API with the parameter that allows you to specify the number of resulting segments, you risk that the resulting segment(s) get bigger than 5GB, meaning that they will no longer be considered by future merge requests. As long as you don't delete or update documents, there is nothing wrong with that. However, if you keep on deleting or updating documents, Lucene will mark the old version of your documents in the existing segments as deleted and write the new version of your documents into new segments. If your deleted documents reside in segments larger than 5GB, no more housekeeping is done on them, i.e. the documents marked for deletion will never get cleaned up.
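To check whether an index has already ended up in that state, you can list segment sizes together with per-segment deleted-document counts; again a sketch under the same assumptions, not a definitive recipe:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List each segment with its size and the number of documents it
# still holds that are marked as deleted but not yet cleaned up.
print(es.cat.segments(index="my-index", v=True,
                      h="segment,size,docs.count,docs.deleted"))
```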

By setting an index to read-only prior to doing a force merge, you ensure that you will not end up with huge segments containing a lot of legacy documents, which consume precious resources in memory and on disk and slow down your queries.
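Put together, the sequence might look like this (a sketch under the same assumptions; `index.blocks.write` blocks write operations while the index stays readable):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Block writes so that no new documents, updates or deletes
#    arrive while the merge is running.
es.indices.put_settings(index="my-index",
                        settings={"index.blocks.write": True})

# 2. Merge down to a single segment; safe now, because the index
#    can no longer accumulate freshly deleted documents.
es.indices.forcemerge(index="my-index", max_num_segments=1)

# 3. If the index will only be read from now on, you can leave the
#    write block in place; otherwise remove it again.
es.indices.put_settings(index="my-index",
                        settings={"index.blocks.write": False})
```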

A refresh is doing something different: it's correct that documents you want to get indexed are first processed in memory before getting written to disk. But the data structure that allows you to actually find a document (the "segment") does not get created for every single document right away, as this would be highly inefficient. Segments are only created when the internal buffer gets full, or when a refresh occurs. By triggering a refresh you make a document immediately available for search. Still, the segment at first only lives in memory, as, again, it would be extremely inefficient to sync every segment to disk right after it got created. Segments in memory get periodically synced to disk. Even if you pull the plug before a sync to disk has happened, you don't lose any information, as Elasticsearch maintains a translog that allows it to "replay" all indexing requests that have not yet made it into a segment on disk.
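The effect is easy to observe from a client; a small sketch under the same assumptions: index a document, refresh, then search. Without the explicit refresh, the document would only become searchable after the next periodic refresh (1s by default):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The document lands in the in-memory buffer and the translog;
# it is durable, but not yet searchable.
es.index(index="my-index", id="1", document={"title": "hello"})

# The refresh turns the in-memory buffer into a searchable segment
# (still in memory; the sync to disk happens later, on flush).
es.indices.refresh(index="my-index")

# Now the document can be found.
print(es.search(index="my-index", query={"match": {"title": "hello"}}))
```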
