Question

I am looking to run the optimize API (ES 1.x), which is now known as the forcemerge API in the latest ES versions. After reading some articles like this and this, it seems we should run it only on read-only indices. Quoting the official ES docs:

Force merge should only be called against read-only indices. Running force merge against a read-write index can cause very large segments to be produced (>5GB per segment)

But I don't understand:

  1. The reason behind putting the index into read-only mode before running the forcemerge or optimize API.
  2. As explained in the ES doc above, it could cause very large segments. That shouldn't be the case: as I understand it, new updates are first written in memory and only get written to segments when a refresh happens. So why can writes during a forcemerge produce very large segments?

Also, is there any workaround if we don't want to put the index into read-only mode but still want to run a force merge to expunge deletes?
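For reference, a minimal sketch of the kind of call I mean, assuming the official elasticsearch Python client (8.x-style arguments), a local cluster, and `my-index` as a placeholder index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Rewrite only segments that contain deleted documents, reclaiming
# their space, without specifying a target segment count.
es.indices.forcemerge(index="my-index", only_expunge_deletes=True)
```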

Let me know if I need to provide any additional information.

Answers

forcemerge can significantly improve the performance of your queries, as it allows you to merge the existing segments into a smaller number of segments, which is more efficient for querying because segments get searched sequentially. While merging, all documents marked for deletion also get cleaned up.
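To make that concrete, a minimal sketch of triggering a force merge, assuming the official elasticsearch Python client (8.x-style arguments), a local cluster, and `my-index` as a placeholder index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Trigger a force merge: Elasticsearch rewrites the index's segments
# into fewer segments and drops documents marked as deleted.
es.indices.forcemerge(index="my-index")
```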

Merging happens regularly and automatically in the background as part of Elasticsearch's housekeeping, based on a merge policy.
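If you want to see which merge policy settings your index is actually running with, you can ask for the index settings including defaults; a sketch under the same assumptions as above (the response path below assumes the default nested settings format):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the index settings, including defaults you never set
# explicitly, such as the merge policy (index.merge.policy.*).
settings = es.indices.get_settings(index="my-index", include_defaults=True)
print(settings["my-index"]["defaults"]["index"]["merge"]["policy"])
```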

The tricky thing: only segments up to 5GB are considered by the merge policy. Using the forcemerge API with the parameter that allows you to specify the number of resulting segments, you risk that the resulting segment(s) get bigger than 5GB, meaning that they will no longer be considered by future merge requests. As long as you don't delete or update documents, there is nothing wrong with that. However, if you keep on deleting or updating documents, Lucene will mark the old version of your documents in the existing segments as deleted and write the new version of your documents into new segments. If your deleted documents reside in segments larger than 5GB, no more housekeeping is done on them, i.e. the documents marked for deletion will never get cleaned up.
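To check whether an index has already ended up in that state, you can list segment sizes together with per-segment deleted-document counts; again a sketch under the same assumptions, not a definitive recipe:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List each segment with its size and the number of documents it
# still holds that are marked as deleted but not yet cleaned up.
print(es.cat.segments(index="my-index", v=True,
                      h="segment,size,docs.count,docs.deleted"))
```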

By setting an index to read-only prior to doing a force merge, you ensure that you will not end up with huge segments containing a lot of legacy documents, which consume precious resources in memory and on disk and slow down your queries.
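Put together, the sequence might look like this (a sketch under the same assumptions; `index.blocks.write` blocks write operations while the index stays readable):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Block writes so that no new documents, updates or deletes
#    arrive while the merge is running.
es.indices.put_settings(index="my-index",
                        settings={"index.blocks.write": True})

# 2. Merge down to a single segment; safe now, because the index
#    can no longer accumulate freshly deleted documents.
es.indices.forcemerge(index="my-index", max_num_segments=1)

# 3. If the index will only be read from now on, you can leave the
#    write block in place; otherwise remove it again.
es.indices.put_settings(index="my-index",
                        settings={"index.blocks.write": False})
```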

A refresh is doing something different: it's correct that documents you want to get indexed are first processed in memory before getting written to disk. But the data structure that allows you to actually find a document (the "segment") does not get created for every single document right away, as this would be highly inefficient. Segments are only created when the internal buffer gets full, or when a refresh occurs. By triggering a refresh you make a document immediately available for search. Still, the segment at first only lives in memory, as, again, it would be extremely inefficient to sync every segment to disk right after it got created. Segments in memory get periodically synced to disk. Even if you pull the plug before a sync to disk has happened, you don't lose any information, as Elasticsearch maintains a translog that allows it to "replay" all indexing requests that have not yet made it into a segment on disk.
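The effect is easy to observe from a client; a small sketch under the same assumptions: index a document, refresh, then search. Without the explicit refresh, the document would only become searchable after the next periodic refresh (1s by default):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The document lands in the in-memory buffer and the translog;
# it is durable, but not yet searchable.
es.index(index="my-index", id="1", document={"title": "hello"})

# The refresh turns the in-memory buffer into a searchable segment
# (still in memory; the sync to disk happens later, on flush).
es.indices.refresh(index="my-index")

# Now the document can be found.
print(es.search(index="my-index", query={"match": {"title": "hello"}}))
```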
