NUTCH-3162 Latency metrics to properly merge data from all threads and tasks #906

Open
lewismc wants to merge 6 commits into apache:master from lewismc:NUTCH-3162

Conversation

@lewismc
Member

@lewismc lewismc commented Mar 11, 2026

PR for NUTCH-3162, which addresses shortcomings in the job-level latency percentiles (p50, p95, p99) for Fetcher, ParseSegment, and Indexer by merging TDigest data from all map tasks and threads and writing the counters in a single reducer (or, for the Indexer, a dedicated merge job). It fixes the cases where per-task counters were summed but the percentiles were never merged.
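The core of the problem can be illustrated with a self-contained sketch: per-task percentiles cannot simply be summed or averaged; the underlying distributions have to be merged before the quantiles are computed. (Plain Java with exact percentiles stands in here for the TDigest approximation the patch uses; all names below are hypothetical, not the actual PR code.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LatencyMergeSketch {

  // Exact percentile over a sorted list (nearest-rank method).
  static long percentile(List<Long> sorted, double p) {
    int idx = (int) Math.ceil(p * sorted.size()) - 1;
    return sorted.get(Math.max(idx, 0));
  }

  public static void main(String[] args) {
    // Latency samples (ms) from two hypothetical map tasks.
    List<List<Long>> perTask = List.of(
        List.of(10L, 20L, 30L),
        List.of(100L, 200L, 300L));

    // The single reducer merges the full distributions first...
    List<Long> merged = new ArrayList<>();
    for (List<Long> task : perTask) {
      merged.addAll(task);
    }
    Collections.sort(merged);

    // ...and only then computes the job-level percentile. Summing or
    // averaging the per-task p50s (20 and 200) would give a different,
    // wrong answer than the merged p50 below.
    System.out.println(percentile(merged, 0.5)); // prints 30
  }
}
```

t-digest sketches make the same merge possible without shipping raw samples: each task serializes its digest, and digests are mergeable in the reducer.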

This patch touches the following jobs:

  • Fetcher: Per-thread latency merged in mapper; single reducer merges TDigests and sets job-level p50/p95/p99.
  • ParseSegment:
    • Mapper emits latency digest under LATENCY_KEY
    • Custom partitioner sends LATENCY_KEY to partition 0 so one reducer merges all TDigests
    • Reducer merges and sets correct percentile counters.
  • Indexer:
    • Reducer writes TDigest to side output
  • IndexingJob runs a new “Indexer Latency Merge” job whose reducer merges the TDigests and sets the percentile counters. On merge failure, only a LOG.error and the driver-level ErrorTracker categorization are run.
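The partition-0 routing for ParseSegment described above can be sketched as follows. This is a stand-alone stand-in for a Hadoop Partitioner subclass; the LATENCY_KEY value and the class name are hypothetical, not the actual PR code.

```java
public class LatencyPartitionerSketch {

  // Hypothetical marker key under which each mapper emits its latency digest.
  static final String LATENCY_KEY = "__latency__";

  // All latency digests go to partition 0, so a single reducer sees and
  // merges every TDigest; all other keys keep the usual hash partitioning.
  static int getPartition(String key, int numPartitions) {
    if (LATENCY_KEY.equals(key)) {
      return 0;
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    System.out.println(getPartition(LATENCY_KEY, 8)); // prints 0
    System.out.println(getPartition("https://example.org/", 8));
  }
}
```

In a real job this logic would live in a class extending org.apache.hadoop.mapreduce.Partitioner and be registered via Job.setPartitionerClass.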

I think this fixes the issues. Arguably it is more complex than logging to a file and performing some ETL to extract metrics from the logs; however, this solution sticks with convention by keeping metrics within the Hadoop ecosystem.

Finally, the PR is complemented with unit tests. This allowed me to think more about how we can add metrics validation in Nutch, but that will come in a separate issue/PR under NUTCH-3131.

Thanks for any review.

@lewismc lewismc requested a review from sebastian-nagel March 11, 2026 13:30
@lewismc lewismc self-assigned this Mar 11, 2026
Comment thread src/java/org/apache/nutch/indexer/IndexingJob.java
@lewismc
Member Author

lewismc commented Mar 11, 2026

Refactored tests and introduced LatencyTestUtil.java to centralize boilerplate for latency-generating test code.

@sonarqubecloud

Quality Gate failed

Failed conditions
43.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

…d tasks

Wrap ParseImpl and BytesWritable holding the serialized T-Digest into a
NutchWritable object.
Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @lewismc,

after fixing the ParseSegment job in f1d5dc3, testing of various jobs was successful, except for the IndexingJob, which fails when merging the latency metrics; see the inline comments.

I'll continue tomorrow:

  • fix the unit tests for the parse job,
  • verify the latency metrics job counters in the log files,
  • have a closer look into the IndexingJob.

Comment thread src/java/org/apache/nutch/parse/ParseSegment.java Outdated
if (fs.exists(latencyDir)) {
  try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
    FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
    boolean mergeSuccess = mergeJob.waitForCompletion(true);
Contributor

2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)

when running

bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)' -nocrawldb /path/to/segments/20260408205641/

Contributor

Or when running on a single-node cluster (indexing to Solr):

2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency

This only affects the latency-merge job.

Contributor

Looked into SequenceFileInputFormat.listStatus: it either takes a single file or a directory tree with sequence files _latency/part-xxx/data.
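For illustration, the directory shape that listStatus would accept can be mocked up with plain java.nio; the part-00000 name is a hypothetical stand-in for the part-xxx files mentioned above, and the real files are Hadoop sequence files rather than empty placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LatencyLayoutSketch {
  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("latency-layout");
    // Tree expected by the merge job: _latency/part-xxx/data
    Path data = tmp.resolve("_latency").resolve("part-00000").resolve("data");
    Files.createDirectories(data.getParent());
    Files.createFile(data);
    System.out.println(Files.exists(data)); // prints true
  }
}
```

The failing run above suggests the _latency directory itself was never written, so a guard would also need to cover the case where no reducer emitted a digest.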

Member Author

Yes, I wondered about this. I am not a huge fan of the intermediate output being written for the IndexingJob either. I think we could even remove the changes for this job and address them separately. This will NOT have an impact on job execution; however, the counters would not be accurate.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @lewismc,

TestParseSegment is fixed.

I've verified that the latency metrics are plausible for all jobs, except the indexer.

So my +1 for all parts except the indexer, so far.

For the indexer, I'm not 100% convinced regarding the solution with a temporary output folder.

We also had this issue with indexers writing files (indexer-csv, see NUTCH-2793 / PR #534): it might be an improvement to give the indexer plugins the possibility to write data to an index output directory. Because multiple indexers can potentially be used in the same job, we would need subdirectories anyway. This would also allow one for the latency metrics, and would be more transparent.

@lewismc
Member Author

lewismc commented Apr 11, 2026

The temp Indexer output isn't ideal. Thank you for reviewing. This "Indexer Latency Merge" is one solution; I will follow up with alternative(s) soon.
Additionally, even documenting the metrics evolution is something I need to get around to. For now I have added a notice to the wiki metrics documentation stating that it is superseded.

The best current documentation is our Javadoc

I think we can do better than Javadoc and I am thinking about how we package better documentation.

