NUTCH-3162 Latency metrics to properly merge data from all threads and tasks #906

Open
lewismc wants to merge 6 commits into apache:master from lewismc:NUTCH-3162

Conversation

@lewismc
Member

@lewismc lewismc commented Mar 11, 2026

PR for NUTCH-3162, which addresses shortcomings in the job-level latency percentiles (p50, p95, p99) for Fetcher, ParseSegment, and Indexer by merging TDigest data from all map tasks and threads and writing the counters in a single reducer (or, for the Indexer, a dedicated merge job). It fixes the cases where per-task counters were summed but the percentiles were never merged.
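The core of the problem can be illustrated with a self-contained sketch: per-task percentiles cannot simply be summed or averaged; the underlying distributions have to be merged before the quantiles are computed. (Plain Java with exact percentiles stands in here for the TDigest approximation the patch uses; all names below are hypothetical, not the actual PR code.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LatencyMergeSketch {

  // Exact percentile over a sorted list (nearest-rank method).
  static long percentile(List<Long> sorted, double p) {
    int idx = (int) Math.ceil(p * sorted.size()) - 1;
    return sorted.get(Math.max(idx, 0));
  }

  public static void main(String[] args) {
    // Latency samples (ms) from two hypothetical map tasks.
    List<List<Long>> perTask = List.of(
        List.of(10L, 20L, 30L),
        List.of(100L, 200L, 300L));

    // The single reducer merges the full distributions first...
    List<Long> merged = new ArrayList<>();
    for (List<Long> task : perTask) {
      merged.addAll(task);
    }
    Collections.sort(merged);

    // ...and only then computes the job-level percentile. Summing or
    // averaging the per-task p50s (20 and 200) would give a different,
    // wrong answer than the merged p50 below.
    System.out.println(percentile(merged, 0.5)); // prints 30
  }
}
```

t-digest sketches make the same merge possible without shipping raw samples: each task serializes its digest, and digests are mergeable in the reducer.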

This patch touches the following jobs:

  • Fetcher: Per-thread latency merged in mapper; single reducer merges TDigests and sets job-level p50/p95/p99.
  • ParseSegment:
    • Mapper emits latency digest under LATENCY_KEY
    • Custom partitioner sends LATENCY_KEY to partition 0 so one reducer merges all TDigests
    • Reducer merges and sets correct percentile counters.
  • Indexer:
    • Reducer writes TDigest to side output
  • IndexingJob runs a new “Indexer Latency Merge” job whose reducer merges the TDigests and sets the percentile counters. On merge failure, only a LOG.error and the driver-level ErrorTracker categorization are run.
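The partition-0 routing for ParseSegment described above can be sketched as follows. This is a stand-alone stand-in for a Hadoop Partitioner subclass; the LATENCY_KEY value and the class name are hypothetical, not the actual PR code.

```java
public class LatencyPartitionerSketch {

  // Hypothetical marker key under which each mapper emits its latency digest.
  static final String LATENCY_KEY = "__latency__";

  // All latency digests go to partition 0, so a single reducer sees and
  // merges every TDigest; all other keys keep the usual hash partitioning.
  static int getPartition(String key, int numPartitions) {
    if (LATENCY_KEY.equals(key)) {
      return 0;
    }
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    System.out.println(getPartition(LATENCY_KEY, 8)); // prints 0
    System.out.println(getPartition("https://example.org/", 8));
  }
}
```

In a real job this logic would live in a class extending org.apache.hadoop.mapreduce.Partitioner and be registered via Job.setPartitionerClass.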

I think this fixes the issues. Arguably it is more complex than logging to a file and performing some ETL to extract metrics from the logs; however, this solution sticks with convention by keeping metrics within the Hadoop ecosystem.

Finally, the PR is complemented with unit tests. This allowed me to think more about how we can add metrics validation in Nutch, but that will come in a separate issue/PR under NUTCH-3131.

Thanks for any review.

@lewismc lewismc requested a review from sebastian-nagel March 11, 2026 13:30
@lewismc lewismc self-assigned this Mar 11, 2026
Comment thread src/java/org/apache/nutch/indexer/IndexingJob.java
@lewismc
Member Author

lewismc commented Mar 11, 2026

Refactored tests and introduced LatencyTestUtil.java to centralize boilerplate for latency-generating test code.

@sonarqubecloud

Quality Gate failed

Failed conditions
43.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

…d tasks

Wrap ParseImpl and BytesWritable holding the serialized T-Digest into a
NutchWritable object.
Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @lewismc,

after fixing the ParseSegment job in f1d5dc3, testing of various jobs was successful, except for the IndexingJob, which fails when merging the latency metrics; see the inline comments.

I'll continue tomorrow:

  • fix the unit tests for the parse job,
  • verify the latency metrics job counters in the log files,
  • have a closer look into the IndexingJob.

Comment thread src/java/org/apache/nutch/parse/ParseSegment.java Outdated
if (fs.exists(latencyDir)) {
  try (Job mergeJob = IndexerMapReduce.createLatencyMergeJob(conf, latencyDir)) {
    FileOutputFormat.setOutputPath(mergeJob, new Path(tmp, "_latency_merge_out"));
    boolean mergeSuccess = mergeJob.waitForCompletion(true);
Contributor

2026-04-08 21:38:08,406 ERROR o.a.n.i.IndexingJob [main] Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/mnt/data/wastl/proj/crawler/nutch/test/tmp_1775677086987-2006747810/_latency
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:342)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:281)
        at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:445)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:311)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:328)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:201)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1677)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1674)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
        at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1674)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1695)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:167)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:320)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:329)

when running

bin/nutch index -Dplugin.includes='indexer-dummy|index-(basic|more)' -nocrawldb /path/to/segments/20260408205641/

Contributor

Or when running on a single-node cluster (indexing to Solr):

2026-04-08 21:50:01,953 ERROR indexer.IndexingJob: Indexer: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:9000/user/wastl/tmp_1775677717095--1619189953/_latency

This only affects the latency-merge job.

Contributor

Looked into SequenceFileInputFormat.listStatus: it either takes a single file or a directory tree with sequence files _latency/part-xxx/data.
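For illustration, the directory shape that listStatus would accept can be mocked up with plain java.nio; the part-00000 name is a hypothetical stand-in for the part-xxx files mentioned above, and the real files are Hadoop sequence files rather than empty placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LatencyLayoutSketch {
  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("latency-layout");
    // Tree expected by the merge job: _latency/part-xxx/data
    Path data = tmp.resolve("_latency").resolve("part-00000").resolve("data");
    Files.createDirectories(data.getParent());
    Files.createFile(data);
    System.out.println(Files.exists(data)); // prints true
  }
}
```

The failing run above suggests the _latency directory itself was never written, so a guard would also need to cover the case where no reducer emitted a digest.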

Member Author

Yes, I wondered about this. I am not a huge fan of the intermediate output being written for the IndexingJob either. I think we could even remove the changes for this job and address them separately. This will NOT have an impact on job execution; however, the counters would not be accurate.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Hi @lewismc,

TestParseSegment is fixed.

I've verified that the latency metrics are plausible for all jobs, except the indexer.

So my +1 for all parts except the indexer, so far.

For the indexer, I'm not 100% convinced regarding the solution with a temporary output folder.

We also had this issue with indexers writing files (indexer-csv, see NUTCH-2793 / PR #534): it might be an improvement to give the indexer plugins the possibility to write data to an index output directory. Because multiple indexers can potentially be used in the same job, we would need subdirectories anyway. This would also allow one for the latency metrics, and would be more transparent.

@lewismc
Member Author

lewismc commented Apr 11, 2026

The temp Indexer output isn't ideal. Thank you for reviewing. This "Indexer Latency Merge" is one solution; I will follow up with alternative(s) soon.
Additionally, even documenting the metrics evolution is something I need to get around to. For now I have added a notice to the wiki metrics documentation stating that it is superseded.

The best current documentation is our Javadoc

I think we can do better than Javadoc and I am thinking about how we package better documentation.

