Add md5 sum function by pfultz2 · Pull Request #4801 · ROCm/AMDMIGraphX

pfultz2 · 2026-04-20T15:43:11Z

Motivation

This can help support persistent binary caches in the future.

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

- Added: New functionality.
- Changed: Changes to existing functionality.
- Removed: Functionality or support that has been removed. (Compared to a previous release)
- Optimized: Component performance that has been optimized or improved.
- Resolved Issues: Known issues from a previous version that have been resolved.
- Not Applicable: This PR is not to be included in the changelog.

Copilot

Pull request overview

Adds an MD5 hashing utility to MIGraphX, along with a reusable hex-encoding helper, as groundwork for future persistent binary cache keys/checksums.

Changes:

Introduces public API migraphx::md5(std::string_view) returning a lowercase hex digest.
Adds migraphx::to_hex_string(Range, bool lsb) for integer-range hex serialization (including MD5-style little-endian byte layout).
Expands unit tests to cover MD5 RFC vectors and hex encoding behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
`src/include/migraphx/md5.hpp`	Declares the exported `md5` API.
`src/md5.cpp`	Implements the MD5 algorithm and hex formatting of the final state.
`src/include/migraphx/stringutils.hpp`	Adds `to_hex_string` helper used by MD5 (and potentially future cache-key work).
`test/md5.cpp`	Adds extensive MD5 test vectors (RFC 1321 + edge cases).
`test/stringutils.cpp`	Adds tests validating `to_hex_string` endianness and formatting.
`src/CMakeLists.txt`	Wires `md5.cpp` into the `migraphx` library build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-20T15:51:19Z

+        const auto chunk_begin = str.begin() + (i * block_size);
+        std::transform(chunk_begin, chunk_begin + block_size, block.begin(), &to_uint8);


These transforms feed char data into to_uint8(int8_t). On platforms where char is unsigned, bytes >= 0x80 can be altered by the intermediate int8_t conversion, yielding incorrect MD5s for non-ASCII data. Consider using a lambda that casts char -> unsigned char -> std::uint8_t directly.

Copilot · 2026-04-20T15:51:19Z

+    return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
+        using T            = std::make_unsigned_t<std::decay_t<decltype(x)>>;
+        const auto u       = static_cast<T>(x);
+        const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {
+            return (u >> (b * 8u)) & 0xffu;
+        };
+        const auto append_hex = [&](std::string s, std::uint8_t byte) {
+            s.push_back(hex_digits[byte >> 4u]);
+            s.push_back(hex_digits[byte & 0x0fu]);
+            return s;
+        };
+        const auto bytes = range(sizeof(T));
+        if(lsb)
+        {
+            return transform_accumulate(
+                bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte);
+        }
+        const auto rbytes = reverse(bytes);
+        return transform_accumulate(
+            rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte);
+    });


to_hex_string builds the result via std::accumulate and (inside) transform_accumulate, both of which pass/return std::string by value. This causes repeated string copies and can become quadratic for larger inputs. Consider appending into a single std::string (and reserving the final size when possible).

Suggested change

return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {

using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;

const auto u = static_cast<T>(x);

const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {

return (u >> (b * 8u)) & 0xffu;

};

const auto append_hex = [&](std::string s, std::uint8_t byte) {

s.push_back(hex_digits[byte >> 4u]);

s.push_back(hex_digits[byte & 0x0fu]);

return s;

};

const auto bytes = range(sizeof(T));

if(lsb)

{

return transform_accumulate(

bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte);

}

const auto rbytes = reverse(bytes);

return transform_accumulate(

rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte);

});

using T = std::make_unsigned_t<std::decay_t<decltype(*r.begin())>>;

std::string result;

result.reserve(r.size() * sizeof(T) * 2);

const auto append_hex = [&](std::uint8_t byte) {

result.push_back(hex_digits[byte >> 4u]);

result.push_back(hex_digits[byte & 0x0fu]);

};

for(const auto& x : r)

{

const auto u = static_cast<T>(x);

const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {

return (u >> (b * 8u)) & 0xffu;

};

const auto bytes = range(sizeof(T));

if(lsb)

{

for(auto b : bytes)

append_hex(to_byte(b));

}

else

{

for(auto b : reverse(bytes))

append_hex(to_byte(b));

}

}

return result;

Copilot · 2026-04-20T15:51:19Z

+TEST_CASE(to_hex_string_md5_initial_state_lsb)
+{
+    // LSB ordering matches MD5's canonical byte layout: the initial state
+    // serialized LSB-first is the well-known digest of the empty string.


The comment claims this value is "the well-known digest of the empty string", but the expected hex (0123...3210) is the MD5 initial state serialized little-endian, not the empty-string digest (d41d8...). Please adjust the wording so the test documentation matches what is being asserted.

Suggested change

// serialized LSB-first is the well-known digest of the empty string.

// serialized LSB-first yields 0123456789abcdeffedcba9876543210.

Copilot · 2026-04-20T15:51:20Z

+namespace migraphx {
+inline namespace MIGRAPHX_INLINE_NS {
+
+/// Compute the MD5 digest of a string and return it as a lowercase hex string.


Since md5 is a public API and MD5 is not suitable for cryptographic security, consider documenting that this function is intended for non-cryptographic checksums (e.g., cache keys) so callers don't misinterpret it as secure hashing.

Suggested change

/// Compute the MD5 digest of a string and return it as a lowercase hex string.

/// Compute the MD5 digest of a string and return it as a lowercase hex string.

/// This function is intended only for non-cryptographic checksums such as cache

/// keys or content fingerprints. MD5 is not cryptographically secure and must

/// not be used for passwords, signatures, or other security-sensitive hashing.

Copilot · 2026-04-20T15:51:20Z

+    constexpr std::array<char, 16> hex_digits = {
+        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
+    return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
+        using T            = std::make_unsigned_t<std::decay_t<decltype(x)>>;


to_hex_string is an unconstrained template but assumes elements are integral types usable with std::make_unsigned_t. If called with a range of non-integral types (or bool), this will fail with hard-to-decipher template errors. Add a constraint/static_assert on the element type to make misuse fail clearly.

Suggested change

constexpr std::array<char, 16> hex_digits = {

'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};

return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {

using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;

using element_type = std::decay_t<decltype(*r.begin())>;

static_assert(std::is_integral_v<element_type> and not std::is_same_v<element_type, bool>,

"to_hex_string requires a range of integral, non-bool elements");

constexpr std::array<char, 16> hex_digits = {

'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};

return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {

using T = std::make_unsigned_t<element_type>;

Copilot · 2026-04-20T15:51:20Z

+    return {state[0] + a, state[1] + b, state[2] + c, state[3] + d};
+}
+
+std::uint8_t to_uint8(std::int8_t c) { return bit_cast<std::uint8_t>(c); }


to_uint8 takes std::int8_t. When hashing arbitrary bytes from a std::string_view (which yields char), converting values >= 0x80 through int8_t is implementation-defined if char is unsigned, and can change the input bytes. Prefer converting from char via unsigned char to std::uint8_t (no bit_cast needed).

Copilot · 2026-04-20T15:51:21Z

+std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state,
+                                           std::array<std::uint8_t, block_size> block)


process_block takes the 64-byte block by value, which copies the entire block for every compression call. Since this runs once per 64 bytes of input, it can add avoidable overhead for large strings; take block by const& instead.

Suggested change

std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state,

std::array<std::uint8_t, block_size> block)

std::array<std::uint32_t, 4> process_block(

std::array<std::uint32_t, 4> state, const std::array<std::uint8_t, block_size>& block)

Copilot · 2026-04-20T15:51:21Z

+ */
+#include <migraphx/md5.hpp>
+#include <migraphx/bit_cast.hpp>
+#include <migraphx/stringutils.hpp>


This file uses range(...) and transform_partial_sum(...), but relies on those being transitively included via migraphx/stringutils.hpp. To make the dependency explicit (and avoid future breakage if stringutils.hpp stops including them), include the headers that define these utilities directly (e.g., migraphx/ranges.hpp and migraphx/algorithm.hpp).

Suggested change

#include <migraphx/stringutils.hpp>

#include <migraphx/stringutils.hpp>

#include <migraphx/ranges.hpp>

#include <migraphx/algorithm.hpp>

causten

Isn't there a library that you can call instead of writing it ourselves from scratch? Does it produce an identical hash as an existing library?

codecov · 2026-04-20T18:27:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #4801      +/-   ##
===========================================
+ Coverage    92.49%   92.51%   +0.01%     
===========================================
  Files          583      584       +1     
  Lines        29562    29619      +57     
===========================================
+ Hits         27343    27400      +57     
  Misses        2219     2219

Files with missing lines	Coverage Δ
src/include/migraphx/stringutils.hpp	`99.01% <100.00%> (+0.17%)`	⬆️
src/md5.cpp	`100.00% <100.00%> (ø)`

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

pfultz2 · 2026-04-22T23:27:32Z

Isn't there a library that you can call instead of writing it ourselves from scratch?

We could use libcrypto from openssl or libressl but its a pretty big dependency for just one function(see here where it was removed from miopen by a user for this exact reason). Furthermore adding a openssl or libressl for all platforms gets a little tricky.

There is the open source implementation that could be copied(like what miopen did) but it was much easier to have claude write it from scratch instead of trying to fix all the warnings from clang, tidy and cppcheck. Its also a much cleaner and simpler implementation.

Does it produce an identical hash as an existing library?

Yes it does thats what the tests check. The firsts few tests are from the RFC 1321:

MD5 ("") = d41d8cd98f00b204e9800998ecf8427e
MD5 ("a") = 0cc175b9c0f1b6a831c399e269772661
MD5 ("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5 ("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5 ("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b
MD5 ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") =
d174ab98d277d9f5a5611c2c9f419d9f
MD5 ("123456789012345678901234567890123456789012345678901234567890123456
78901234567890") = 57edf4a22be3c955ac49da2e2107b67a

The other ones I generated from the md5sum:

echo -n "The quick brown fox jumps over the lazy dog" | md5sum  # 9e107d9d372bb6826bd81d3542a419d
echo -n "The quick brown fox jumps over the lazy dog." | md5sum # e4d909c290d0fb1ca068ffaddf22cbd0
printf 'a%.0s' {1..55}  | md5sum                                # ef1772b6dff9a122358552954ad0df65
printf 'a%.0s' {1..56}  | md5sum                                # 3b0c8ac703f828b04c6c197006d17218
printf 'a%.0s' {1..63}  | md5sum                                # b06521f39153d618550606be297466d5
printf 'a%.0s' {1..64}  | md5sum                                # 014842d480b571495a4a0363793f7367
printf 'a%.0s' {1..65}  | md5sum                                # c743a45e0d2e6a95cb859adae0248435
printf 'a%.0s' {1..119} | md5sum                                # 8a7bd0732ed6a28ce75f6dabc90e1613
printf 'a%.0s' {1..120} | md5sum                                # 5f61c0ccad4cac44c75ff505e1f1e537

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-24T03:53:38Z

+    const bool need_two    = (remainder >= block_size - 8);
+    const std::uint64_t bit_length = std::uint64_t{str.size()} * 8u;
+    const auto& last               = need_two ? tail[1] : tail[0];
+    const auto bit_indices = range(8);
+    transform_partial_sum(
+        bit_indices.begin(),
+        bit_indices.end(),
+        last.end() - 8,
+        [](std::uint64_t acc, std::uint64_t) { return acc >> 8u; },


transform_partial_sum writes through the output iterator, but last is bound as const auto& so last.end() - 8 is a const_iterator and the assignment in transform_partial_sum won’t compile. Make last a non-const reference (e.g., auto& last = ...) so the length field can be written into tail[0/1].

Copilot · 2026-04-24T03:53:38Z

+/// Concatenate the lowercase hex representation of each integer in a range.
+/// Each element emits 2*sizeof(element) characters. When lsb is false the
+/// most-significant byte is emitted first (standard hex notation); when true,
+/// the least-significant byte is first (matches byte-stream hash digests).
+template <class Range>
+inline std::string to_hex_string(const Range& r, bool lsb = false)
+{
+    constexpr std::array<char, 16> hex_digits = {
+        '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
+    return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
+        using type         = std::make_unsigned_t<std::decay_t<decltype(x)>>;
+        const auto u       = bit_cast<type>(x);
+        const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {


to_hex_string uses bit_cast(...), but this header doesn't include <migraphx/bit_cast.hpp>, so including <migraphx/stringutils.hpp> on its own will fail to compile (and the repo’s header-include tests compile each header standalone). Add the missing include (or avoid bit_cast here) to make the header self-contained.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Copilot · 2026-04-24T15:47:55Z

+        last.end() - 8,
+        [](std::uint64_t acc, std::uint64_t) { return acc >> 8u; },
+        [&](auto) { return bit_length; });


last is bound as a const reference (const auto& last = ...), but transform_partial_sum writes through the output iterator last.end() - 8. For std::array this yields a const_iterator, so this will not compile (and even if it did, it would be writing into a const object). Make last a non-const reference (or select the destination buffer via pointer/index) so the length bytes can be written.

pfultz2 added 14 commits April 18, 2026 22:30

Add md5 sum

46a4c48

Some cleanup

d370466

Refactor process_block

885a7fb

Format

4a6297a

Use std::rotate

28e8fb6

Use to_hex_string

a0d1150

Format

e48d6c3

Use partial_sum

07d3810

Use std::transform

fd92133

Some cleanup

6f8fff7

Format

48561da

Update comment

a1cb4c0

Pass by value

e60d38b

Add unit tests

fa10a9f

pfultz2 requested a review from causten as a code owner April 20, 2026 15:43

Copilot AI review requested due to automatic review settings April 20, 2026 15:43

Copilot started reviewing on behalf of pfultz2 April 20, 2026 15:44 View session

Copilot AI reviewed Apr 20, 2026

View reviewed changes

causten reviewed Apr 20, 2026

View reviewed changes

pfultz2 and others added 4 commits April 23, 2026 10:04

Fix tidy and review feedback

f8884ff

Format

e2095a0

Update year

5b2487f

Merge branch 'develop' into md5

8476d37

pfultz2 requested a review from Copilot April 24, 2026 03:48

Copilot started reviewing on behalf of pfultz2 April 24, 2026 03:49 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Merge branch 'develop' into md5

4a51820

pfultz2 requested a review from Copilot April 24, 2026 15:41

Copilot started reviewing on behalf of pfultz2 April 24, 2026 15:43 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

		const auto chunk_begin = str.begin() + (i * block_size);
		std::transform(chunk_begin, chunk_begin + block_size, block.begin(), &to_uint8);

	// serialized LSB-first is the well-known digest of the empty string.
	// serialized LSB-first yields 0123456789abcdeffedcba9876543210.

-/// Compute the MD5 digest of a string and return it as a lowercase hex string.
+/// Compute the MD5 digest of a string and return it as a lowercase hex string.
+/// This function is intended only for non-cryptographic checksums such as cache
+/// keys or content fingerprints. MD5 is not cryptographically secure and must
+/// not be used for passwords, signatures, or other security-sensitive hashing.

		std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state,
		std::array<std::uint8_t, block_size> block)

Conversation

pfultz2 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Changelog Category

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

causten left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

pfultz2 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pfultz2 commented Apr 20, 2026 •

edited

Loading

codecov Bot commented Apr 20, 2026 •

edited

Loading

pfultz2 commented Apr 22, 2026 •

edited

Loading