Skip to content

Add md5 sum function#4801

Open
pfultz2 wants to merge 19 commits intodevelopfrom
md5
Open

Add md5 sum function#4801
pfultz2 wants to merge 19 commits intodevelopfrom
md5

Conversation

@pfultz2
Copy link
Copy Markdown
Collaborator

@pfultz2 pfultz2 commented Apr 20, 2026

Motivation

This can help support persistent binary caches in the future.

Technical Details

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

@pfultz2 pfultz2 requested a review from causten as a code owner April 20, 2026 15:43
Copilot AI review requested due to automatic review settings April 20, 2026 15:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an MD5 hashing utility to MIGraphX, along with a reusable hex-encoding helper, as groundwork for future persistent binary cache keys/checksums.

Changes:

  • Introduces public API migraphx::md5(std::string_view) returning a lowercase hex digest.
  • Adds migraphx::to_hex_string(Range, bool lsb) for integer-range hex serialization (including MD5-style little-endian byte layout).
  • Expands unit tests to cover MD5 RFC vectors and hex encoding behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/include/migraphx/md5.hpp Declares the exported md5 API.
src/md5.cpp Implements the MD5 algorithm and hex formatting of the final state.
src/include/migraphx/stringutils.hpp Adds to_hex_string helper used by MD5 (and potentially future cache-key work).
test/md5.cpp Adds extensive MD5 test vectors (RFC 1321 + edge cases).
test/stringutils.cpp Adds tests validating to_hex_string endianness and formatting.
src/CMakeLists.txt Wires md5.cpp into the migraphx library build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/md5.cpp
Comment on lines +125 to +126
const auto chunk_begin = str.begin() + (i * block_size);
std::transform(chunk_begin, chunk_begin + block_size, block.begin(), &to_uint8);
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These transforms feed char data into to_uint8(int8_t). On platforms where char is unsigned, bytes >= 0x80 can be altered by the intermediate int8_t conversion, yielding incorrect MD5s for non-ASCII data. Consider using a lambda that casts char -> unsigned char -> std::uint8_t directly.

Copilot uses AI. Check for mistakes.
Comment on lines +253 to +273
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;
const auto u = static_cast<T>(x);
const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {
return (u >> (b * 8u)) & 0xffu;
};
const auto append_hex = [&](std::string s, std::uint8_t byte) {
s.push_back(hex_digits[byte >> 4u]);
s.push_back(hex_digits[byte & 0x0fu]);
return s;
};
const auto bytes = range(sizeof(T));
if(lsb)
{
return transform_accumulate(
bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte);
}
const auto rbytes = reverse(bytes);
return transform_accumulate(
rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte);
});
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_hex_string builds the result via std::accumulate and (inside) transform_accumulate, both of which pass/return std::string by value. This causes repeated string copies and can become quadratic for larger inputs. Consider appending into a single std::string (and reserving the final size when possible).

Suggested change
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;
const auto u = static_cast<T>(x);
const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {
return (u >> (b * 8u)) & 0xffu;
};
const auto append_hex = [&](std::string s, std::uint8_t byte) {
s.push_back(hex_digits[byte >> 4u]);
s.push_back(hex_digits[byte & 0x0fu]);
return s;
};
const auto bytes = range(sizeof(T));
if(lsb)
{
return transform_accumulate(
bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte);
}
const auto rbytes = reverse(bytes);
return transform_accumulate(
rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte);
});
using T = std::make_unsigned_t<std::decay_t<decltype(*r.begin())>>;
std::string result;
result.reserve(r.size() * sizeof(T) * 2);
const auto append_hex = [&](std::uint8_t byte) {
result.push_back(hex_digits[byte >> 4u]);
result.push_back(hex_digits[byte & 0x0fu]);
};
for(const auto& x : r)
{
const auto u = static_cast<T>(x);
const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {
return (u >> (b * 8u)) & 0xffu;
};
const auto bytes = range(sizeof(T));
if(lsb)
{
for(auto b : bytes)
append_hex(to_byte(b));
}
else
{
for(auto b : reverse(bytes))
append_hex(to_byte(b));
}
}
return result;

Copilot uses AI. Check for mistakes.
Comment thread test/stringutils.cpp
TEST_CASE(to_hex_string_md5_initial_state_lsb)
{
// LSB ordering matches MD5's canonical byte layout: the initial state
// serialized LSB-first is the well-known digest of the empty string.
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment claims this value is "the well-known digest of the empty string", but the expected hex (0123...3210) is the MD5 initial state serialized little-endian, not the empty-string digest (d41d8...). Please adjust the wording so the test documentation matches what is being asserted.

Suggested change
// serialized LSB-first is the well-known digest of the empty string.
// serialized LSB-first yields 0123456789abcdeffedcba9876543210.

Copilot uses AI. Check for mistakes.
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {

/// Compute the MD5 digest of a string and return it as a lowercase hex string.
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since md5 is a public API and MD5 is not suitable for cryptographic security, consider documenting that this function is intended for non-cryptographic checksums (e.g., cache keys) so callers don't misinterpret it as secure hashing.

Suggested change
/// Compute the MD5 digest of a string and return it as a lowercase hex string.
/// Compute the MD5 digest of a string and return it as a lowercase hex string.
/// This function is intended only for non-cryptographic checksums such as cache
/// keys or content fingerprints. MD5 is not cryptographically secure and must
/// not be used for passwords, signatures, or other security-sensitive hashing.

Copilot uses AI. Check for mistakes.
Comment thread src/include/migraphx/stringutils.hpp Outdated
Comment on lines +251 to +254
constexpr std::array<char, 16> hex_digits = {
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_hex_string is an unconstrained template but assumes elements are integral types usable with std::make_unsigned_t. If called with a range of non-integral types (or bool), this will fail with hard-to-decipher template errors. Add a constraint/static_assert on the element type to make misuse fail clearly.

Suggested change
constexpr std::array<char, 16> hex_digits = {
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using T = std::make_unsigned_t<std::decay_t<decltype(x)>>;
using element_type = std::decay_t<decltype(*r.begin())>;
static_assert(std::is_integral_v<element_type> and not std::is_same_v<element_type, bool>,
"to_hex_string requires a range of integral, non-bool elements");
constexpr std::array<char, 16> hex_digits = {
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using T = std::make_unsigned_t<element_type>;

Copilot uses AI. Check for mistakes.
Comment thread src/md5.cpp Outdated
return {state[0] + a, state[1] + b, state[2] + c, state[3] + d};
}

std::uint8_t to_uint8(std::int8_t c) { return bit_cast<std::uint8_t>(c); }
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_uint8 takes std::int8_t. When hashing arbitrary bytes from a std::string_view (which yields char), converting values >= 0x80 through int8_t is implementation-defined if char is unsigned, and can change the input bytes. Prefer converting from char via unsigned char to std::uint8_t (no bit_cast needed).

Copilot uses AI. Check for mistakes.
Comment thread src/md5.cpp
Comment on lines +68 to +69
std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state,
std::array<std::uint8_t, block_size> block)
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

process_block takes the 64-byte block by value, which copies the entire block for every compression call. Since this runs once per 64 bytes of input, it can add avoidable overhead for large strings; take block by const& instead.

Suggested change
std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state,
std::array<std::uint8_t, block_size> block)
std::array<std::uint32_t, 4> process_block(
std::array<std::uint32_t, 4> state, const std::array<std::uint8_t, block_size>& block)

Copilot uses AI. Check for mistakes.
Comment thread src/md5.cpp
*/
#include <migraphx/md5.hpp>
#include <migraphx/bit_cast.hpp>
#include <migraphx/stringutils.hpp>
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file uses range(...) and transform_partial_sum(...), but relies on those being transitively included via migraphx/stringutils.hpp. To make the dependency explicit (and avoid future breakage if stringutils.hpp stops including them), include the headers that define these utilities directly (e.g., migraphx/ranges.hpp and migraphx/algorithm.hpp).

Suggested change
#include <migraphx/stringutils.hpp>
#include <migraphx/stringutils.hpp>
#include <migraphx/ranges.hpp>
#include <migraphx/algorithm.hpp>

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@causten causten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a library that you can call instead of writing it ourselves from scratch? Does it produce an identical hash as an existing library?

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4801      +/-   ##
===========================================
+ Coverage    92.49%   92.51%   +0.01%     
===========================================
  Files          583      584       +1     
  Lines        29562    29619      +57     
===========================================
+ Hits         27343    27400      +57     
  Misses        2219     2219              
Files with missing lines Coverage Δ
src/include/migraphx/stringutils.hpp 99.01% <100.00%> (+0.17%) ⬆️
src/md5.cpp 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@pfultz2
Copy link
Copy Markdown
Collaborator Author

pfultz2 commented Apr 22, 2026

Isn't there a library that you can call instead of writing it ourselves from scratch?

We could use libcrypto from openssl or libressl but its a pretty big dependency for just one function(see here where it was removed from miopen by a user for this exact reason). Furthermore adding a openssl or libressl for all platforms gets a little tricky.

There is the open source implementation that could be copied(like what miopen did) but it was much easier to have claude write it from scratch instead of trying to fix all the warnings from clang, tidy and cppcheck. Its also a much cleaner and simpler implementation.

Does it produce an identical hash as an existing library?

Yes it does thats what the tests check. The firsts few tests are from the RFC 1321:

MD5 ("") = d41d8cd98f00b204e9800998ecf8427e
MD5 ("a") = 0cc175b9c0f1b6a831c399e269772661
MD5 ("abc") = 900150983cd24fb0d6963f7d28e17f72
MD5 ("message digest") = f96b697d7cb7938d525a2f31aaf161d0
MD5 ("abcdefghijklmnopqrstuvwxyz") = c3fcd3d76192e4007dfb496cca67e13b
MD5 ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789") =
d174ab98d277d9f5a5611c2c9f419d9f
MD5 ("123456789012345678901234567890123456789012345678901234567890123456
78901234567890") = 57edf4a22be3c955ac49da2e2107b67a

The other ones I generated from the md5sum:

echo -n "The quick brown fox jumps over the lazy dog" | md5sum  # 9e107d9d372bb6826bd81d3542a419d
echo -n "The quick brown fox jumps over the lazy dog." | md5sum # e4d909c290d0fb1ca068ffaddf22cbd0
printf 'a%.0s' {1..55}  | md5sum                                # ef1772b6dff9a122358552954ad0df65
printf 'a%.0s' {1..56}  | md5sum                                # 3b0c8ac703f828b04c6c197006d17218
printf 'a%.0s' {1..63}  | md5sum                                # b06521f39153d618550606be297466d5
printf 'a%.0s' {1..64}  | md5sum                                # 014842d480b571495a4a0363793f7367
printf 'a%.0s' {1..65}  | md5sum                                # c743a45e0d2e6a95cb859adae0248435
printf 'a%.0s' {1..119} | md5sum                                # 8a7bd0732ed6a28ce75f6dabc90e1613
printf 'a%.0s' {1..120} | md5sum                                # 5f61c0ccad4cac44c75ff505e1f1e537

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread src/md5.cpp
Comment on lines +138 to +146
const bool need_two = (remainder >= block_size - 8);
const std::uint64_t bit_length = std::uint64_t{str.size()} * 8u;
const auto& last = need_two ? tail[1] : tail[0];
const auto bit_indices = range(8);
transform_partial_sum(
bit_indices.begin(),
bit_indices.end(),
last.end() - 8,
[](std::uint64_t acc, std::uint64_t) { return acc >> 8u; },
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transform_partial_sum writes through the output iterator, but last is bound as const auto& so last.end() - 8 is a const_iterator and the assignment in transform_partial_sum won’t compile. Make last a non-const reference (e.g., auto& last = ...) so the length field can be written into tail[0/1].

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +256
/// Concatenate the lowercase hex representation of each integer in a range.
/// Each element emits 2*sizeof(element) characters. When lsb is false the
/// most-significant byte is emitted first (standard hex notation); when true,
/// the least-significant byte is first (matches byte-stream hash digests).
template <class Range>
inline std::string to_hex_string(const Range& r, bool lsb = false)
{
constexpr std::array<char, 16> hex_digits = {
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) {
using type = std::make_unsigned_t<std::decay_t<decltype(x)>>;
const auto u = bit_cast<type>(x);
const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t {
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_hex_string uses bit_cast(...), but this header doesn't include <migraphx/bit_cast.hpp>, so including <migraphx/stringutils.hpp> on its own will fail to compile (and the repo’s header-include tests compile each header standalone). Add the missing include (or avoid bit_cast here) to make the header self-contained.

Copilot uses AI. Check for mistakes.
@pfultz2 pfultz2 requested a review from Copilot April 24, 2026 15:41
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Comment thread src/md5.cpp
Comment on lines +145 to +147
last.end() - 8,
[](std::uint64_t acc, std::uint64_t) { return acc >> 8u; },
[&](auto) { return bit_length; });
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last is bound as a const reference (const auto& last = ...), but transform_partial_sum writes through the output iterator last.end() - 8. For std::array this yields a const_iterator, so this will not compile (and even if it did, it would be writing into a const object). Make last a non-const reference (or select the destination buffer via pointer/index) so the length bytes can be written.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants