Conversation
There was a problem hiding this comment.
Pull request overview
Adds an MD5 hashing utility to MIGraphX, along with a reusable hex-encoding helper, as groundwork for future persistent binary cache keys/checksums.
Changes:
- Introduces public API
migraphx::md5(std::string_view)returning a lowercase hex digest. - Adds
migraphx::to_hex_string(Range, bool lsb)for integer-range hex serialization (including MD5-style little-endian byte layout). - Expands unit tests to cover MD5 RFC vectors and hex encoding behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/include/migraphx/md5.hpp |
Declares the exported md5 API. |
src/md5.cpp |
Implements the MD5 algorithm and hex formatting of the final state. |
src/include/migraphx/stringutils.hpp |
Adds to_hex_string helper used by MD5 (and potentially future cache-key work). |
test/md5.cpp |
Adds extensive MD5 test vectors (RFC 1321 + edge cases). |
test/stringutils.cpp |
Adds tests validating to_hex_string endianness and formatting. |
src/CMakeLists.txt |
Wires md5.cpp into the migraphx library build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const auto chunk_begin = str.begin() + (i * block_size); | ||
| std::transform(chunk_begin, chunk_begin + block_size, block.begin(), &to_uint8); |
There was a problem hiding this comment.
These transforms feed char data into to_uint8(int8_t). On platforms where char is unsigned, bytes >= 0x80 can be altered by the intermediate int8_t conversion, yielding incorrect MD5s for non-ASCII data. Consider using a lambda that casts char -> unsigned char -> std::uint8_t directly.
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | ||
| using T = std::make_unsigned_t<std::decay_t<decltype(x)>>; | ||
| const auto u = static_cast<T>(x); | ||
| const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t { | ||
| return (u >> (b * 8u)) & 0xffu; | ||
| }; | ||
| const auto append_hex = [&](std::string s, std::uint8_t byte) { | ||
| s.push_back(hex_digits[byte >> 4u]); | ||
| s.push_back(hex_digits[byte & 0x0fu]); | ||
| return s; | ||
| }; | ||
| const auto bytes = range(sizeof(T)); | ||
| if(lsb) | ||
| { | ||
| return transform_accumulate( | ||
| bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte); | ||
| } | ||
| const auto rbytes = reverse(bytes); | ||
| return transform_accumulate( | ||
| rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte); | ||
| }); |
There was a problem hiding this comment.
to_hex_string builds the result via std::accumulate and (inside) transform_accumulate, both of which pass/return std::string by value. This causes repeated string copies and can become quadratic for larger inputs. Consider appending into a single std::string (and reserving the final size when possible).
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | |
| using T = std::make_unsigned_t<std::decay_t<decltype(x)>>; | |
| const auto u = static_cast<T>(x); | |
| const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t { | |
| return (u >> (b * 8u)) & 0xffu; | |
| }; | |
| const auto append_hex = [&](std::string s, std::uint8_t byte) { | |
| s.push_back(hex_digits[byte >> 4u]); | |
| s.push_back(hex_digits[byte & 0x0fu]); | |
| return s; | |
| }; | |
| const auto bytes = range(sizeof(T)); | |
| if(lsb) | |
| { | |
| return transform_accumulate( | |
| bytes.begin(), bytes.end(), std::move(acc), append_hex, to_byte); | |
| } | |
| const auto rbytes = reverse(bytes); | |
| return transform_accumulate( | |
| rbytes.begin(), rbytes.end(), std::move(acc), append_hex, to_byte); | |
| }); | |
| using T = std::make_unsigned_t<std::decay_t<decltype(*r.begin())>>; | |
| std::string result; | |
| result.reserve(r.size() * sizeof(T) * 2); | |
| const auto append_hex = [&](std::uint8_t byte) { | |
| result.push_back(hex_digits[byte >> 4u]); | |
| result.push_back(hex_digits[byte & 0x0fu]); | |
| }; | |
| for(const auto& x : r) | |
| { | |
| const auto u = static_cast<T>(x); | |
| const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t { | |
| return (u >> (b * 8u)) & 0xffu; | |
| }; | |
| const auto bytes = range(sizeof(T)); | |
| if(lsb) | |
| { | |
| for(auto b : bytes) | |
| append_hex(to_byte(b)); | |
| } | |
| else | |
| { | |
| for(auto b : reverse(bytes)) | |
| append_hex(to_byte(b)); | |
| } | |
| } | |
| return result; |
| TEST_CASE(to_hex_string_md5_initial_state_lsb) | ||
| { | ||
| // LSB ordering matches MD5's canonical byte layout: the initial state | ||
| // serialized LSB-first is the well-known digest of the empty string. |
There was a problem hiding this comment.
The comment claims this value is "the well-known digest of the empty string", but the expected hex (0123...3210) is the MD5 initial state serialized little-endian, not the empty-string digest (d41d8...). Please adjust the wording so the test documentation matches what is being asserted.
| // serialized LSB-first is the well-known digest of the empty string. | |
| // serialized LSB-first yields 0123456789abcdeffedcba9876543210. |
| namespace migraphx { | ||
| inline namespace MIGRAPHX_INLINE_NS { | ||
|
|
||
| /// Compute the MD5 digest of a string and return it as a lowercase hex string. |
There was a problem hiding this comment.
Since md5 is a public API and MD5 is not suitable for cryptographic security, consider documenting that this function is intended for non-cryptographic checksums (e.g., cache keys) so callers don't misinterpret it as secure hashing.
| /// Compute the MD5 digest of a string and return it as a lowercase hex string. | |
| /// Compute the MD5 digest of a string and return it as a lowercase hex string. | |
| /// This function is intended only for non-cryptographic checksums such as cache | |
| /// keys or content fingerprints. MD5 is not cryptographically secure and must | |
| /// not be used for passwords, signatures, or other security-sensitive hashing. |
| constexpr std::array<char, 16> hex_digits = { | ||
| '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}; | ||
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | ||
| using T = std::make_unsigned_t<std::decay_t<decltype(x)>>; |
There was a problem hiding this comment.
to_hex_string is an unconstrained template but assumes elements are integral types usable with std::make_unsigned_t. If called with a range of non-integral types (or bool), this will fail with hard-to-decipher template errors. Add a constraint/static_assert on the element type to make misuse fail clearly.
| constexpr std::array<char, 16> hex_digits = { | |
| '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}; | |
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | |
| using T = std::make_unsigned_t<std::decay_t<decltype(x)>>; | |
| using element_type = std::decay_t<decltype(*r.begin())>; | |
| static_assert(std::is_integral_v<element_type> and not std::is_same_v<element_type, bool>, | |
| "to_hex_string requires a range of integral, non-bool elements"); | |
| constexpr std::array<char, 16> hex_digits = { | |
| '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}; | |
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | |
| using T = std::make_unsigned_t<element_type>; |
| return {state[0] + a, state[1] + b, state[2] + c, state[3] + d}; | ||
| } | ||
|
|
||
| std::uint8_t to_uint8(std::int8_t c) { return bit_cast<std::uint8_t>(c); } |
There was a problem hiding this comment.
to_uint8 takes std::int8_t. When hashing arbitrary bytes from a std::string_view (which yields char), converting values >= 0x80 through int8_t is implementation-defined if char is unsigned, and can change the input bytes. Prefer converting from char via unsigned char to std::uint8_t (no bit_cast needed).
| std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state, | ||
| std::array<std::uint8_t, block_size> block) |
There was a problem hiding this comment.
process_block takes the 64-byte block by value, which copies the entire block for every compression call. Since this runs once per 64 bytes of input, it can add avoidable overhead for large strings; take block by const& instead.
| std::array<std::uint32_t, 4> process_block(std::array<std::uint32_t, 4> state, | |
| std::array<std::uint8_t, block_size> block) | |
| std::array<std::uint32_t, 4> process_block( | |
| std::array<std::uint32_t, 4> state, const std::array<std::uint8_t, block_size>& block) |
| */ | ||
| #include <migraphx/md5.hpp> | ||
| #include <migraphx/bit_cast.hpp> | ||
| #include <migraphx/stringutils.hpp> |
There was a problem hiding this comment.
This file uses range(...) and transform_partial_sum(...), but relies on those being transitively included via migraphx/stringutils.hpp. To make the dependency explicit (and avoid future breakage if stringutils.hpp stops including them), include the headers that define these utilities directly (e.g., migraphx/ranges.hpp and migraphx/algorithm.hpp).
| #include <migraphx/stringutils.hpp> | |
| #include <migraphx/stringutils.hpp> | |
| #include <migraphx/ranges.hpp> | |
| #include <migraphx/algorithm.hpp> |
causten
left a comment
There was a problem hiding this comment.
Isn't there a library that you can call instead of writing it ourselves from scratch? Does it produce an identical hash as an existing library?
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #4801 +/- ##
===========================================
+ Coverage 92.49% 92.51% +0.01%
===========================================
Files 583 584 +1
Lines 29562 29619 +57
===========================================
+ Hits 27343 27400 +57
Misses 2219 2219
🚀 New features to boost your workflow:
|
We could use libcrypto from openssl or libressl but its a pretty big dependency for just one function(see here where it was removed from miopen by a user for this exact reason). Furthermore adding a openssl or libressl for all platforms gets a little tricky. There is the open source implementation that could be copied(like what miopen did) but it was much easier to have claude write it from scratch instead of trying to fix all the warnings from clang, tidy and cppcheck. Its also a much cleaner and simpler implementation.
Yes it does thats what the tests check. The firsts few tests are from the RFC 1321: The other ones I generated from the echo -n "The quick brown fox jumps over the lazy dog" | md5sum # 9e107d9d372bb6826bd81d3542a419d
echo -n "The quick brown fox jumps over the lazy dog." | md5sum # e4d909c290d0fb1ca068ffaddf22cbd0
printf 'a%.0s' {1..55} | md5sum # ef1772b6dff9a122358552954ad0df65
printf 'a%.0s' {1..56} | md5sum # 3b0c8ac703f828b04c6c197006d17218
printf 'a%.0s' {1..63} | md5sum # b06521f39153d618550606be297466d5
printf 'a%.0s' {1..64} | md5sum # 014842d480b571495a4a0363793f7367
printf 'a%.0s' {1..65} | md5sum # c743a45e0d2e6a95cb859adae0248435
printf 'a%.0s' {1..119} | md5sum # 8a7bd0732ed6a28ce75f6dabc90e1613
printf 'a%.0s' {1..120} | md5sum # 5f61c0ccad4cac44c75ff505e1f1e537 |
| const bool need_two = (remainder >= block_size - 8); | ||
| const std::uint64_t bit_length = std::uint64_t{str.size()} * 8u; | ||
| const auto& last = need_two ? tail[1] : tail[0]; | ||
| const auto bit_indices = range(8); | ||
| transform_partial_sum( | ||
| bit_indices.begin(), | ||
| bit_indices.end(), | ||
| last.end() - 8, | ||
| [](std::uint64_t acc, std::uint64_t) { return acc >> 8u; }, |
There was a problem hiding this comment.
transform_partial_sum writes through the output iterator, but last is bound as const auto& so last.end() - 8 is a const_iterator and the assignment in transform_partial_sum won’t compile. Make last a non-const reference (e.g., auto& last = ...) so the length field can be written into tail[0/1].
| /// Concatenate the lowercase hex representation of each integer in a range. | ||
| /// Each element emits 2*sizeof(element) characters. When lsb is false the | ||
| /// most-significant byte is emitted first (standard hex notation); when true, | ||
| /// the least-significant byte is first (matches byte-stream hash digests). | ||
| template <class Range> | ||
| inline std::string to_hex_string(const Range& r, bool lsb = false) | ||
| { | ||
| constexpr std::array<char, 16> hex_digits = { | ||
| '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}; | ||
| return std::accumulate(r.begin(), r.end(), std::string{}, [&](std::string acc, const auto& x) { | ||
| using type = std::make_unsigned_t<std::decay_t<decltype(x)>>; | ||
| const auto u = bit_cast<type>(x); | ||
| const auto to_byte = [&](std::ptrdiff_t b) -> std::uint8_t { |
There was a problem hiding this comment.
to_hex_string uses bit_cast(...), but this header doesn't include <migraphx/bit_cast.hpp>, so including <migraphx/stringutils.hpp> on its own will fail to compile (and the repo’s header-include tests compile each header standalone). Add the missing include (or avoid bit_cast here) to make the header self-contained.
| last.end() - 8, | ||
| [](std::uint64_t acc, std::uint64_t) { return acc >> 8u; }, | ||
| [&](auto) { return bit_length; }); |
There was a problem hiding this comment.
last is bound as a const reference (const auto& last = ...), but transform_partial_sum writes through the output iterator last.end() - 8. For std::array this yields a const_iterator, so this will not compile (and even if it did, it would be writing into a const object). Make last a non-const reference (or select the destination buffer via pointer/index) so the length bytes can be written.
Motivation
This can help support persistent binary caches in the future.
Technical Details
Changelog Category
Add a
CHANGELOG.mdentry for any option other thanNot Applicable