15 changes: 15 additions & 0 deletions Makefile
@@ -40,6 +40,21 @@ test-integration: ## Run only integration tests
test-cov: ## Run tests with coverage report
	uv run pytest --cov

test-system: ## Run system scale tests (requires running server)
	uv run pytest tests/system/ --run-api-tests -v -s

test-system-quick: ## Run quick system scale tests
	SCALE_SHORT_MESSAGES=5 SCALE_MEDIUM_MESSAGES=20 SCALE_LONG_MESSAGES=50 \
	uv run pytest tests/system/ --run-api-tests -v -s

test-system-production: ## Run production-scale system tests
	SCALE_SHORT_MESSAGES=20 SCALE_MEDIUM_MESSAGES=100 SCALE_LONG_MESSAGES=500 \
	SCALE_PARALLEL_SESSIONS=10 SCALE_CONCURRENT_UPDATES=20 \
	uv run pytest tests/system/ --run-api-tests -v -s

test-travel-agent: ## Run travel agent scenario tests only
	uv run pytest tests/system/test_travel_agent_scenarios.py --run-api-tests -v -s

# Running services
server: ## Start the REST API server
	uv run agent-memory api
209 changes: 209 additions & 0 deletions SYSTEM_TESTING.md
@@ -0,0 +1,209 @@
# System Testing for Production Readiness

This document provides an overview of the system testing harness built to validate the Agent Memory Server's production readiness, specifically for the **Long Conversation Memory** use case.

## Overview

System tests validate end-to-end behavior at production-like scale. They complement unit and integration tests by:

- Testing complete user workflows
- Validating performance at scale
- Verifying behavior under concurrent load
- Ensuring correctness after summarization
- Measuring real-world latencies

## Quick Start

### Prerequisites

1. **Running server** on port 8001
2. **Redis** running and accessible
3. **API keys** set (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`)

### Run Tests

```bash
# Quick smoke test (2-3 minutes)
make test-system-quick

# Standard test (5-10 minutes)
make test-system

# Production-scale test (15-30 minutes)
make test-system-production
```

## What's Being Tested

Based on `long_conversation_memory.md`, the tests validate:

### ✅ Storage Performance
- **O(1) latency**: Conversation storage doesn't degrade with length
- **Consistent performance**: Latency remains stable across operations
- **Parallel sessions**: Multiple sessions don't interfere
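The O(1) claim can be exercised with a small timing harness. The sketch below uses a stand-in in-memory store (an assumption for illustration — the real suite issues API calls through agent-memory-client) to show the measurement pattern:

```python
import time

def measure_per_message_latency(store, messages):
    """Time each store call; O(1) storage means per-message latency
    stays flat no matter how long the conversation already is."""
    latencies = []
    for msg in messages:
        start = time.perf_counter()
        store(msg)  # in the real suite: an API call via agent-memory-client
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in store: appending to a list, which has the flat-latency
# profile the benchmark expects from the server.
log = []
latencies = measure_per_message_latency(log.append, [f"msg-{i}" for i in range(100)])
print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.2f}ms")
```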

### ✅ Summarization
- **Automatic triggering**: Summarization occurs when context window fills
- **Summary quality**: Older messages are properly condensed
- **Context preservation**: Important information is retained

### ✅ Message Integrity
- **Recent messages**: Always preserved regardless of summarization
- **Chronological order**: Messages stay in correct sequence
- **No data loss**: All updates are captured

### ✅ Functionality
- **Session reads**: Work correctly after summarization
- **Memory prompts**: Include relevant context
- **Concurrent updates**: Handled without conflicts
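The concurrent-update scenario can be sketched with asyncio. This is a local stand-in (the lock and in-memory session are assumptions; the real tests issue overlapping API calls against a single session) showing the no-data-loss property being checked:

```python
import asyncio

async def append_update(session, msg, lock):
    # Serialize writers so concurrent appends never interleave or drop.
    async with lock:
        session.append(msg)

async def run_concurrent_updates(n):
    session, lock = [], asyncio.Lock()
    await asyncio.gather(
        *(append_update(session, f"update-{i}", lock) for i in range(n))
    )
    return session

session = asyncio.run(run_concurrent_updates(10))
print(f"{len(session)} updates captured, none lost")
```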

## Test Structure

```
tests/system/
├── test_long_conversation_scale.py # Main test suite
├── README.md # Detailed documentation
├── GETTING_STARTED.md # Quick start guide
├── run_scale_tests.sh # Convenience script
└── __init__.py
```

### Test Classes

1. **TestLongConversationPrepare**: Create conversations of various sizes
2. **TestLongConversationRun**: Test operational scenarios
3. **TestLongConversationCheck**: Validate correctness
4. **TestScaleMetrics**: Comprehensive reporting

See `tests/system/README.md` for an architectural overview.
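A check-phase test follows the usual pytest class layout. A minimal sketch (the in-memory messages and field names are hypothetical — the actual assertions in `tests/system/test_long_conversation_scale.py` read the session through the client):

```python
class TestLongConversationCheck:
    def test_message_order_preserved(self):
        # Real suite: fetch the session via agent-memory-client, then assert
        # the returned messages are still in chronological order.
        messages = [{"index": i, "text": f"turn {i}"} for i in range(5)]
        indices = [m["index"] for m in messages]
        assert indices == sorted(indices)

TestLongConversationCheck().test_message_order_preserved()
print("order check passed")
```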

## Configuration

Control test scale with environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `SCALE_SHORT_MESSAGES` | 10 | Messages in short conversations |
| `SCALE_MEDIUM_MESSAGES` | 50 | Messages in medium conversations |
| `SCALE_LONG_MESSAGES` | 200 | Messages in long conversations |
| `SCALE_PARALLEL_SESSIONS` | 5 | Concurrent sessions to create |
| `SCALE_CONCURRENT_UPDATES` | 10 | Simultaneous updates to test |
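A conftest might read these variables as shown below — a sketch using the defaults from the table (the function name is hypothetical; the actual fixtures in `tests/system/` may differ):

```python
import os

def scale_config():
    """Scale parameters with the documented defaults as fallbacks."""
    return {
        "short_messages": int(os.environ.get("SCALE_SHORT_MESSAGES", "10")),
        "medium_messages": int(os.environ.get("SCALE_MEDIUM_MESSAGES", "50")),
        "long_messages": int(os.environ.get("SCALE_LONG_MESSAGES", "200")),
        "parallel_sessions": int(os.environ.get("SCALE_PARALLEL_SESSIONS", "5")),
        "concurrent_updates": int(os.environ.get("SCALE_CONCURRENT_UPDATES", "10")),
    }
```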

## Example Output

```
✅ Short conversation (10 msgs) stored in 0.234s
Latency per message: 23.40ms

✅ Medium conversation (50 msgs) stored in 0.891s
Latency per message: 17.82ms

✅ 5 parallel sessions created
Total time: 1.234s
Average session latency: 0.247s

✅ Summarization test completed
Summary created: True
Messages retained: 23 (started with 100)
Context percentage used: 68.5%

✅ Message order preserved
All messages in chronological order: ✓

========================================
✅ SCALE TEST COMPLETE
========================================
```

## Success Criteria

### Performance Benchmarks

- **Short conversations**: < 100ms per message
- **Medium conversations**: < 50ms per message
- **Long conversations**: < 20ms per message
- **Update operations**: < 200ms average
- **Parallel sessions**: Complete without timeouts
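The per-message budgets can be expressed as a small helper — a sketch with thresholds taken from the list above (the helper name is an assumption, not the suite's actual code):

```python
# Per-message latency budgets in milliseconds, from the benchmarks above.
THRESHOLDS_MS = {"short": 100, "medium": 50, "long": 20}

def meets_benchmark(size, total_seconds, message_count):
    """True if the run's per-message latency fits the budget for its size."""
    per_message_ms = total_seconds / message_count * 1000
    return per_message_ms < THRESHOLDS_MS[size]

# The example output's short run: 10 messages in 0.234s -> 23.4 ms/message.
print(meets_benchmark("short", 0.234, 10))
```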

### Correctness Requirements

- ✅ All messages in chronological order
- ✅ Recent messages always preserved
- ✅ Summarization triggers when needed
- ✅ Memory prompts include context
- ✅ No data loss during concurrent updates

## Integration with CI/CD

### Pre-Deployment Checklist

1. ✅ Run `make test-system-production`
2. ✅ Verify all tests pass
3. ✅ Review performance metrics
4. ✅ Compare to baseline
5. ✅ Document any regressions
6. ✅ Get approval for deployment

### Continuous Monitoring

After deployment, monitor:
- Message storage latency
- Summarization frequency
- Session read performance
- Update operation latency

Compare production metrics to test baselines.

## Documentation

- **[tests/system/README.md](tests/system/README.md)**: Comprehensive documentation
- **[tests/system/GETTING_STARTED.md](tests/system/GETTING_STARTED.md)**: Quick start guide
- **[long_conversation_memory.md](long_conversation_memory.md)**: Requirements specification

## Troubleshooting

### Common Issues

**Server not reachable**
```bash
uv run agent-memory api --port 8001
```

**No API keys**
```bash
export OPENAI_API_KEY=sk-...
```

**Tests timeout**
- Reduce scale parameters
- Check server/Redis performance
- Review logs for bottlenecks

**Summarization not triggering**
- Increase message count or message size
- Reduce `context_window_max`
- Note: summarization only runs once the context window threshold is reached, so small test runs may legitimately never trigger it

## Next Steps

1. **Review** the test output and architecture
2. **Run** quick smoke test to validate setup
3. **Customize** scale parameters for your use case
4. **Establish** baseline metrics for your environment
5. **Integrate** into your CI/CD pipeline
6. **Monitor** production against baselines

## Support

For detailed information:
- See `tests/system/README.md` for full documentation
- Review test code in `tests/system/test_long_conversation_scale.py`
- Check server logs for debugging
- Consult `long_conversation_memory.md` for requirements

---

**Built with**: Python, pytest, agent-memory-client
**Based on**: long_conversation_memory.md user story
**Purpose**: Production readiness validation

49 changes: 49 additions & 0 deletions long_conversation_memory.md
@@ -0,0 +1,49 @@
# Story 1: Long Conversation Memory

## User story
As an agent, I can keep a long conversation in working memory and still get useful recent context after the session grows large.

## Expected functionality
- Long conversations are stored successfully with O(1) per-message latency (storage cost does not grow with conversation length).
- Older content is summarized into context when needed and configured.
- Recent messages stay available and in order regardless of length.
- Reading the session or building a memory prompt still works after summarization, regardless of length.

## Why it matters
This is the basic "the agent remembers the conversation" experience.

## What we expect to break
- Recent messages getting lost.
- Messages coming back in the wrong order.
- Summaries not appearing, or being empty.
- Session reads becoming inconsistent after many updates.

## Pass criteria
- Recent turns are still there.
- Summary appears when the session gets large.
- The session is still readable and useful afterward.
## How to test it

### Prepare
- one short conversation
- one medium conversation
- one very long conversation
- one conversation with a few very large messages

### Run
- repeated updates to one session
- many separate long sessions in parallel
- concurrent updates to the same session

### Check
- Was a summary created?
- Are the last few messages still present?
- Are the messages in the right order?
- Does prompt generation still include the expected context?
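The checks above can be phrased as plain assertions over a session read. A sketch against a hypothetical session payload (the `summary` and `messages` field names, and the `index` key, are assumptions for illustration):

```python
def run_checks(session):
    """Evaluate the checks on a session dict with a 'summary' string and a
    'messages' list whose items carry an 'index' (prompt-generation checks
    would need a live server, so they are omitted here)."""
    indices = [m["index"] for m in session["messages"]]
    return {
        "summary_created": bool(session.get("summary")),
        "recent_present": len(indices) >= 3,
        "order_preserved": indices == sorted(indices),
    }

sample = {
    "summary": "User planned a trip; early turns condensed.",
    "messages": [{"index": i} for i in range(77, 100)],  # recent turns retained
}
print(run_checks(sample))
```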

## Follow-up questions

- How do we define small/medium/large conversations?
- Does summarization kick off every k tokens?
- Consider planning multiple different trips with the same user.
- Consider switching conversations within a single thread.
57 changes: 57 additions & 0 deletions test_results.txt
@@ -0,0 +1,57 @@
Using CPython 3.12.8 interpreter at: /Users/robert.shelton/.pyenv/versions/3.12.8/bin/python3
Creating virtual environment at: .venv
Building agent-memory-client @ file:///Users/robert.shelton/Documents/agent-memory-server/agent-memory-client
Building agent-memory-server @ file:///Users/robert.shelton/Documents/agent-memory-server
Downloading botocore (13.8MiB)
Downloading litellm (10.9MiB)
Downloading beartype (1.3MiB)
Downloading cryptography (6.8MiB)
Downloading grpcio (10.5MiB)
Downloading mypy (11.6MiB)
Downloading ruff (12.2MiB)
Built agent-memory-server @ file:///Users/robert.shelton/Documents/agent-memory-server
Built agent-memory-client @ file:///Users/robert.shelton/Documents/agent-memory-server/agent-memory-client
Downloading beartype
Downloading cryptography
Downloading grpcio
Downloading ruff
Downloading litellm
Downloading botocore
Downloading mypy
Installed 157 packages in 823ms
============================= test session starts ==============================
platform darwin -- Python 3.12.8, pytest-9.0.1, pluggy-1.6.0 -- /Users/robert.shelton/Documents/agent-memory-server/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/robert.shelton/Documents/agent-memory-server
configfile: pytest.ini (WARNING: ignoring pytest config in pyproject.toml!)
plugins: anyio-4.12.0, xdist-3.8.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collecting ... collected 25 items

tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_short_conversation SKIPPED [ 4%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_medium_conversation SKIPPED [ 8%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_long_conversation SKIPPED [ 12%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_very_large_messages SKIPPED [ 16%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_repeated_updates_to_session SKIPPED [ 20%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_parallel_long_sessions SKIPPED [ 24%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_concurrent_updates_same_session SKIPPED [ 28%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_summarization_triggers SKIPPED [ 32%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_message_order_preserved SKIPPED [ 36%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_recent_messages_available SKIPPED [ 40%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_memory_prompt_generation SKIPPED [ 44%]
tests/system/test_long_conversation_scale.py::TestScaleMetrics::test_comprehensive_scale_report SKIPPED [ 48%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentShortConversations::test_weekend_trip_inquiry SKIPPED [ 52%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentShortConversations::test_retrieve_and_search_weekend_trip SKIPPED [ 56%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentMediumConversations::test_family_vacation_planning SKIPPED [ 60%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentMediumConversations::test_incremental_family_planning SKIPPED [ 64%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentLongConversations::test_honeymoon_planning_full_journey SKIPPED [ 68%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentLongConversations::test_honeymoon_with_very_large_itinerary SKIPPED [ 72%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentConcurrentScenarios::test_multiple_agents_updating_booking SKIPPED [ 76%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentConcurrentScenarios::test_parallel_client_conversations SKIPPED [ 80%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentSummarization::test_summarization_with_greece_trip SKIPPED [ 84%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_three_trips_same_client SKIPPED [ 88%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_long_term_memory_creation SKIPPED [ 92%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_context_switching_in_conversation SKIPPED [ 96%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_preference_consistency_across_trips SKIPPED [100%]

============================= 25 skipped in 0.13s ==============================