15 changes: 15 additions & 0 deletions Makefile
@@ -40,6 +40,21 @@ test-integration: ## Run only integration tests
test-cov: ## Run tests with coverage report
	uv run pytest --cov

test-system: ## Run system scale tests (requires running server)
	uv run pytest tests/system/ --run-api-tests -v -s

test-system-quick: ## Run quick system scale tests
	SCALE_SHORT_MESSAGES=5 SCALE_MEDIUM_MESSAGES=20 SCALE_LONG_MESSAGES=50 \
	uv run pytest tests/system/ --run-api-tests -v -s

test-system-production: ## Run production-scale system tests
	SCALE_SHORT_MESSAGES=20 SCALE_MEDIUM_MESSAGES=100 SCALE_LONG_MESSAGES=500 \
	SCALE_PARALLEL_SESSIONS=10 SCALE_CONCURRENT_UPDATES=20 \
	uv run pytest tests/system/ --run-api-tests -v -s

test-travel-agent: ## Run travel agent scenario tests only
	uv run pytest tests/system/test_travel_agent_scenarios.py --run-api-tests -v -s

# Running services
server: ## Start the REST API server
	uv run agent-memory api
209 changes: 209 additions & 0 deletions SYSTEM_TESTING.md
@@ -0,0 +1,209 @@
# System Testing for Production Readiness

This document provides an overview of the system testing harness built to validate the Agent Memory Server's production readiness, specifically for the **Long Conversation Memory** use case.

## Overview

System tests validate end-to-end behavior at production-like scale. They complement unit and integration tests by:

- Testing complete user workflows
- Validating performance at scale
- Verifying behavior under concurrent load
- Ensuring correctness after summarization
- Measuring real-world latencies

## Quick Start

### Prerequisites

1. **Running server** on port 8001
2. **Redis** running and accessible
3. **API keys** set (`OPENAI_API_KEY` or `ANTHROPIC_API_KEY`)

### Run Tests

```bash
# Quick smoke test (2-3 minutes)
make test-system-quick

# Standard test (5-10 minutes)
make test-system

# Production-scale test (15-30 minutes)
make test-system-production
```

## What's Being Tested

Based on `long_conversation_memory.md`, the tests validate:

### ✅ Storage Performance
- **O(1) latency**: Conversation storage doesn't degrade with length
- **Consistent performance**: Latency remains stable across operations
- **Parallel sessions**: Multiple sessions don't interfere
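The O(1) claim can be exercised with a small timing harness. The sketch below uses a stand-in in-memory store (an assumption for illustration — the real suite issues API calls through agent-memory-client) to show the measurement pattern:

```python
import time

def measure_per_message_latency(store, messages):
    """Time each store call; O(1) storage means per-message latency
    stays flat no matter how long the conversation already is."""
    latencies = []
    for msg in messages:
        start = time.perf_counter()
        store(msg)  # in the real suite: an API call via agent-memory-client
        latencies.append(time.perf_counter() - start)
    return latencies

# Stand-in store: appending to a list, which has the flat-latency
# profile the benchmark expects from the server.
log = []
latencies = measure_per_message_latency(log.append, [f"msg-{i}" for i in range(100)])
print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.2f}ms")
```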

### ✅ Summarization
- **Automatic triggering**: Summarization occurs when context window fills
- **Summary quality**: Older messages are properly condensed
- **Context preservation**: Important information is retained

### ✅ Message Integrity
- **Recent messages**: Always preserved regardless of summarization
- **Chronological order**: Messages stay in correct sequence
- **No data loss**: All updates are captured

### ✅ Functionality
- **Session reads**: Work correctly after summarization
- **Memory prompts**: Include relevant context
- **Concurrent updates**: Handled without conflicts
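The concurrent-update scenario can be sketched with asyncio. This is a local stand-in (the lock and in-memory session are assumptions; the real tests issue overlapping API calls against a single session) showing the no-data-loss property being checked:

```python
import asyncio

async def append_update(session, msg, lock):
    # Serialize writers so concurrent appends never interleave or drop.
    async with lock:
        session.append(msg)

async def run_concurrent_updates(n):
    session, lock = [], asyncio.Lock()
    await asyncio.gather(
        *(append_update(session, f"update-{i}", lock) for i in range(n))
    )
    return session

session = asyncio.run(run_concurrent_updates(10))
print(f"{len(session)} updates captured, none lost")
```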

## Test Structure

```
tests/system/
├── test_long_conversation_scale.py # Main test suite
├── README.md # Detailed documentation
├── GETTING_STARTED.md # Quick start guide
├── run_scale_tests.sh # Convenience script
└── __init__.py
```

### Test Classes

1. **TestLongConversationPrepare**: Create conversations of various sizes
2. **TestLongConversationRun**: Test operational scenarios
3. **TestLongConversationCheck**: Validate correctness
4. **TestScaleMetrics**: Comprehensive reporting

See `tests/system/README.md` for an architectural overview.
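A check-phase test follows the usual pytest class layout. A minimal sketch (the in-memory messages and field names are hypothetical — the actual assertions in `tests/system/test_long_conversation_scale.py` read the session through the client):

```python
class TestLongConversationCheck:
    def test_message_order_preserved(self):
        # Real suite: fetch the session via agent-memory-client, then assert
        # the returned messages are still in chronological order.
        messages = [{"index": i, "text": f"turn {i}"} for i in range(5)]
        indices = [m["index"] for m in messages]
        assert indices == sorted(indices)

TestLongConversationCheck().test_message_order_preserved()
print("order check passed")
```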

## Configuration

Control test scale with environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `SCALE_SHORT_MESSAGES` | 10 | Messages in short conversations |
| `SCALE_MEDIUM_MESSAGES` | 50 | Messages in medium conversations |
| `SCALE_LONG_MESSAGES` | 200 | Messages in long conversations |
| `SCALE_PARALLEL_SESSIONS` | 5 | Concurrent sessions to create |
| `SCALE_CONCURRENT_UPDATES` | 10 | Simultaneous updates to test |
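A conftest might read these variables as shown below — a sketch using the defaults from the table (the function name is hypothetical; the actual fixtures in `tests/system/` may differ):

```python
import os

def scale_config():
    """Scale parameters with the documented defaults as fallbacks."""
    return {
        "short_messages": int(os.environ.get("SCALE_SHORT_MESSAGES", "10")),
        "medium_messages": int(os.environ.get("SCALE_MEDIUM_MESSAGES", "50")),
        "long_messages": int(os.environ.get("SCALE_LONG_MESSAGES", "200")),
        "parallel_sessions": int(os.environ.get("SCALE_PARALLEL_SESSIONS", "5")),
        "concurrent_updates": int(os.environ.get("SCALE_CONCURRENT_UPDATES", "10")),
    }
```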

## Example Output

```
✅ Short conversation (10 msgs) stored in 0.234s
Latency per message: 23.40ms

✅ Medium conversation (50 msgs) stored in 0.891s
Latency per message: 17.82ms

✅ 5 parallel sessions created
Total time: 1.234s
Average session latency: 0.247s

✅ Summarization test completed
Summary created: True
Messages retained: 23 (started with 100)
Context percentage used: 68.5%

✅ Message order preserved
All messages in chronological order: ✓

========================================
✅ SCALE TEST COMPLETE
========================================
```

## Success Criteria

### Performance Benchmarks

- **Short conversations**: < 100ms per message
- **Medium conversations**: < 50ms per message
- **Long conversations**: < 20ms per message
- **Update operations**: < 200ms average
- **Parallel sessions**: Complete without timeouts
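The per-message budgets can be expressed as a small helper — a sketch with thresholds taken from the list above (the helper name is an assumption, not the suite's actual code):

```python
# Per-message latency budgets in milliseconds, from the benchmarks above.
THRESHOLDS_MS = {"short": 100, "medium": 50, "long": 20}

def meets_benchmark(size, total_seconds, message_count):
    """True if the run's per-message latency fits the budget for its size."""
    per_message_ms = total_seconds / message_count * 1000
    return per_message_ms < THRESHOLDS_MS[size]

# The example output's short run: 10 messages in 0.234s -> 23.4 ms/message.
print(meets_benchmark("short", 0.234, 10))
```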

### Correctness Requirements

- ✅ All messages in chronological order
- ✅ Recent messages always preserved
- ✅ Summarization triggers when needed
- ✅ Memory prompts include context
- ✅ No data loss during concurrent updates

## Integration with CI/CD

### Pre-Deployment Checklist

1. ✅ Run `make test-system-production`
2. ✅ Verify all tests pass
3. ✅ Review performance metrics
4. ✅ Compare to baseline
5. ✅ Document any regressions
6. ✅ Get approval for deployment

### Continuous Monitoring

After deployment, monitor:
- Message storage latency
- Summarization frequency
- Session read performance
- Update operation latency

Compare production metrics to test baselines.

## Documentation

- **[tests/system/README.md](tests/system/README.md)**: Comprehensive documentation
- **[tests/system/GETTING_STARTED.md](tests/system/GETTING_STARTED.md)**: Quick start guide
- **[long_conversation_memory.md](long_conversation_memory.md)**: Requirements specification

## Troubleshooting

### Common Issues

**Server not reachable**
```bash
uv run agent-memory api --port 8001
```

**No API keys**
```bash
export OPENAI_API_KEY=sk-...
```

**Tests timeout**
- Reduce scale parameters
- Check server/Redis performance
- Review logs for bottlenecks

**Summarization not triggering**
- Increase message count or message size
- Reduce `context_window_max`
- Note: summarization only runs once the context window threshold is reached, so small test runs may legitimately never trigger it

## Next Steps

1. **Review** the test output and architecture
2. **Run** quick smoke test to validate setup
3. **Customize** scale parameters for your use case
4. **Establish** baseline metrics for your environment
5. **Integrate** into your CI/CD pipeline
6. **Monitor** production against baselines

## Support

For detailed information:
- See `tests/system/README.md` for full documentation
- Review test code in `tests/system/test_long_conversation_scale.py`
- Check server logs for debugging
- Consult `long_conversation_memory.md` for requirements

---

**Built with**: Python, pytest, agent-memory-client
**Based on**: long_conversation_memory.md user story
**Purpose**: Production readiness validation

49 changes: 49 additions & 0 deletions long_conversation_memory.md
@@ -0,0 +1,49 @@
# Story 1: Long Conversation Memory

## User story
As an agent, I can keep a long conversation in working memory and still get useful recent context after the session grows large.

## Expected functionality
- Long conversations are stored successfully with O(1) per-message latency (storage cost does not grow with conversation length).
- Older content is summarized into context when needed and configured.
- Recent messages stay available and in order regardless of length.
- Reading the session or building a memory prompt still works after summarization, regardless of length.

## Why it matters
This is the basic "the agent remembers the conversation" experience.

## What we expect to break
- Recent messages getting lost.
- Messages coming back in the wrong order.
- Summaries not appearing, or being empty.
- Session reads becoming inconsistent after many updates.

## Pass criteria
- Recent turns are still there.
- Summary appears when the session gets large.
- The session is still readable and useful afterward.
## How to test it

### Prepare
- one short conversation
- one medium conversation
- one very long conversation
- one conversation with a few very large messages

### Run
- repeated updates to one session
- many separate long sessions in parallel
- concurrent updates to the same session

### Check
- Was a summary created?
- Are the last few messages still present?
- Are the messages in the right order?
- Does prompt generation still include the expected context?
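The checks above can be phrased as plain assertions over a session read. A sketch against a hypothetical session payload (the `summary` and `messages` field names, and the `index` key, are assumptions for illustration):

```python
def run_checks(session):
    """Evaluate the checks on a session dict with a 'summary' string and a
    'messages' list whose items carry an 'index' (prompt-generation checks
    would need a live server, so they are omitted here)."""
    indices = [m["index"] for m in session["messages"]]
    return {
        "summary_created": bool(session.get("summary")),
        "recent_present": len(indices) >= 3,
        "order_preserved": indices == sorted(indices),
    }

sample = {
    "summary": "User planned a trip; early turns condensed.",
    "messages": [{"index": i} for i in range(77, 100)],  # recent turns retained
}
print(run_checks(sample))
```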

## Follow-up questions

- How do we define small/medium/large conversations?
- Does summarization kick off every k tokens?
- Consider planning multiple different trips with the same user.
- Consider switching conversations within a single thread.
57 changes: 57 additions & 0 deletions test_results.txt
@@ -0,0 +1,57 @@
Using CPython 3.12.8 interpreter at: /Users/robert.shelton/.pyenv/versions/3.12.8/bin/python3
Creating virtual environment at: .venv
Building agent-memory-client @ file:///Users/robert.shelton/Documents/agent-memory-server/agent-memory-client
Building agent-memory-server @ file:///Users/robert.shelton/Documents/agent-memory-server
Downloading botocore (13.8MiB)
Downloading litellm (10.9MiB)
Downloading beartype (1.3MiB)
Downloading cryptography (6.8MiB)
Downloading grpcio (10.5MiB)
Downloading mypy (11.6MiB)
Downloading ruff (12.2MiB)
Built agent-memory-server @ file:///Users/robert.shelton/Documents/agent-memory-server
Built agent-memory-client @ file:///Users/robert.shelton/Documents/agent-memory-server/agent-memory-client
Downloading beartype
Downloading cryptography
Downloading grpcio
Downloading ruff
Downloading litellm
Downloading botocore
Downloading mypy
Installed 157 packages in 823ms
============================= test session starts ==============================
platform darwin -- Python 3.12.8, pytest-9.0.1, pluggy-1.6.0 -- /Users/robert.shelton/Documents/agent-memory-server/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/robert.shelton/Documents/agent-memory-server
configfile: pytest.ini (WARNING: ignoring pytest config in pyproject.toml!)
plugins: anyio-4.12.0, xdist-3.8.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
collecting ... collected 25 items

tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_short_conversation SKIPPED [ 4%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_medium_conversation SKIPPED [ 8%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_long_conversation SKIPPED [ 12%]
tests/system/test_long_conversation_scale.py::TestLongConversationPrepare::test_very_large_messages SKIPPED [ 16%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_repeated_updates_to_session SKIPPED [ 20%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_parallel_long_sessions SKIPPED [ 24%]
tests/system/test_long_conversation_scale.py::TestLongConversationRun::test_concurrent_updates_same_session SKIPPED [ 28%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_summarization_triggers SKIPPED [ 32%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_message_order_preserved SKIPPED [ 36%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_recent_messages_available SKIPPED [ 40%]
tests/system/test_long_conversation_scale.py::TestLongConversationCheck::test_memory_prompt_generation SKIPPED [ 44%]
tests/system/test_long_conversation_scale.py::TestScaleMetrics::test_comprehensive_scale_report SKIPPED [ 48%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentShortConversations::test_weekend_trip_inquiry SKIPPED [ 52%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentShortConversations::test_retrieve_and_search_weekend_trip SKIPPED [ 56%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentMediumConversations::test_family_vacation_planning SKIPPED [ 60%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentMediumConversations::test_incremental_family_planning SKIPPED [ 64%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentLongConversations::test_honeymoon_planning_full_journey SKIPPED [ 68%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentLongConversations::test_honeymoon_with_very_large_itinerary SKIPPED [ 72%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentConcurrentScenarios::test_multiple_agents_updating_booking SKIPPED [ 76%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentConcurrentScenarios::test_parallel_client_conversations SKIPPED [ 80%]
tests/system/test_travel_agent_scenarios.py::TestTravelAgentSummarization::test_summarization_with_greece_trip SKIPPED [ 84%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_three_trips_same_client SKIPPED [ 88%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_long_term_memory_creation SKIPPED [ 92%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_context_switching_in_conversation SKIPPED [ 96%]
tests/system/test_travel_agent_scenarios.py::TestReturningClientScenarios::test_preference_consistency_across_trips SKIPPED [100%]

============================= 25 skipped in 0.13s ==============================