Trustworthy ai demo #280

Draft
wants to merge 13 commits into
base: ignite2024

@changliu2 (Collaborator) commented Nov 20, 2024

Purpose

Summary of improvements:

  • Robustness improvements (retry + error handling/messaging + parsing). [New] Error messaging: if an agent fails, the app reports "Agents failed to produce an article. Examine trace for details. Top-layer error message: xyz. Retrying 1/3 times." See #270 (Users/changliu2/ignite2024 hotfix random runs) for the other improvement details.
  • 2x faster + 97% cheaper 4o-mini agents: 1 min/row (vs. the 2 min/row baseline), while the evaluation results look less perfect but comparable (a cost-latency-performance story by itself). See snapshots for comparison:
    4o agents: [image: 4o-agent-quality]
    4o-mini agents: [image: 4o-mini-agent-quality]
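
For readers who want a concrete picture of the retry and error-messaging behavior, here is a minimal sketch. It is not the code in this PR: `run_with_retry` and `produce_article` are hypothetical names, and the limit of 3 retries is assumed from the "Retrying 1/3 times" message above.

```python
import json

MAX_RETRIES = 3  # assumed from the "Retrying 1/3 times" message


def run_with_retry(produce_article, max_retries=MAX_RETRIES, log=print):
    """Call produce_article() and parse its JSON output, retrying on failure.

    produce_article is a hypothetical callable that returns the LLM's raw
    JSON text. One initial attempt is made, followed by up to max_retries
    retries; the last error is re-raised if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            raw = produce_article()
            return json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
        except Exception as exc:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            log(
                "Agents failed to produce an article. Examine trace for details. "
                f"Top-layer error message: {type(exc).__name__}: {exc}. "
                f"Retrying {attempt + 1}/{max_retries} times."
            )
```

Catching the broad `Exception` keeps the sketch short; a real implementation would likely narrow this to the API and parsing errors it expects.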

Large-scale tests:

  • For the robustness test, I set out to run the agent app 100 times over 11 data points. The job stopped short at 46 runs because the az login credentials expired on the codespace during this 3+ hour job, for a total of 3x11x46 ≈ 1500 agent calls. 2 errors were triggered in those ~1500 calls, an error rate of about 0.13%. In both cases the JSON object returned by the LLM was not properly escaped. The error looks like this: Agents failed to produce an article. Examine trace for details. Top layer error message: OpenAI API hits exception: JSONDecodeError: Expecting property name enclosed in double quotes: line 3 column 522 (char 556). See the attached run log for details (search for "Agents failed").

  • For the latency test, this 3+ hour job averaged about half a minute per data point in the codespace (faster than the 1 min estimate from my laptop). The estimate is based on the timestamp at which the code first errored out. See the attached run log for details.
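
As a quick sanity check on the robustness numbers, the arithmetic can be reproduced as below. The variable names are mine, and reading the factor of 3 as "agent calls per data point" is an assumption; the PR only gives the product 3x11x46.

```python
# Figures taken from the robustness test described above.
calls_per_data_point = 3  # assumed meaning of the "3" in 3x11x46
data_points = 11
completed_runs = 46       # the job stopped at 46 of the planned 100 runs
errors = 2

total_calls = calls_per_data_point * data_points * completed_runs
error_rate = errors / total_calls

print(total_calls)          # 1518
print(f"{error_rate:.2%}")  # 0.13%
```

So 2 failures in roughly 1500 calls is about 0.13%, or about 1.3 failures per 1000 calls.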

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Get the code
git clone https://github.com/Azure-Samples/contoso-creative-writer.git
cd contoso-creative-writer
git checkout [branch-name]
cd src/api
pip install -r requirements.txt
python -m evaluate.evaluate
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

@changliu2 changliu2 marked this pull request as draft November 20, 2024 05:04