(fix): fixed inconsistencies in the code and removed debug statements
Signed-off-by: Kannav02 <[email protected]>
Showing 1 changed file with 5 additions and 7 deletions.
e4ccf8d
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.8 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.31it/s]
Evaluating: 100%|██████████| 100/100 [18:29<00:00, 11.09s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 01:02, 1.61test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7099702380952381
Contextual Recall 0.8604999999999999
Hallucination 0.47328030303030305
Metric Passrates:
Contextual Precision 0.68
Contextual Recall 0.83
Hallucination 0.59
e4ccf8d
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.8 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.37it/s]
Evaluating: 100%|██████████| 100/100 [18:57<00:00, 11.37s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:29, 3.39test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7293988095238094
Contextual Recall 0.858
Hallucination 0.5123170995670996
Metric Passrates:
Contextual Precision 0.68
Contextual Recall 0.81
Hallucination 0.6
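
The two runs above can be compared metric by metric. The following is a minimal sketch and not part of the evaluation pipeline: the dictionaries only copy the averages and pass rates printed in the two logs, and the names `run_1`, `run_2`, and `metric_deltas` are hypothetical, introduced here for illustration.

```python
# Values copied from the two log outputs above (first comment, then second).
run_1 = {
    "Contextual Precision": {"avg": 0.7099702380952381, "passrate": 0.68},
    "Contextual Recall": {"avg": 0.8604999999999999, "passrate": 0.83},
    "Hallucination": {"avg": 0.47328030303030305, "passrate": 0.59},
}
run_2 = {
    "Contextual Precision": {"avg": 0.7293988095238094, "passrate": 0.68},
    "Contextual Recall": {"avg": 0.858, "passrate": 0.81},
    "Hallucination": {"avg": 0.5123170995670996, "passrate": 0.6},
}


def metric_deltas(before, after):
    """Return the per-metric change (after - before) for each reported field."""
    return {
        metric: {
            field: after[metric][field] - before[metric][field]
            for field in before[metric]
        }
        for metric in before
    }


deltas = metric_deltas(run_1, run_2)
for metric, change in deltas.items():
    # Signed formatting makes the direction of each change explicit.
    print(f"{metric}: avg {change['avg']:+.4f}, passrate {change['passrate']:+.2f}")
```

Both runs used the same 100 test cases from the EDA Corpus dataset, so the differences printed here reflect run-to-run variation in the agent-retriever and the LLM-based metrics rather than a change in the dataset.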