Merge branch 'master' into griffin/date-filter-integrated-ffs
gtarpenning authored Mar 4, 2025
2 parents 9b58892 + 3a2ee73 commit e0b4e57
Showing 84 changed files with 3,699 additions and 800 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/test.yaml
@@ -311,6 +311,7 @@ jobs:
WB_SERVER_HOST: http://wandbservice
WF_CLICKHOUSE_HOST: localhost
WEAVE_SERVER_DISABLE_ECOSYSTEM: 1
DD_TRACE_ENABLED: false
run: |
nox -e "tests-${{ matrix.python-version-major }}.${{ matrix.python-version-minor }}(shard='${{ matrix.nox-shard }}')" -- \
-m "weave_client and not skip_clickhouse_client" \
@@ -328,6 +329,7 @@ jobs:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
DD_TRACE_ENABLED: false
run: |
nox -e "tests-${{ matrix.python-version-major }}.${{ matrix.python-version-minor }}(shard='${{ matrix.nox-shard }}')"
trace-tests-matrix-check: # This job does nothing and is only used for the branch protection
86 changes: 0 additions & 86 deletions docs/docs/guides/core-types/models.md
@@ -76,92 +76,6 @@ A `Model` is a combination of data (which can include configuration, trained mod
model.predict('world')
```

## Pairwise evaluation of models

When [scoring](../evaluation/scorers.md) models in a Weave [evaluation](../core-types/evaluations.md), absolute value metrics (e.g. `9/10` for Model A and `8/10` for Model B) are typically harder to assign than than relative ones (e.g. Model A performs better than Model B). _Pairwise evaluation_ allows you to compare the outputs of two models by ranking them relative to each other. This approach is particularly useful when you want to determine which model performs better for subjective tasks such as text generation, summarization, or question answering. With pairwise evaluation, you can obtain a relative preference ranking that reveals which model is best for specific inputs.

The following code sample demonstrates how to implement a pairwise evaluation in Weave by creating a [class-based scorer](../evaluation/scorers.md#class-based-scorers) called `PreferenceScorer`. The `PreferenceScorer` compares two models, `ModelA` and `ModelB`, and returns a relative score of the model outputs based on explicit hints in the input text.

```python
from weave import Model, Evaluation, Scorer, Dataset
from weave.flow.model import ApplyModelError, apply_model_async

class ModelA(Model):
@weave.op
def predict(self, input_text: str):
if "Prefer model A" in input_text:
return {"response": "This is a great answer from Model A"}
return {"response": "Meh, whatever"}

class ModelB(Model):
@weave.op
def predict(self, input_text: str):
if "Prefer model B" in input_text:
return {"response": "This is a thoughtful answer from Model B"}
return {"response": "I don't know"}

class PreferenceScorer(Scorer):
@weave.op
async def _get_other_model_output(self, example: dict) -> Any:
"""Get output from the other model for comparison.
Args:
example: The input example data to run through the other model
Returns:
The output from the other model
"""

other_model_result = await apply_model_async(
self.other_model,
example,
None,
)

if isinstance(other_model_result, ApplyModelError):
return None

return other_model_result.model_output

@weave.op
async def score(self, output: dict, input_text: str) -> dict:
"""Compare the output of the primary model with the other model.
Args:
output (dict): The output from the primary model.
other_output (dict): The output from the other model being compared.
inputs (str): The input text used to generate the outputs.
Returns:
dict: A flat dictionary containing the comparison result and reason.
"""
other_output = await self._get_other_model_output(
{"input_text": inputs}
)
if other_output is None:
return {"primary_is_better": False, "reason": "Other model failed"}

if "Prefer model A" in input_text:
primary_is_better = True
reason = "Model A gave a great answer"
else:
primary_is_better = False
reason = "Model B is preferred for this type of question"

return {"primary_is_better": primary_is_better, "reason": reason}

dataset = Dataset(
rows=[
{"input_text": "Prefer model A: Question 1"}, # Model A wins
{"input_text": "Prefer model A: Question 2"}, # Model A wins
{"input_text": "Prefer model B: Question 3"}, # Model B wins
{"input_text": "Prefer model B: Question 4"}, # Model B wins
]
)

model_a = ModelA()
model_b = ModelB()
pref_scorer = PreferenceScorer(other_model=model_b)
evaluation = Evaluation(dataset=dataset, scorers=[pref_scorer])
evaluation.evaluate(model_a)
```
</TabItem>
<TabItem value="typescript" label="TypeScript">
```plaintext
89 changes: 89 additions & 0 deletions docs/docs/guides/tracking/faqs.md
@@ -51,3 +51,92 @@ When your program is exiting it may appear to pause while any remaining enqueued
## How is Weave data ingestion calculated?

We define ingested bytes as bytes that we receive, process, and store on your behalf. This includes trace metadata, LLM inputs/outputs, and any other information you explicitly log to Weave, but does not include communication overhead (e.g., HTTP headers) or any other data that is not placed in long-term storage. We count bytes as "ingested" only once at the time they are received and stored.

## What is pairwise evaluation and how do I do it?

When [scoring](../evaluation/scorers.md) models in a Weave [evaluation](../core-types/evaluations.md), absolute value metrics (e.g. `9/10` for Model A and `8/10` for Model B) are typically harder to assign than relative ones (e.g. Model A performs better than Model B). _Pairwise evaluation_ allows you to compare the outputs of two models by ranking them relative to each other. This approach is particularly useful when you want to determine which model performs better for subjective tasks such as text generation, summarization, or question answering. With pairwise evaluation, you can obtain a relative preference ranking that reveals which model is best for specific inputs.

:::important
This approach is a workaround and may change in future releases. We are actively working on a more robust API to support pairwise evaluations. Stay tuned for updates!
:::

The following code sample demonstrates how to implement a pairwise evaluation in Weave by creating a [class-based scorer](../evaluation/scorers.md#class-based-scorers) called `PreferenceScorer`. The `PreferenceScorer` compares two models, `ModelA` and `ModelB`, and returns a relative score of the model outputs based on explicit hints in the input text.

```python
import asyncio
from typing import Any

import weave
from weave import Model, Evaluation, Scorer, Dataset
from weave.flow.model import ApplyModelError, apply_model_async

class ModelA(Model):
    @weave.op
    def predict(self, input_text: str):
        if "Prefer model A" in input_text:
            return {"response": "This is a great answer from Model A"}
        return {"response": "Meh, whatever"}

class ModelB(Model):
    @weave.op
    def predict(self, input_text: str):
        if "Prefer model B" in input_text:
            return {"response": "This is a thoughtful answer from Model B"}
        return {"response": "I don't know"}

class PreferenceScorer(Scorer):
    # Declared as a field so that `other_model=...` is accepted by the constructor
    other_model: Model

    @weave.op
    async def _get_other_model_output(self, example: dict) -> Any:
        """Get output from the other model for comparison.
        Args:
            example: The input example data to run through the other model
        Returns:
            The output from the other model
        """

        other_model_result = await apply_model_async(
            self.other_model,
            example,
            None,
        )

        if isinstance(other_model_result, ApplyModelError):
            return None

        return other_model_result.model_output

    @weave.op
    async def score(self, output: dict, input_text: str) -> dict:
        """Compare the output of the primary model with the other model.
        Args:
            output (dict): The output from the primary model.
            input_text (str): The input text used to generate the outputs.
        Returns:
            dict: A flat dictionary containing the comparison result and reason.
        """
        other_output = await self._get_other_model_output(
            {"input_text": input_text}
        )
        if other_output is None:
            return {"primary_is_better": False, "reason": "Other model failed"}

        if "Prefer model A" in input_text:
            primary_is_better = True
            reason = "Model A gave a great answer"
        else:
            primary_is_better = False
            reason = "Model B is preferred for this type of question"

        return {"primary_is_better": primary_is_better, "reason": reason}

dataset = Dataset(
    rows=[
        {"input_text": "Prefer model A: Question 1"},  # Model A wins
        {"input_text": "Prefer model A: Question 2"},  # Model A wins
        {"input_text": "Prefer model B: Question 3"},  # Model B wins
        {"input_text": "Prefer model B: Question 4"},  # Model B wins
    ]
)

model_a = ModelA()
model_b = ModelB()
pref_scorer = PreferenceScorer(other_model=model_b)
evaluation = Evaluation(dataset=dataset, scorers=[pref_scorer])
# `Evaluation.evaluate` is a coroutine, so it has to be driven by an event loop
asyncio.run(evaluation.evaluate(model_a))
```
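Once every row carries a `primary_is_better` flag, the per-row results can be rolled up into a single preference figure. The following is a minimal sketch in plain Python, assuming the row-level dictionaries returned by `score` above have been collected into a list. The `win_rate` helper and the `results` list are hypothetical and are not part of the Weave API.

```python
def win_rate(results: list[dict]) -> float:
    """Return the fraction of rows where the primary model was preferred."""
    if not results:
        return 0.0  # no comparisons, so no wins to report
    wins = sum(1 for row in results if row["primary_is_better"])
    return wins / len(results)


# Hypothetical per-row scorer outputs for the four dataset rows above
results = [
    {"primary_is_better": True, "reason": "Model A gave a great answer"},
    {"primary_is_better": True, "reason": "Model A gave a great answer"},
    {"primary_is_better": False, "reason": "Model B is preferred for this type of question"},
    {"primary_is_better": False, "reason": "Model B is preferred for this type of question"},
]

print(win_rate(results))  # 0.5
```

A win rate near 0.5 suggests neither model is clearly preferred on this dataset; values closer to 1.0 or 0.0 favor the primary or the other model respectively.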
1 change: 1 addition & 0 deletions noxfile.py
@@ -72,6 +72,7 @@ def tests(session, shard):
"WB_SERVER_HOST",
"WF_CLICKHOUSE_HOST",
"WEAVE_SERVER_DISABLE_ECOSYSTEM",
"DD_TRACE_ENABLED",
]
}
# Add the GOOGLE_API_KEY environment variable for the "google" shard
1 change: 1 addition & 0 deletions pyproject.toml
@@ -53,6 +53,7 @@ dependencies = [
# this to a separate package. Note, when that happens, we will need to pull along some of the
#default dependencies as well.
trace_server = [
"ddtrace>=2.7.0",
# BYOB - S3
"boto3>=1.34.0",
# BYOB - Azure
64 changes: 60 additions & 4 deletions tests/conftest.py
@@ -32,6 +32,53 @@
os.environ["WANDB_ERROR_REPORTING"] = "false"


@pytest.fixture(autouse=True)
def disable_datadog():
    """
    Disables Datadog logging and tracing for tests.
    This prevents Datadog from polluting test logs with messages like
    'failed to send, dropping 1 traces to intake at...'
    """
    # Save original values to restore later
    original_dd_env = os.environ.get("DD_ENV")
    original_dd_trace = os.environ.get("DD_TRACE_ENABLED")

    # Disable Datadog
    os.environ["DD_ENV"] = "none"
    os.environ["DD_TRACE_ENABLED"] = "false"

    # Silence Datadog loggers
    dd_loggers = [
        "ddtrace",
        "ddtrace.writer",
        "ddtrace.api",
        "ddtrace.internal",
        "datadog",
        "datadog.dogstatsd",
        "datadog.api",
    ]

    original_levels = {}
    for logger_name in dd_loggers:
        logger = logging.getLogger(logger_name)
        original_levels[logger_name] = logger.level
        logger.setLevel(logging.CRITICAL)  # Only show critical errors

    yield

    # Restore original values
    if original_dd_env is not None:
        os.environ["DD_ENV"] = original_dd_env
    elif "DD_ENV" in os.environ:
        del os.environ["DD_ENV"]

    if original_dd_trace is not None:
        os.environ["DD_TRACE_ENABLED"] = original_dd_trace
    elif "DD_TRACE_ENABLED" in os.environ:
        del os.environ["DD_TRACE_ENABLED"]
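
The save, override, and restore pattern used by `disable_datadog` can be sketched in isolation as a context manager. This is an illustrative, standard-library-only sketch, not code from this commit; the `temporary_env` helper is a hypothetical name. Inside a test, pytest's built-in `monkeypatch.setenv` offers similar behavior.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(**overrides: str) -> Iterator[None]:
    """Set environment variables for the duration of a `with` block."""
    # Remember the previous value of each variable (None means "was unset")
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        # Put every variable back exactly as it was found
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


with temporary_env(DD_ENV="none", DD_TRACE_ENABLED="false"):
    assert os.environ["DD_TRACE_ENABLED"] == "false"
```

Variables that were unset beforehand are removed again on exit, which mirrors the `elif ... in os.environ: del ...` branches in the fixture.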


def pytest_addoption(parser):
parser.addoption(
"--weave-server",
@@ -362,14 +409,14 @@ def emit(self, record):
self.log_records[curr_test] = []
self.log_records[curr_test].append(record)

def get_error_logs(self):
def _get_logs(self, levelname: str):
curr_test = get_test_name()
logs = self.log_records.get(curr_test, [])

return [
record
for record in logs
if record.levelname == "ERROR"
if record.levelname == levelname
and record.name.startswith("weave")
# (Tim) For some reason that i cannot figure out, there is some test that
# a) is trying to connect to the PROD trace server
@@ -386,13 +433,22 @@ def get_error_logs(self):
and not "legacy" in record.name
]

def get_error_logs(self):
return self._get_logs("ERROR")

def get_warning_logs(self):
return self._get_logs("WARNING")


@pytest.fixture
def log_collector():
def log_collector(request):
handler = InMemoryWeaveLogCollector()
logger = logging.getLogger() # Get your specific logger here if needed
logger.addHandler(handler)
logger.setLevel(logging.ERROR) # Set the level to capture all logs
if hasattr(request, "param") and request.param == "warning":
logger.setLevel(logging.WARNING)
else:
logger.setLevel(logging.ERROR)
yield handler
logger.removeHandler(handler) # Clean up after the test
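
The level-filtered collector introduced by `_get_logs` can be illustrated with a stripped-down, standard-library-only handler. This is a sketch under assumptions, not the `InMemoryWeaveLogCollector` from this file: the `InMemoryCollector` class and the `weave.example` logger name are hypothetical, and the per-test bookkeeping and name filters are omitted.

```python
import logging


class InMemoryCollector(logging.Handler):
    """Keep every emitted record in a list so tests can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)

    def _get_logs(self, levelname: str) -> list[logging.LogRecord]:
        # Single filtering helper shared by the public accessors
        return [r for r in self.records if r.levelname == levelname]

    def get_error_logs(self) -> list[logging.LogRecord]:
        return self._get_logs("ERROR")

    def get_warning_logs(self) -> list[logging.LogRecord]:
        return self._get_logs("WARNING")


logger = logging.getLogger("weave.example")  # hypothetical logger name
logger.propagate = False  # keep the demo output quiet
handler = InMemoryCollector()
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # capture warnings as well as errors

logger.warning("disk almost full")
logger.error("write failed")
logger.info("dropped: below the WARNING threshold")
logger.removeHandler(handler)

print(len(handler.get_warning_logs()), len(handler.get_error_logs()))  # 1 1
```

With the fixture as changed here, a test would presumably opt in to warning capture through pytest's indirect parametrization, for example `@pytest.mark.parametrize("log_collector", ["warning"], indirect=True)`, which is what populates `request.param`.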

15 changes: 12 additions & 3 deletions tests/trace/test_client_server_caching.py
@@ -58,8 +58,17 @@ def test_server_caching(client):
gotten_dataset = client.get(ref)
assert caching_server.get_cache_recorder() == {
"hits": 0,
# 1 obj read for the dataset
# get the ref
"misses": 1,
"errors": 0,
"skips": 0,
}
caching_server.reset_cache_recorder()
rows = list(gotten_dataset)
assert caching_server.get_cache_recorder() == {
"hits": 0,
# 1 table read for the rows
# 1 table_query_stats for len(rows)
# 5 images
"misses": 7,
"errors": 0,
@@ -71,7 +80,7 @@ def test_server_caching(client):
caching_server.reset_cache_recorder()
compare_datasets(client.get(ref), dataset)
assert caching_server.get_cache_recorder() == {
"hits": 8,
"hits": 7,
"misses": 0,
"errors": 0,
"skips": 0,
@@ -86,7 +95,7 @@ def test_server_caching(client):
"hits": 0,
"misses": 0,
"errors": 0,
"skips": 8,
"skips": 7,
}


5 changes: 5 additions & 0 deletions tests/trace/test_client_trace.py
@@ -123,7 +123,12 @@ def test_dataset(client):
d = Dataset(rows=[{"a": 5, "b": 6}, {"a": 7, "b": 10}])
ref = weave.publish(d)
d2 = weave.ref(ref.uri()).get()

# This might seem redundant, but it is useful to ensure that the
# dataset can be re-iterated over multiple times and equality is preserved.
assert list(d2.rows) == list(d2.rows)
assert list(d.rows) == list(d2.rows)
assert list(d.rows) == list(d.rows)
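
The self-comparison asserts guard against rows that can only be consumed once. The sketch below shows the failure mode they would catch, using plain Python objects rather than Weave's dataset types; the `ReiterableRows` class is a hypothetical stand-in and says nothing about how Weave implements `rows`.

```python
from typing import Iterator


class ReiterableRows:
    """Hand out a fresh iterator on every call to iter()."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def __iter__(self) -> Iterator[dict]:
        return iter(self._rows)


rows = [{"a": 5, "b": 6}, {"a": 7, "b": 10}]

one_shot = iter(rows)  # a bare iterator is exhausted after one pass
reiterable = ReiterableRows(rows)

assert list(one_shot) == rows
assert list(one_shot) == []  # second pass yields nothing

# A re-iterable container passes the same check used in the test above
assert list(reiterable) == list(reiterable) == rows
```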


def test_trace_server_call_start_and_end(client):
2 changes: 1 addition & 1 deletion tests/trace/test_custom_objs.py
@@ -1,6 +1,6 @@
from PIL import Image

from weave.trace.custom_objs import decode_custom_obj, encode_custom_obj
from weave.trace.serialization.custom_objs import decode_custom_obj, encode_custom_obj


def test_decode_custom_obj_known_type(client):