diff --git a/CONTRIBUTE.md b/CONTRIBUTE.md index d353c9d116..d7e4835115 100644 --- a/CONTRIBUTE.md +++ b/CONTRIBUTE.md @@ -183,37 +183,37 @@ Below is a list of resources that can be useful for development and debugging. ## Docs (Docsite)[https://chronon.ai] -(doc directory)[https://github.com/airbnb/chronon/tree/master/docs/source] +(doc directory)[https://github.com/airbnb/chronon/tree/main/docs/source] (Code of conduct)[TODO] ## Links: (pip project)[https://pypi.org/project/chronon-ai/] -(maven central)[https://mvnrepository.com/artifact/ai.chronon/]: (publishing)[https://github.com/airbnb/chronon/blob/master/devnotes.md#publishing-all-the-artifacts-of-chronon] -(Docsite: publishing)[https://github.com/airbnb/chronon/blob/master/devnotes.md#chronon-artifacts-publish-process] +(maven central)[https://mvnrepository.com/artifact/ai.chronon/]: (publishing)[https://github.com/airbnb/chronon/blob/main/devnotes.md#publishing-all-the-artifacts-of-chronon] +(Docsite: publishing)[https://github.com/airbnb/chronon/blob/main/devnotes.md#chronon-artifacts-publish-process] ## Code Pointers -Api - (Thrift)[https://github.com/airbnb/chronon/blob/master/api/thrift/api.thrift#L180], (Python)[https://github.com/airbnb/chronon/blob/master/api/py/ai/chronon/group_by.py] -(CLI driver entry point for job launching.)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/Driver.scala] +Api - (Thrift)[https://github.com/airbnb/chronon/blob/main/api/thrift/api.thrift#L180], (Python)[https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/group_by.py] +(CLI driver entry point for job launching.)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Driver.scala] **Offline flows that produce hive tables or file output** -(GroupBy)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/GroupBy.scala] -(Staging 
Query)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/StagingQuery.scala] -(Join backfills)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/Join.scala] -(Metadata Export)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/MetadataExporter.scala] +(GroupBy)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupBy.scala] +(Staging Query)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/StagingQuery.scala] +(Join backfills)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/Join.scala] +(Metadata Export)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/MetadataExporter.scala] Online flows that update and read data & metadata from the kvStore -(GroupBy window tail upload )[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/GroupByUpload.scala] -(Streaming window head upload)[https://github.com/airbnb/chronon/blob/master/spark/src/main/scala/ai/chronon/spark/streaming/GroupBy.scala] -(Fetching)[https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Fetcher.scala] +(GroupBy window tail upload )[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/GroupByUpload.scala] +(Streaming window head upload)[https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/streaming/GroupBy.scala] +(Fetching)[https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala] Aggregations -(time based aggregations)[https://github.com/airbnb/chronon/blob/master/aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala] -(time independent aggregations)[https://github.com/airbnb/chronon/blob/master/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala] -(integration point 
with rest of chronon)[https://github.com/airbnb/chronon/blob/master/aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala#L223] -(Windowing)[https://github.com/airbnb/chronon/tree/master/aggregator/src/main/scala/ai/chronon/aggregator/windowing] +(time based aggregations)[https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/TimedAggregators.scala] +(time independent aggregations)[https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala] +(integration point with rest of chronon)[https://github.com/airbnb/chronon/blob/main/aggregator/src/main/scala/ai/chronon/aggregator/row/ColumnAggregator.scala#L223] +(Windowing)[https://github.com/airbnb/chronon/tree/main/aggregator/src/main/scala/ai/chronon/aggregator/windowing] **Testing** -(Testing - sbt commands)[https://github.com/airbnb/chronon/blob/master/devnotes.md#testing] +(Testing - sbt commands)[https://github.com/airbnb/chronon/blob/main/devnotes.md#testing] (Automated testing - circle-ci pipelines)[https://app.circleci.com/pipelines/github/airbnb/chronon] -(Dev Setup)[https://github.com/airbnb/chronon/blob/master/devnotes.md#prerequisites] +(Dev Setup)[https://github.com/airbnb/chronon/blob/main/devnotes.md#prerequisites] diff --git a/README.md b/README.md index 65609bdfc4..8d4a1f11a2 100644 --- a/README.md +++ b/README.md @@ -59,7 +59,7 @@ Does not include: ## Setup -To get started with the Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/master/docker-compose.yml) file and run it locally: +To get started with the Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/main/docker-compose.yml) file and run it locally: ```bash curl -o docker-compose.yml https://chronon.ai/docker-compose.yml @@ -74,7 +74,7 @@ In this example, let's assume that we're a large online retailer, and we've dete 
## Raw data sources -Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/data) directory. It includes four tables: +Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/data) directory. It includes four tables: 1. Users - includes basic information about users such as account created date; modeled as a batch data source that updates daily 2. Purchases - a log of all purchases by users; modeled as a log table with a streaming (i.e. Kafka) event-bus counterpart @@ -141,11 +141,11 @@ v1 = GroupBy( ) ``` -See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data). +See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data). **Feature set 2: Returns data features** -We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example. +We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example. **Feature set 3: User data features** @@ -167,7 +167,7 @@ v1 = GroupBy( ) ``` -Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/users.py). 
+Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py). ### Step 2 - Join the features together @@ -200,7 +200,7 @@ v1 = Join( ) ``` -Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/joins/quickstart/training_set.py). +Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py). The `left` side of the join is what defines the timestamps and primary keys for the backfill (notice that it is built on top of the `checkout` event, as dictated by our use case). @@ -370,7 +370,7 @@ Using chronon for your feature engineering work simplifies and improves your ML 4. Chronon exposes easy endpoints for feature fetching. 5. Consistency is guaranteed and measurable. -For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/master?tab=readme-ov-file#benefits-of-chronon-over-other-approaches). +For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/main?tab=readme-ov-file#benefits-of-chronon-over-other-approaches). 
# Benefits of Chronon over other approaches diff --git a/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala b/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala index aa9d2979f7..f2501804a0 100644 --- a/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala +++ b/aggregator/src/main/scala/ai/chronon/aggregator/base/SimpleAggregators.scala @@ -411,7 +411,7 @@ class FrequentItems[T: FrequentItemsFriendly](val mapSize: Int, val errorType: E // See: Back to the future: an even more nearly optimal cardinality estimation algorithm, 2017 // https://arxiv.org/abs/1708.06839 // refer to the chart here to tune your sketch size with lgK -// https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180 +// https://github.com/apache/incubator-datasketches-java/blob/main/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180 // default is about 1200 bytes class ApproxDistinctCount[Input: CpcFriendly](lgK: Int = 8) extends SimpleAggregator[Input, CpcSketch, Long] { override def outputType: DataType = LongType diff --git a/aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregationBuffer.scala b/aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregationBuffer.scala index a9ad3f2bc4..bd039f8752 100644 --- a/aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregationBuffer.scala +++ b/aggregator/src/main/scala/ai/chronon/aggregator/windowing/TwoStackLiteAggregationBuffer.scala @@ -22,7 +22,7 @@ import java.util case class BankersEntry[IR](var value: IR, ts: Long) -// ported from: https://github.com/IBM/sliding-window-aggregators/blob/master/rust/src/two_stacks_lite/mod.rs with some +// ported from: https://github.com/IBM/sliding-window-aggregators/blob/main/rust/src/two_stacks_lite/mod.rs with some // modification to work with simple aggregator class 
TwoStackLiteAggregationBuffer[Input, IR >: Null, Output >: Null](aggregator: SimpleAggregator[Input, IR, Output], maxSize: Int) { diff --git a/airflow/helpers.py b/airflow/helpers.py index 34e262beb0..15dabc64ae 100644 --- a/airflow/helpers.py +++ b/airflow/helpers.py @@ -66,7 +66,7 @@ def safe_part(p): return re.sub("[^A-Za-z0-9_]", "__", safe_name) -# https://github.com/airbnb/chronon/blob/master/api/src/main/scala/ai/chronon/api/Extensions.scala +# https://github.com/airbnb/chronon/blob/main/api/src/main/scala/ai/chronon/api/Extensions.scala def sanitize(name): return re.sub("[^a-zA-Z0-9_]", "_", name) diff --git a/api/py/ai/chronon/group_by.py b/api/py/ai/chronon/group_by.py index 3b95bbba08..18194f60f4 100644 --- a/api/py/ai/chronon/group_by.py +++ b/api/py/ai/chronon/group_by.py @@ -61,7 +61,7 @@ class Operation: APPROX_UNIQUE_COUNT = ttypes.Operation.APPROX_UNIQUE_COUNT # refer to the chart here to tune your sketch size with lgK # default is 8 - # https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180 + # https://github.com/apache/incubator-datasketches-java/blob/main/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180 APPROX_UNIQUE_COUNT_LGK = collector(ttypes.Operation.APPROX_UNIQUE_COUNT) UNIQUE_COUNT = ttypes.Operation.UNIQUE_COUNT COUNT = ttypes.Operation.COUNT diff --git a/api/py/setup.py b/api/py/setup.py index 208b8152a3..f38e6c7539 100644 --- a/api/py/setup.py +++ b/api/py/setup.py @@ -27,7 +27,7 @@ __version__ = "local" -__branch__ = "master" +__branch__ = "main" def get_version(): version_str = os.environ.get("CHRONON_VERSION_STR", __version__) branch_str = os.environ.get("CHRONON_BRANCH_STR", __branch__) diff --git a/build.sbt b/build.sbt index b0db70fd1a..a35443db0f 100644 --- a/build.sbt +++ b/build.sbt @@ -94,8 +94,8 @@ git.gitTagToVersionNumber := { tag: String => // Git plugin will automatically add SNAPSHOT for dirty workspaces so remove it to avoid 
duplication. val versionStr = if (git.gitUncommittedChanges.value) version.value.replace("-SNAPSHOT", "") else version.value val branchTag = git.gitCurrentBranch.value.replace("/", "-") - if (branchTag == "master") { - // For master branches, we tag the packages as - + if (branchTag == "main" || branchTag == "master") { + // For main branches, we tag the packages as - Some(s"${versionStr}") } else { // For user branches, we tag the packages as -- diff --git a/build.sh b/build.sh index f403a035a1..8a1ca7abbd 100755 --- a/build.sh +++ b/build.sh @@ -7,8 +7,8 @@ set -euxo pipefail BRANCH="$(git rev-parse --abbrev-ref HEAD)" -if [[ "$BRANCH" != "master" ]]; then - echo "$(tput bold) You are not on master!" +if [[ "$BRANCH" != "main" ]]; then + echo "$(tput bold) You are not on main branch!" echo "$(tput sgr0) Are you sure you want to release? (y to continue)" read response if [[ "$response" != "y" ]]; then diff --git a/devnotes.md b/devnotes.md index d862558adb..88bbe7f82f 100644 --- a/devnotes.md +++ b/devnotes.md @@ -104,7 +104,7 @@ sbt python_api Note: This will create the artifacts with the version specific naming specified under `version.sbt` ```text -Builds on master will result in: +Builds on main branch will result in: -.jar [JARs] chronon_2.11-0.7.0-SNAPSHOT.jar [Python] chronon-ai-0.7.0-SNAPSHOT.tar.gz @@ -227,15 +227,15 @@ This command will take into the account of `version.sbt` and handles a series of 2. Select "refresh" and "release" 3. Wait for 30 mins to sync to [maven](https://repo1.maven.org/maven2/) or [sonatype UI](https://search.maven.org/search?q=g:ai.chronon) 4. Push the local release commits (DO NOT SQUASH), and the new tag created from step 1 to Github. - 1. chronon repo disallow push to master directly, so instead push commits to a branch `git push origin master:your-name--release-xxx` + 1. chronon repo disallow push to main branch directly, so instead push commits to a branch `git push origin main:your-name--release-xxx` 2. 
your PR should contain exactly two commits, 1 setting the release version, 1 setting the new snapshot version. 3. make sure to use **Rebase pull request** instead of the regular Merge or Squash options when merging the PR. -5. Push release tag to master branch +5. Push release tag to main branch 1. tag new version to release commit `Setting version to 0.0.xx`. If not already tagged, can be added by ``` git tag -fa v0.0.xx ``` - 2. push tag to master + 2. push tag ``` git push origin ``` diff --git a/docs/source/Code_Guidelines.md b/docs/source/Code_Guidelines.md index 8d23637eb2..ccfa2a6baa 100644 --- a/docs/source/Code_Guidelines.md +++ b/docs/source/Code_Guidelines.md @@ -69,4 +69,4 @@ in terms of power. Also Spark APIs are mainly in Scala2. Every new behavior should be unit-tested. We have implemented a fuzzing framework that can produce data randomly as scala objects or spark tables - [see](../../spark/src/test/scala/ai/chronon/spark/test/DataFrameGen.scala). Use it for testing. -Python code is also covered by tests - [see](https://github.com/airbnb/chronon/tree/master/api/py/test). \ No newline at end of file +Python code is also covered by tests - [see](https://github.com/airbnb/chronon/tree/main/api/py/test). 
\ No newline at end of file diff --git a/docs/source/authoring_features/ChainingFeatures.md b/docs/source/authoring_features/ChainingFeatures.md index 54bdc69e5f..8abac087dd 100644 --- a/docs/source/authoring_features/ChainingFeatures.md +++ b/docs/source/authoring_features/ChainingFeatures.md @@ -79,9 +79,9 @@ enriched_listings = Join( ``` ### Configuration Example -[Chaining GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/sample_team/sample_chaining_group_by.py) +[Chaining GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/sample_team/sample_chaining_group_by.py) -[Chaining Join](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/joins/sample_team/sample_chaining_join.py) +[Chaining Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/sample_team/sample_chaining_join.py) ## Clarifications - The goal of chaining is to use output of a Join as input to downstream computations like GroupBy or a Join. As of today we support the case 1 and case 2 in future plan diff --git a/docs/source/authoring_features/GroupBy.md b/docs/source/authoring_features/GroupBy.md index 07f724f949..6c39480fbc 100644 --- a/docs/source/authoring_features/GroupBy.md +++ b/docs/source/authoring_features/GroupBy.md @@ -27,7 +27,7 @@ This can be achieved by using the output of one `GroupBy` as the input to the ne ## Supported aggregations -All supported aggregations are defined [here](https://github.com/airbnb/chronon/blob/master/api/thrift/api.thrift#L51). +All supported aggregations are defined [here](https://github.com/airbnb/chronon/blob/main/api/thrift/api.thrift#L51). Chronon supports powerful aggregation patterns and the section below goes into detail of the properties and behaviors of aggregations. @@ -181,7 +181,7 @@ If you look at the parameters column in the above table - you will see `k`. 
For approx_unique_count and approx_percentile - k stands for the size of the `sketch` - the larger this is, the more accurate and expensive to compute the results will be. Mapping between k and size for approx_unique_count is -[here](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180) +[here](https://github.com/apache/incubator-datasketches-java/blob/main/src/main/java/org/apache/datasketches/cpc/CpcSketch.java#L180) for approx_percentile is the first table in [here](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html). `percentiles` for `approx_percentile` is an array of doubles between 0 and 1, where you want percentiles at. (Ex: "[0.25, 0.5, 0.75]") @@ -193,7 +193,7 @@ The following examples are broken down by source type. We strongly suggest makin ## Realtime Event GroupBy examples -This example is based on the [returns](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/returns.py) GroupBy from the quickstart guide that performs various aggregations over the `refund_amt` column over various windows. +This example is based on the [returns](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py) GroupBy from the quickstart guide that performs various aggregations over the `refund_amt` column over various windows. ```python source = Source( @@ -236,7 +236,7 @@ v1 = GroupBy( ## Bucketed GroupBy Example -In this example we take the [Purchases GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/purchases.py) from the Quickstart tutorial and modify it to include buckets based on a hypothetical `"credit_card_type"` column. 
+In this example we take the [Purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py) from the Quickstart tutorial and modify it to include buckets based on a hypothetical `"credit_card_type"` column. ```python source = Source( @@ -283,7 +283,7 @@ v1 = GroupBy( ## Simple Batch Event GroupBy examples -Example GroupBy with windowed aggregations. Taken from [purchases.py](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/purchases.py). +Example GroupBy with windowed aggregations. Taken from [purchases.py](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). Important things to note about this case relative to the streaming GroupBy: * The default accuracy here is `SNAPSHOT` meaning that updates to the online KV store only happen in batch, and also backfills will be midnight accurate rather than intra day accurate. @@ -329,7 +329,7 @@ v1 = GroupBy( ### Batch Entity GroupBy examples -This is taken from the [Users GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/users.py) from the quickstart tutorial. +This is taken from the [Users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py) from the quickstart tutorial. ```python diff --git a/docs/source/authoring_features/Join.md b/docs/source/authoring_features/Join.md index f86181463f..57e2aaa061 100644 --- a/docs/source/authoring_features/Join.md +++ b/docs/source/authoring_features/Join.md @@ -6,7 +6,7 @@ Let's use an example to explain this further. In the [Quickstart](../getting_sta This is important because it means that when we serve the model online, inference will be made at checkout time, and therefore backfilled features for training data should correspond to a historical checkout event, with features computed as of those checkout times. 
In other words, every row of training data for the model has identical feature values to what the model would have seen had it made a production inference request at that time. -To see how we do this, let's take a look at the left side of the join definition (taken from [Quickstart Training Set Join](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/joins/quickstart/training_set.py)). +To see how we do this, let's take a look at the left side of the join definition (taken from [Quickstart Training Set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py)). ```python source = Source( diff --git a/docs/source/authoring_features/Source.md b/docs/source/authoring_features/Source.md index cc1f951462..49a6041e86 100644 --- a/docs/source/authoring_features/Source.md +++ b/docs/source/authoring_features/Source.md @@ -18,7 +18,7 @@ All sources are basically composed of the following pieces*: ## Streaming EventSource -Taken from the [returns.py](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/returns.py) example GroupBy in the quickstart tutorial. +Taken from the [returns.py](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py) example GroupBy in the quickstart tutorial. ```python source = Source( @@ -84,7 +84,7 @@ As you can see, a pre-requisite to using the streaming `EntitySource` is a chang ## Batch EntitySource -Taken from the [users.py](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/users.py) example GroupBy in the quickstart tutorial. +Taken from the [users.py](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py) example GroupBy in the quickstart tutorial. 
```python source = Source( diff --git a/docs/source/authoring_features/StagingQuery.md b/docs/source/authoring_features/StagingQuery.md index ded0b6b514..a3e8530649 100644 --- a/docs/source/authoring_features/StagingQuery.md +++ b/docs/source/authoring_features/StagingQuery.md @@ -57,9 +57,9 @@ v1 = Join( ``` Note: The output namespace of the staging query is dependent on the metaData value for output_namespace. By default, the -metadata is extracted from [teams.json](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/teams.json) (or default team if one is not set). +metadata is extracted from [teams.json](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/teams.json) (or default team if one is not set). -**[See more configuration examples here](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/staging_queries)** +**[See more configuration examples here](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/staging_queries)** ## Date Logic and Template Parameters diff --git a/docs/source/getting_started/Tutorial.md b/docs/source/getting_started/Tutorial.md index 1ddfa9b32a..485e73f898 100644 --- a/docs/source/getting_started/Tutorial.md +++ b/docs/source/getting_started/Tutorial.md @@ -19,7 +19,7 @@ Does not include: ## Setup -To get started with the Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/master/docker-compose.yml) file and run it locally: +To get started with the Chronon, all you need to do is download the [docker-compose.yml](https://github.com/airbnb/chronon/blob/main/docker-compose.yml) file and run it locally: ```bash curl -o docker-compose.yml https://chronon.ai/docker-compose.yml @@ -34,7 +34,7 @@ In this example, let's assume that we're a large online retailer, and we've dete ## Raw data sources -Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/data) directory. 
It includes four tables: +Fabricated raw data is included in the [data](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/data) directory. It includes four tables: 1. Users - includes basic information about users such as account created date; modeled as a batch data source that updates daily 2. Purchases - a log of all purchases by users; modeled as a log table with a streaming (i.e. Kafka) event-bus counterpart @@ -101,11 +101,11 @@ v1 = GroupBy( ) ``` -See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data). +See the whole code file here: [purchases GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/purchases.py). This is also in your docker image. We'll be running computation for it and the other GroupBys in [Step 3 - Backfilling Data](#step-3---backfilling-data). **Feature set 2: Returns data features** -We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example. +We perform a similar set of aggregations on returns data in the [returns GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/returns.py). The code is not included here because it looks similar to the above example. **Feature set 3: User data features** @@ -127,7 +127,7 @@ v1 = GroupBy( ) ``` -Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/group_bys/quickstart/users.py). +Taken from the [users GroupBy](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/group_bys/quickstart/users.py). 
### Step 2 - Join the features together @@ -160,7 +160,7 @@ v1 = Join( ) ``` -Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/joins/quickstart/training_set.py). +Taken from the [training_set Join](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/joins/quickstart/training_set.py). The `left` side of the join is what defines the timestamps and primary keys for the backfill (notice that it is built on top of the `checkout` event, as dictated by our use case). @@ -330,4 +330,4 @@ Using chronon for your feature engineering work simplifies and improves your ML 4. Chronon exposes easy endpoints for feature fetching. 5. Consistency is guaranteed and measurable. -For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/master?tab=readme-ov-file#benefits-of-chronon-over-other-approaches). +For a more detailed view into the benefits of using Chronon, see [Benefits of Chronon documentation](https://github.com/airbnb/chronon/tree/main?tab=readme-ov-file#benefits-of-chronon-over-other-approaches). diff --git a/docs/source/setup/Data_Integration.md b/docs/source/setup/Data_Integration.md index 6d0ff7985e..ce10523201 100644 --- a/docs/source/setup/Data_Integration.md +++ b/docs/source/setup/Data_Integration.md @@ -10,11 +10,11 @@ Chronon jobs require Spark to run. If you already have a spark environment up an ## Configuring Spark -To configure Chronon to run on spark, you just need a `spark_submit.sh` script that can be used in Chronon's [`run.py`](https://github.com/airbnb/chronon/blob/master/api/py/ai/chronon/repo/run.py) Python script (this is the python-based CLI entry point for all jobs). 
+To configure Chronon to run on spark, you just need a `spark_submit.sh` script that can be used in Chronon's [`run.py`](https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/repo/run.py) Python script (this is the python-based CLI entry point for all jobs). -We recommend putting your `spark_submit.sh` within a `scripts/` subdirectory of your main `chronon` directory (see [Developer Setup docs](./Developer_Setup.md) for how to setup the main `chronon` directory.). If you do that, then you can use `run.py` as-is, as that is the [default location](https://github.com/airbnb/chronon/blob/master/api/py/ai/chronon/repo/run.py#L483) for `spark_submit.sh`. +We recommend putting your `spark_submit.sh` within a `scripts/` subdirectory of your main `chronon` directory (see [Developer Setup docs](./Developer_Setup.md) for how to setup the main `chronon` directory.). If you do that, then you can use `run.py` as-is, as that is the [default location](https://github.com/airbnb/chronon/blob/main/api/py/ai/chronon/repo/run.py#L483) for `spark_submit.sh`. -You can see an example `spark_submit.sh` script used by the quickstart guide here: [Quickstart example spark_submit.sh](https://github.com/airbnb/chronon/blob/master/api/py/test/sample/scripts/spark_submit.sh). +You can see an example `spark_submit.sh` script used by the quickstart guide here: [Quickstart example spark_submit.sh](https://github.com/airbnb/chronon/blob/main/api/py/test/sample/scripts/spark_submit.sh). Note that this replies on an environment variable set in the `docker-compose.yml` which basically just points `$SPARK_SUBMIT` variable to the system level `spark-submit` binary. diff --git a/docs/source/setup/Developer_Setup.md b/docs/source/setup/Developer_Setup.md index 26ec17d8e6..4311bd85ad 100644 --- a/docs/source/setup/Developer_Setup.md +++ b/docs/source/setup/Developer_Setup.md @@ -34,7 +34,7 @@ Key points: 2. 
There are `group_bys` and `joins` subdirectories inside the root directory, under which there are team directories. Note that the team directory names must match what is within `teams.json` 3. Within each of these team directories are the actual user-written chronon files. Note that there can be sub-directories within each team directory for organization if desired. -For an example setup of this directory, see the [Sample](https://github.com/airbnb/chronon/tree/master/api/py/test/sample) that is also mounted to the docker image that is used in the Quickstart guide. +For an example setup of this directory, see the [Sample](https://github.com/airbnb/chronon/tree/main/api/py/test/sample) that is also mounted to the docker image that is used in the Quickstart guide. You can also use the following command to create a scratch directory from your `cwd`: diff --git a/docs/source/setup/Online_Integration.md b/docs/source/setup/Online_Integration.md index 51b3b36cda..dfb0153686 100644 --- a/docs/source/setup/Online_Integration.md +++ b/docs/source/setup/Online_Integration.md @@ -10,11 +10,11 @@ This integration gives Chronon the ability to: ## Example -If you'd to start with an example, please refer to the [MongoDB Implementation in the Quickstart Guide](https://github.com/airbnb/chronon/tree/master/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online). This provides a complete working example of how to integrate Chronon with MongoDB. +If you'd to start with an example, please refer to the [MongoDB Implementation in the Quickstart Guide](https://github.com/airbnb/chronon/tree/main/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online). This provides a complete working example of how to integrate Chronon with MongoDB. ## Components -**KVStore**: The biggest part of the API implementation is the [KVStore](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Api.scala#L43). 
+**KVStore**: The biggest part of the API implementation is the [KVStore](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Api.scala#L43).
 ```scala
 object KVStore {
@@ -47,11 +47,11 @@ trait KVStore {
 There are three functions to implement as part of this integration:
 1. `create`: which takes a string and creates a new database/dataset with that name.
-2. `multiGet`: which takes a `Seq` of [`GetRequest`](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Api.scala#L33) and converts them into a `Future[Seq[GetResponse]]` by querying the underlying KVStore.
-3. `multiPut`: which takes a `Seq` of [`PutRequest`](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Api.scala#L38) and converts them into `Future[Seq[Boolean]]` (success/fail) by attempting to insert them into the underlying KVStore.
+2. `multiGet`: which takes a `Seq` of [`GetRequest`](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Api.scala#L33) and converts them into a `Future[Seq[GetResponse]]` by querying the underlying KVStore.
+3. `multiPut`: which takes a `Seq` of [`PutRequest`](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Api.scala#L38) and converts them into `Future[Seq[Boolean]]` (success/fail) by attempting to insert them into the underlying KVStore.
 4. `bulkPut`: to upload a hive table into your kv store. It takes the table name and partitions as `String`s as well as the dataset as a `String`. If you have another mechanism (like an airflow upload operator) to upload data from hive into your kv stores you don't need to implement this method.
-See the [MongoDB example here](https://github.com/airbnb/chronon/blob/master/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/MongoKvStore.scala).
+See the [MongoDB example here](https://github.com/airbnb/chronon/blob/main/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/MongoKvStore.scala).
 **StreamDecoder**: This is responsible for "decoding" or converting the raw values that Chronon streaming jobs will read into events that it knows how to process.
@@ -98,12 +98,12 @@ Chronon has a type system that can map to Spark's or Avro's type system. Schema
 | StructType | Array[Any] |
-See the [Quickstart example here](https://github.com/airbnb/chronon/blob/master/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/QuickstartMutationDecoder.scala).
+See the [Quickstart example here](https://github.com/airbnb/chronon/blob/main/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/QuickstartMutationDecoder.scala).
-**API:** The main API that requires implementation is [API](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Api.scala#L151). This combines the above implementations with other client and logging configuration.
+**API:** The main API that requires implementation is [API](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Api.scala#L151). This combines the above implementations with other client and logging configuration.
-[ChrononMongoOnlineImpl](https://github.com/airbnb/chronon/blob/master/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/ChrononMongoOnlineImpl.scala) Is an example implemenation of the API.
+[ChrononMongoOnlineImpl](https://github.com/airbnb/chronon/blob/main/quickstart/mongo-online-impl/src/main/scala/ai/chronon/quickstart/online/ChrononMongoOnlineImpl.scala) Is an example implemenation of the API.
 Once you have the api object you can build a fetcher class using the api object like so
diff --git a/docs/source/setup/Orchestration.md b/docs/source/setup/Orchestration.md
index 908b6de589..ca8829f1e9 100644
--- a/docs/source/setup/Orchestration.md
+++ b/docs/source/setup/Orchestration.md
@@ -6,29 +6,29 @@ Airflow is currently the best supported method for orchestration, however, other
 ## Airflow Integration
-See the [Airflow Directory](https://github.com/airbnb/chronon/tree/master/airflow) for initial boilerplate code.
+See the [Airflow Directory](https://github.com/airbnb/chronon/tree/main/airflow) for initial boilerplate code.
 The files in this directory can be used to create the following Chronon Airflow DAGs.
-1. GroupBy DAGs, created by [group_by_dag_constructor.py](https://github.com/airbnb/chronon/tree/master/airflow/group_by_dag_constructor.py):
+1. GroupBy DAGs, created by [group_by_dag_constructor.py](https://github.com/airbnb/chronon/tree/main/airflow/group_by_dag_constructor.py):
    1. `chronon_batch_dag_{team_name}`: One DAG per team that uploads snapshots of computed features to the KV store for online group_bys, and frontfills daily snapshots for group_bys.
    2. `chronon_streaming_dag_{team_name}`: One DAG per team that runs Streaming jobs for `online=True, realtime=True` GroupBys. These tasks run every 15 minutes and are configured to "keep alive" streaming jobs (i.e. do nothing if running, else attempt restart if dead/not started).
-2. Join DAGs, created by [join_dag_constructor.py](https://github.com/airbnb/chronon/tree/master/airflow/join_dag_constructor.py):
+2. Join DAGs, created by [join_dag_constructor.py](https://github.com/airbnb/chronon/tree/main/airflow/join_dag_constructor.py):
    1. `chronon_join_{join_name}`: One DAG per join that performs backfill and daily frontfill of join data to the offline Hive table.
-3. Staging Query DAGs, created by [staging_query_dag_constructor.py](https://github.com/airbnb/chronon/tree/master/airflow/staging_query_dag_constructor.py):
+3. Staging Query DAGs, created by [staging_query_dag_constructor.py](https://github.com/airbnb/chronon/tree/main/airflow/staging_query_dag_constructor.py):
    1. `chronon_staging_query_{team_name}`: One DAG per team that creates daily jobs for each Staging Query for the team.
-4. Online/Offline Consistency Check DAGs, created by [online_offline_consistency_dag_constructor.py](https://github.com/airbnb/chronon/tree/master/airflow/online_offline_consistency_dag_constructor.py):
+4. Online/Offline Consistency Check DAGs, created by [online_offline_consistency_dag_constructor.py](https://github.com/airbnb/chronon/tree/main/airflow/online_offline_consistency_dag_constructor.py):
    1. `chronon_online_offline_comparison_{join_name}`: One DAG per join that computes the consistency of online serving data vs offline data for that join, and outputs the measurements to a stats table for each join that is configured. Note that logging must be enabled for this pipeline to work.
 To deploy this to your airflow environment, first copy everything in this directory over to your Airflow directory (where your other DAG files live), then set the following configurations:
-1. Set your configuration variables in [constants.py](https://github.com/airbnb/chronon/tree/master/airflow/constants.py).
-2. Implement the `get_kv_store_upload_operator` function in [helpers.py](https://github.com/airbnb/chronon/tree/master/airflow/helpers.py). **This is only required if you want to use Chronon online serving**.
+1. Set your configuration variables in [constants.py](https://github.com/airbnb/chronon/tree/main/airflow/constants.py).
+2. Implement the `get_kv_store_upload_operator` function in [helpers.py](https://github.com/airbnb/chronon/tree/main/airflow/helpers.py). **This is only required if you want to use Chronon online serving**.
 ## Alternate Integrations
-While Airflow is currently the most well-supported integration, there is no reason why you couldn't choose a different orchestration engine to power the above flows. If you're interested in such an integration and you think that the community might benefit from your work, please consider [contributing](https://github.com/airbnb/chronon/blob/master/CONTRIBUTE.md) back to the project.
+While Airflow is currently the most well-supported integration, there is no reason why you couldn't choose a different orchestration engine to power the above flows. If you're interested in such an integration and you think that the community might benefit from your work, please consider [contributing](https://github.com/airbnb/chronon/blob/main/CONTRIBUTE.md) back to the project.
 If you have questions about how to approach a different integration, feel free to ask for help in the [community Discord channel](https://discord.gg/GbmGATNqqP).
diff --git a/docs/source/test_deploy_serve/Serve.md b/docs/source/test_deploy_serve/Serve.md
index 001f04ce42..e967447c8a 100644
--- a/docs/source/test_deploy_serve/Serve.md
+++ b/docs/source/test_deploy_serve/Serve.md
@@ -4,15 +4,15 @@ The main way to serve production Chronon data online is with the Java or Scala F
 The main online Java Fetcher libraries can be found here:
-1. [`JavaFetcher`](https://github.com/airbnb/chronon/blob/master/online/src/main/java/ai/chronon/online/JavaFetcher.java)
-2. [`JavaRequest`](https://github.com/airbnb/chronon/blob/master/online/src/main/java/ai/chronon/online/JavaRequest.java)
-3. [`JavaResponse`](https://github.com/airbnb/chronon/blob/master/online/src/main/java/ai/chronon/online/JavaResponse.java)
+1. [`JavaFetcher`](https://github.com/airbnb/chronon/blob/main/online/src/main/java/ai/chronon/online/JavaFetcher.java)
+2. [`JavaRequest`](https://github.com/airbnb/chronon/blob/main/online/src/main/java/ai/chronon/online/JavaRequest.java)
+3. [`JavaResponse`](https://github.com/airbnb/chronon/blob/main/online/src/main/java/ai/chronon/online/JavaResponse.java)
 And their scala counterparts:
-1. [`Fetcher`](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Fetcher.scala)
-2. [`Request`](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Fetcher.scala#L39)
-3. [`Response`](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/Fetcher.scala#L48)
+1. [`Fetcher`](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala)
+2. [`Request`](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala#L39)
+3. [`Response`](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/Fetcher.scala#L48)
 Example Implementation
diff --git a/proposals/CHIP-1.md b/proposals/CHIP-1.md
index b5a05c369f..d2f0de409e 100644
--- a/proposals/CHIP-1.md
+++ b/proposals/CHIP-1.md
@@ -50,7 +50,7 @@ The caches will be configured on a per-GroupBy basis, i.e. two caches per GroupB
 Caching will be an opt-in feature that can be enabled by Chronon developers.
-Most of the code changes are in [FetcherBase.scala](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/FetcherBase.scala).
+Most of the code changes are in [FetcherBase.scala](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/FetcherBase.scala).
 ### Batch Caching Details
@@ -144,7 +144,7 @@ The size of the cache should ideally be set in terms of maximum memory usage (e.
 ### Step 1: BatchIr Caching
-We start by caching the conversion from `batchBytes` to `FinalBatchIr` (the [toBatchIr function in FetcherBase](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/FetcherBase.scala#L102)) and `Map[String, AnyRef]`.
+We start by caching the conversion from `batchBytes` to `FinalBatchIr` (the [toBatchIr function in FetcherBase](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/FetcherBase.scala#L102)) and `Map[String, AnyRef]`.
 To make testing easier, we'll disable this feature by default and enable it via Java Args.
@@ -166,7 +166,7 @@ Results: will add
 ### Step 3: `TiledIr` Caching
-The second step is caching [tile bytes to TiledIr](https://github.com/airbnb/chronon/blob/master/online/src/main/scala/ai/chronon/online/TileCodec.scala#L77C67-L77C67). This is only possible if the tile bytes contain information about whether a tile is complete (i.e. it won’t be updated anymore). The Flink side marks tiles as complete.
+The second step is caching [tile bytes to TiledIr](https://github.com/airbnb/chronon/blob/main/online/src/main/scala/ai/chronon/online/TileCodec.scala#L77C67-L77C67). This is only possible if the tile bytes contain information about whether a tile is complete (i.e. it won’t be updated anymore). The Flink side marks tiles as complete.
 This cache can be "monoid-aware". Instead of storing multiple consecutive tiles for a given time range, we combine the tiles and store a single, larger tile in memory. For example, we combine two tiles, [0, 1) and [1, 2), into one, [0, 2).