feat: Add command to load test tracking events
bmtcril committed Mar 20, 2024
1 parent 1e04acd commit ed5057d
Showing 14 changed files with 995 additions and 3 deletions.
5 changes: 5 additions & 0 deletions docs/how-tos/index.rst
@@ -1,2 +1,7 @@
How-tos
#######

.. toctree::
   :maxdepth: 2

   load_test_tracking_pipeline
145 changes: 145 additions & 0 deletions docs/how-tos/load_test_tracking_pipeline.rst
@@ -0,0 +1,145 @@
Load Testing Tracking Event Pipelines
#####################################

Aspects has several ways of getting learning traces from tracking events to xAPI statements and then into ClickHouse. To emulate them all realistically for performance testing, we've created two management commands that are available in edx-platform when ``platform-plugin-aspects`` is installed:

``load_test_tracking_events`` generates tracking events by creating a test course, a configured number of test users, and enrolling / unenrolling them in a loop. There are options to run a fixed number of events, or to run until killed. There is also a configurable sleep time that can be calibrated for long soak tests, or set to 0 to run in a tight loop. `Note: This needs to be run in CMS as it uses CMS APIs to create the course!`

All options and defaults are described by running:

``tutor local run cms ./manage.py cms load_test_tracking_events --help``


``monitor_load_test_tracking`` gathers performance data on the given system while a test is running and stores the data in ClickHouse for later reporting. A configurable sleep time allows you to determine how often stats are collected. The supported backends are:

- ``celery``
- ``vector``
- ``redis_bus``
- ``kafka_bus``

All options and defaults are described by running:

``tutor local run lms ./manage.py lms monitor_load_test_tracking --help``

Each backend stores different statistics, since each has unique properties. The most important stats are logged to the console for easy monitoring of the test. The data points we're most interested in capturing are:

- How old is the most recent xAPI event in ClickHouse? What is the lag from event generation to being queryable?
- How far behind is the pipeline? How many events per second can this configuration handle? When will we run out of storage and break the system?
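The first of those data points reduces to a timestamp difference: the ``lag_seconds`` figure the monitor reports is just the gap between ``now()`` and the newest xAPI event in ClickHouse. A minimal sketch of that calculation (the function name here is ours, not part of the plugin):

```python
from datetime import datetime


def lag_seconds(most_recent_event: str, now: datetime) -> int:
    """Seconds between `now` and the newest xAPI event timestamp in ClickHouse."""
    newest = datetime.fromisoformat(most_recent_event)
    return int((now - newest).total_seconds())


# An event written 5 seconds before the sample was taken:
print(lag_seconds("2024-03-12 14:46:30.000000", datetime(2024, 3, 12, 14, 46, 35)))  # 5
```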


Running a test
--------------

**This test will create persistent data in your environment! Do not run it on a production server!**

#. Make sure you have an isolated test system that is configured in a way that you understand. The point of the test is to stress the system, so using shared infrastructure will skew the test results or potentially cause outages.

#. Note which backend your system is configured to use; by default Open edX uses Celery.

#. Make sure this plugin is installed and activated, configuration has been saved, and you have rebuilt the ``openedx`` image.

#. Start up your environment and let it reach a steady state. You may wish to log in and visit the course homepage to make sure startup is complete; otherwise the first events will be artificially slow.

#. Start the monitor. This command will pull ClickHouse lag and Celery queue lengths every 5 seconds: ``tutor local run lms ./manage.py lms monitor_load_test_tracking --backend celery --sleep_time 5``

#. Start the test. This command will create a test course and 10 test users, then generate 100 events with a 0.5 second sleep in between: ``tutor local run cms ./manage.py cms load_test_tracking_events --num_users 10 --num_events 100 --sleep_time 0.5``

#. Alternatively you can run a test that continues until stopped, and configure a prefix for your usernames if you want to separate them from other tests in the database: ``tutor local run cms ./manage.py cms load_test_tracking_events --num_users 10 --run_until_killed --sleep_time 0.5 --username_prefix loadtest2_``

#. Stop the test and the monitor with ``ctrl-c``. You may want to let the monitor run until your system's queue has cleared, to get full data on how long it takes to recover from a backlog of events.

#. Check the table in ClickHouse for the results of the run: ``event_sink.load_test_stats``. Each run has a unique identifier, and each row has a timestamp. The stats themselves are stored in JSON as they differ a great deal between backends. With this information you should be able to chart a run and see how the system performs at various levels of load.
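The steps above leave you with timestamped rows of JSON stats. As a sketch of the post-processing, assuming you export ``(timestamp, stats)`` pairs from ``event_sink.load_test_stats`` (the sample rows and helper name below are hypothetical, and the exact schema may differ):

```python
import json
from datetime import datetime

# Two hypothetical monitor samples: (sample timestamp, stats JSON).
rows = [
    ("2024-03-12 14:46:30", '{"clickhouse": {"total_rows": "1000"}}'),
    ("2024-03-12 14:46:40", '{"clickhouse": {"total_rows": "1250"}}'),
]


def events_per_second(rows):
    """Approximate insert throughput between consecutive monitor samples."""
    out = []
    for (t0, s0), (t1, s1) in zip(rows, rows[1:]):
        dt = (datetime.fromisoformat(t1) - datetime.fromisoformat(t0)).total_seconds()
        delta = (
            int(json.loads(s1)["clickhouse"]["total_rows"])
            - int(json.loads(s0)["clickhouse"]["total_rows"])
        )
        out.append(delta / dt)
    return out


print(events_per_second(rows))  # [25.0]
```

Plotting a series like this against the configured sleep time shows how throughput degrades as load increases.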

Celery Notes
------------

Celery can be scaled by adding more CMS workers.

The JSON in ClickHouse for Celery looks like this::

    {
        "clickhouse": {
            "total_rows": "1273",  # Total xAPI rows in the database
            "most_recent_event": "2024-03-12 14:46:32.828206",
            "lag_seconds": "2912"  # Difference between now() and the most_recent_event
        },
        "celery": {
            "lms_queue_length": 0,  # Pending Celery tasks in the redis queue for the LMS workers
            "cms_queue_length": 0   # Pending Celery tasks in the redis queue for the CMS workers
        }
    }
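One way to use consecutive queue-length samples (a sketch, not part of the monitor command; the function name is ours) is to estimate when a backlog will drain:

```python
from typing import Optional


def drain_eta_seconds(len_then: int, len_now: int, interval: float) -> Optional[float]:
    """Estimate seconds until a queue empties, given two length samples
    taken `interval` seconds apart. Returns None if it is not shrinking."""
    rate = (len_then - len_now) / interval  # tasks cleared per second
    if rate <= 0:
        return None  # backlog is flat or still growing
    return len_now / rate


# e.g. the queue went from 500 to 400 pending tasks over a 5 second window:
print(drain_eta_seconds(500, 400, 5.0))  # 20.0
```

A `None` result during a soak test is the warning sign mentioned above: events are arriving faster than this configuration can handle.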


Vector Notes
------------

Vector scales differently depending on your deployment strategy. Note that Vector's stats differ from the other backends: we can only calculate the lag between when Vector reads a line from the logs and when it is sent to ClickHouse. There is no way to tell how far behind the log Vector is, but this lag is still useful to see whether ClickHouse insert times are slowing Vector down. Rely on the ClickHouse ``lag_seconds`` metric for a better idea of how far behind Vector is.

The JSON in ClickHouse for Vector looks like this::

    {
        "clickhouse": {
            "total_rows": "1273",  # Total xAPI rows in the database
            "most_recent_event": "2024-03-12 14:46:32.828206",
            "lag_seconds": "2912"  # Difference between now() and the most_recent_event
        },
        "vector": {
            "events_received": 20.0,
            "events_sent": 10.0,
            "lag": 10.0
        }
    }
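The ``lag`` value in the sample is simply ``events_received - events_sent``, i.e. events Vector has read from the log but not yet delivered to ClickHouse. As a trivial illustration of the relationship (sample values taken from the JSON above):

```python
vector = {"events_received": 20.0, "events_sent": 10.0}

# Events read from the tracking log but not yet delivered to ClickHouse.
in_flight = vector["events_received"] - vector["events_sent"]
print(in_flight)  # 10.0
```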



Redis Bus Notes
---------------

The redis bus can be scaled by adding more consumers.

The JSON in ClickHouse for redis bus looks like this::

    {
        "clickhouse": {
            "total_rows": "1273",  # Total xAPI rows in the database
            "most_recent_event": "2024-03-12 14:46:32.828206",
            "lag_seconds": "2912"  # Difference between now() and the most_recent_event
        },
        "redis_bus": {
            "total_events": 77,  # Total number of events added to the redis Stream
            "consumers": [
                {
                    "name": "aspects",  # Name of each consumer in the consumer group
                    "processing": 0,  # Events currently being processed by this consumer (should be 0 or 1)
                    "queue_length": 0  # Events waiting in the stream to be processed
                }
            ]
        }
    }
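A sketch of summarizing a redis bus sample (hypothetical values; note that ``queue_length`` is reported per consumer but describes events waiting in the stream, so we take the maximum rather than summing it):

```python
import json

sample = json.loads("""
{
  "redis_bus": {
    "total_events": 77,
    "consumers": [
      {"name": "aspects", "processing": 1, "queue_length": 4},
      {"name": "aspects_2", "processing": 0, "queue_length": 4}
    ]
  }
}
""")

consumers = sample["redis_bus"]["consumers"]
# Events currently claimed by any consumer in the group.
in_flight = sum(c["processing"] for c in consumers)
# Events still waiting in the stream (reported per consumer, so take the max).
backlog = max(c["queue_length"] for c in consumers)
print(in_flight, backlog)  # 1 4
```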


Kafka Bus Notes
---------------

The Kafka bus can be scaled by adding more consumers.

The JSON in ClickHouse for the Kafka bus looks like this::

    {
        "clickhouse": {
            "total_rows": "1273",  # Total xAPI rows in the database
            "most_recent_event": "2024-03-12 14:46:32.828206",
            "lag_seconds": "2912"  # Difference between now() and the most_recent_event
        },
        "kafka_bus": {
            "topic": "dev-analytics",  # The name of the Kafka topic being read
            "partitions": [
                {
                    "partition": 0,  # The index of the partition
                    "lag": 150  # Events waiting to be processed in this partition
                }
            ]
        }
    }
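Since Kafka reports lag per partition, a single backlog number for charting can be had by summing across partitions. A sketch over the JSON shape shown above (the two-partition values are hypothetical):

```python
import json

stats = json.loads("""
{
  "kafka_bus": {
    "topic": "dev-analytics",
    "partitions": [
      {"partition": 0, "lag": 150},
      {"partition": 1, "lag": 50}
    ]
  }
}
""")

# Total events waiting across all partitions of the topic.
total_lag = sum(p["lag"] for p in stats["kafka_bus"]["partitions"])
print(total_lag)  # 200
```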

@@ -0,0 +1,216 @@
"""
Generates tracking events by creating test users and fake activity.
This should never be run on a production server as it will generate a lot of
bad data. It is entirely for benchmarking purposes in load test environments.
It is also fragile due to reaching into the edx-platform testing internals.
"""

import logging
import uuid
from datetime import datetime, timedelta
from random import choice
from textwrap import dedent
from time import sleep
from typing import Any

from django.core.management.base import BaseCommand, CommandError

# For testing we won't be able to import from edx-platform
try: # pragma: no cover
from cms.djangoapps.contentstore.views.course import create_new_course_in_store
from common.djangoapps.student.helpers import do_create_account
from common.djangoapps.student.models.course_enrollment import CourseEnrollment
from openedx.core.djangoapps.user_authn.views.registration_form import (
AccountCreationForm,
)
from xmodule.modulestore import ModuleStoreEnum

RUNNING_IN_PLATFORM = True
except ImportError:
create_new_course_in_store = None
do_create_account = None
CourseEnrollment = None
AccountCreationForm = None
ModuleStoreEnum = None

RUNNING_IN_PLATFORM = False

log = logging.getLogger(__name__)


class LoadTest:
"""
Runs the load test and reports results to ClickHouse.
"""

course = None
instructor = None
users = []
sent_event_count = 0

def __init__(self, num_users: int, username_prefix: str):
course_shortname = str(uuid.uuid4())[:6]
self.instructor = self.create_user(
username=f"instructor_{course_shortname}",
name="Instructor",
password="aspects",
email=f"instructor_{course_shortname}@openedx.invalid",
)

start_date = datetime.now() - timedelta(days=7)

fields = {"start": start_date, "display_name": f"Course {course_shortname}"}

log.info(
f"""Creating course:
Instructor: {self.instructor.id}
Org: "OEX"
Number: "{course_shortname}"
Run: "2024-1"
Fields: {fields}
"""
)

self.course = create_new_course_in_store(
ModuleStoreEnum.Type.split,
self.instructor,
"OEX",
course_shortname,
"2024-1",
fields,
)

log.info(f"Created course {self.course.id}")
self.create_and_enroll_learners(num_users, username_prefix)

def create_and_enroll_learners(self, num_users, username_prefix):
"""
Uses create test users and enroll them in our test course.
"""
log.info(f"Creating {num_users} users prefixed with {username_prefix}.")

for _ in range(num_users):
user_short_name = str(uuid.uuid4())[:6]
u = self.create_user(
username=f"{username_prefix}_{user_short_name}",
name=f"Learner {user_short_name}",
password="aspects",
email=f"{user_short_name}@openedx.invalid",
)
self.users.append(u)
e = CourseEnrollment.get_or_create_enrollment(
user=u, course_key=self.course.id
)
e.is_active = True
e.save()

def create_user(self, **user_data):
"""
Create, activate, and return a user using the edx-platform API.
"""
account_creation_form = AccountCreationForm(data=user_data, tos_required=False)

user, _, _ = do_create_account(account_creation_form)
user.is_active = True
user.save()
return user

def trigger_events(
self, num_events: int, sleep_time: float, run_until_killed: bool
) -> None:
"""
Trigger the appropriate number of events based on configuration.
"""

if run_until_killed:
log.info(f"Creating events until killed with {sleep_time} sleep between!")
while True:
self.trigger_event_and_sleep(sleep_time)
else:
log.info(f"Creating {num_events} event with {sleep_time} sleep between!")
for _ in range(num_events):
self.trigger_event_and_sleep(sleep_time)

def trigger_event_and_sleep(self, sleep_time: float) -> None:
"""
Cause a tracking log to be emitted and sleep the specified amount of time.
"""
user = choice(self.users)
log.info(f"Triggering event for user {user.username}.")
e = CourseEnrollment.get_or_create_enrollment(
user=user, course_key=self.course.id
)

if e.is_active:
e.unenroll(user, self.course.id)
else:
e.enroll(user, self.course.id) # pragma: no cover

self.sent_event_count += 1
sleep(sleep_time)


class Command(BaseCommand):
"""
Create tracking log events for load testing purposes.
Example:
tutor local run lms ./manage.py lms load_test_tracking_events --sleep_time 0
"""

help = dedent(__doc__).strip()

def add_arguments(self, parser: Any) -> None:
parser.add_argument(
"--num_users",
type=int,
default=10,
help="The number of users to create. All events will be generated for these learners.",
)
parser.add_argument(
"--username_prefix",
type=str,
default="lt_",
help="Prefix for the generated user names.",
)
parser.add_argument(
"--num_events",
type=int,
default=10,
help="The number of events to generate. This is ignored if --run_until_killed is set.",
)
parser.add_argument(
"--run_until_killed",
action="store_true",
default=False,
help="If this is set, the process will run endlessly until killed.",
)
parser.add_argument(
"--sleep_time",
type=float,
default=0.75,
help="Fractional number of seconds to sleep between sending events.",
)

def handle(self, *args, **options):
"""
Create users and trigger events for them as configured above.
"""
if not RUNNING_IN_PLATFORM: # pragma: no cover
raise CommandError("This command must be run in the Open edX LMS or CMS.")

start = datetime.now()
lt = LoadTest(options["num_users"], options["username_prefix"])

try:
lt.trigger_events(
options["num_events"],
options["sleep_time"],
options["run_until_killed"],
)
except KeyboardInterrupt:
log.warning("Killed by keyboard, finishing.")

end = datetime.now()
log.info(f"Sent {lt.sent_event_count} events in {end - start}.")