- Use the library and give feedback: report bugs, request features.
- Add sample Jupyter notebooks, Python or Scala code examples, documentation pages.
- Fix bugs and issues.
- Add new features, such as data transformations or machine learning algorithms.
- Review pull requests from other contributors.
You can give feedback, report bugs and request new features anytime by opening an issue. Also, you can up-vote or comment on existing issues.
If you want to add code, examples or documentation to the repository, follow this process:
- Preferably, get started by tackling existing issues to get yourself acquainted with the library source and the process.
- Open an issue, or comment on an existing issue to discuss your contribution and design, to ensure your contribution is a good fit and doesn't duplicate on-going work.
- Any algorithm you're planning to contribute should be well known and accepted for production use, and backed by research papers.
- Algorithms should be highly scalable and suitable for very large datasets.
- All contributions need to comply with the MIT License. Contributors external to Microsoft need to sign CLA.
- Fork the MMLSpark repository.
- Implement your algorithm in Scala, using our wrapper generation mechanism to produce PySpark bindings.
- Use SparkML
PipelineStage
s so your algorithm can be used as a part of pipeline. - For parameters use
MMLParam
s. - Implement model saving and loading by extending SparkML
MLReadable
. - Use good Scala style.
- Binary dependencies should be on Maven Central.
- See this pull request for an example contribution.
- Set up build environment. Use a Linux machine or VM (we use Ubuntu, but other
distros should work too), and install environment using the
runme
script. - Test your code locally.
- Add tests using ScalaTests — unit tests are required.
- A sample notebook is required as an end-to-end test.
- Add a sample Jupyter notebook that shows the intended use case of your algorithm, with instructions in step-by-step manner. (The same notebook could be used for testing the code.)
- Add in-line ScalaDoc comments to your source code, to generate the API reference documentation
- In most cases, you should squash your commits into one.
- Open a pull request, and link it to the discussion issue you created earlier.
- An MMLSpark core team member will trigger a build to test your changes.
- Fix any build failures. (The pull request will have comments from the build with useful links.)
- Wait for code reviews from core team members and others.
- Fix issues found in code review and re-iterate.
- Wait for a core team member to merge your code in.
- Your feature will be available through a Docker image and script installation
in the next release, which typically happens around once a month. You can try
out your features sooner by using build artifacts for the version that has
your changes merged in (such versions end with a
.devN
).
If in doubt about how to do something, see how it was done in existing code or pull requests, and don't hesitate to ask.