
build: Cross-compile to Scala 2.12 #726

Closed
nightscape wants to merge 1 commit into master from scala_2.12

Conversation

nightscape
Contributor

@nightscape nightscape commented Oct 27, 2019

Done:

To Do:

  • Make tests succeed
  • Test doc generation
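The cross-compilation itself is typically driven by sbt's `crossScalaVersions` setting; a minimal sketch (the patch versions are assumptions, not necessarily what this PR uses):

```scala
// build.sbt sketch: build against both Scala lines.
// Patch versions are illustrative; match them to Spark's own Scala versions.
crossScalaVersions := Seq("2.11.12", "2.12.10")
```

With this in place, `sbt +compile` / `sbt +test` run each task once per listed Scala version, and `sbt ++2.12.10 test` targets a single one.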

@welcome

welcome bot commented Oct 27, 2019

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!
Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

Make sure to check out the developer guide for guidance on testing your change.
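The prefix convention above can be checked locally before pushing; here is a minimal sketch (the function name is hypothetical, and the prefix list simply mirrors the examples above):

```shell
# check_semantic: verify a commit message starts with a semantic prefix.
# Prefix list mirrors the examples above; extend it if the project adds more.
check_semantic() {
  case "$1" in
    fix:*|feat:*|docs:*|build:*|perf:*|refactor:*|style:*|test:*) return 0 ;;
    *) return 1 ;;
  esac
}

check_semantic "build: Cross-compile to Scala 2.12" && echo "ok"
# → ok
```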

@msftclas

msftclas commented Oct 27, 2019

CLA assistant check
All CLA requirements met.

@nightscape nightscape changed the title WIP: Cross-compile to Scala 2.12 WIP: build: Cross-compile to Scala 2.12 Oct 27, 2019
@nightscape
Contributor Author

@drdarshan @mhamilton723 a lot of tests are failing because there seem to be missing datasets, possibly/probably caused by this

ERROR: The subscription of 'ce1dee05-8cf6-4ad6-990a-9c80868800ba' doesn't exist in cloud 'AzureCloud'.

I assume this subscription requires permissions to log in and retrieve the datasets, correct?
If so, is there a way I can get the datasets to test locally, or do you have to test this in your environment?

@mhamilton723
Collaborator

Hey @nightscape thanks so much for this great PR.

Datasets can be downloaded from
https://mmlspark.blob.core.windows.net/installers/datasets-2019-05-02.tgz
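For local testing, the archive can be fetched and unpacked with something like the following (the target directory is an assumption — check the developer guide for where the build actually expects the datasets):

```shell
# fetch_datasets: download and unpack the MMLSpark test datasets.
# Usage: fetch_datasets [target-dir]; requires curl and network access.
fetch_datasets() {
  dir="${1:-$HOME/.mmlspark/datasets}"
  url="https://mmlspark.blob.core.windows.net/installers/datasets-2019-05-02.tgz"
  mkdir -p "$dir"
  curl -fsSL "$url" | tar -xz -C "$dir"
}
```

Calling `fetch_datasets` with no argument unpacks into `~/.mmlspark/datasets`.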

@mhamilton723
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented Oct 29, 2019

Codecov Report

Merging #726 into master will increase coverage by 2.84%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #726      +/-   ##
==========================================
+ Coverage   82.35%   85.20%   +2.84%     
==========================================
  Files         189      188       -1     
  Lines        8693     8643      -50     
  Branches      517      532      +15     
==========================================
+ Hits         7159     7364     +205     
+ Misses       1534     1279     -255     
Impacted Files Coverage Δ
...com/microsoft/ml/spark/cognitive/AzureSearch.scala 87.87% <100.00%> (ø)
...icrosoft/ml/spark/io/binary/BinaryFileFormat.scala 97.75% <100.00%> (+0.02%) ⬆️
.../microsoft/ml/spark/io/powerbi/PowerBIWriter.scala 88.57% <100.00%> (ø)
...ql/execution/streaming/DistributedHTTPSource.scala 84.35% <100.00%> (ø)
...om/microsoft/ml/spark/io/http/SharedVariable.scala 66.66% <0.00%> (-16.67%) ⬇️
...com/microsoft/ml/spark/cognitive/RESTHelpers.scala 35.00% <0.00%> (-15.00%) ⬇️
...n/scala/org/apache/spark/ml/param/ArrayParam.scala 60.00% <0.00%> (-10.00%) ⬇️
...ain/scala/com/microsoft/ml/spark/nn/BallTree.scala 82.85% <0.00%> (-4.77%) ⬇️
.../microsoft/ml/spark/core/schema/Categoricals.scala 82.10% <0.00%> (-4.36%) ⬇️
...oft/ml/spark/recommendation/RankingEvaluator.scala 88.00% <0.00%> (-4.31%) ⬇️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 840781a...540554a. Read the comment docs.

@nightscape
Contributor Author

Hi @mhamilton723, I had a look at the errors in the build, but at first glance I don’t see how they could be caused by the Scala 2.12 switch.
Unfortunately, I’ve been pulled into other topics and can’t spend much more time on this.
Would it be ok to chicken out here?

@mhamilton723
Collaborator

@nightscape, appreciate you taking it this far. Just rebased on top of some of the latest test-flakiness fixes, so we'll see how it goes :)

@mhamilton723
Copy link
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@nightscape nightscape marked this pull request as draft April 21, 2020 16:22
@nightscape nightscape changed the title WIP: build: Cross-compile to Scala 2.12 build: Cross-compile to Scala 2.12 Apr 21, 2020
@nightscape
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 726 in repo Azure/mmlspark

@imatiach-msft
Contributor

/azp run

@azure-pipelines

Pull request contains merge conflicts.

@nightscape
Contributor Author

For some reason, GitHub didn't pick up the changes I made to my branch.
Possibly caused by the recent hiccups: https://www.githubstatus.com/
Will try changing something and force-pushing again.

@nightscape
Contributor Author

@imatiach-msft, the build seems stuck...
Could you trigger it again?
Thank you!

@imatiach-msft
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@imatiach-msft
Contributor

done!

@imatiach-msft
Contributor

seems like a bunch of our dependencies don't have 2.12 versions yet?

@nightscape
Contributor Author

@imatiach-msft right. I assume this is a new dependency; as far as I remember, I had this branch compiling locally before I rebased on master.
Will try to get a 2.12 build for https://github.com/linkedin/isolation-forest

@nightscape
Contributor Author

Asked for a 2.12 build here: linkedin/isolation-forest#14

@imatiach-msft
Contributor

also adding @eisber for the isolation forest dependency

@imatiach-msft
Contributor

I wonder if there is some way to make this an optional dependency for the Scala 2.12 build for now...
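One way to sketch that in sbt is to gate the dependency on the Scala binary version (the coordinates and version number below are illustrative, not the project's actual ones):

```scala
// build.sbt sketch: include isolation-forest only on Scala 2.11
// until a 2.12 artifact is published. Coordinates are illustrative.
libraryDependencies ++= {
  CrossVersion.partialVersion(scalaVersion.value) match {
    case Some((2, 11)) =>
      Seq("com.linkedin.isolation-forest" % "isolation-forest_2.4.3_2.11" % "0.3.0")
    case _ =>
      Seq.empty // Scala 2.12: code using it would need to be excluded or stubbed
  }
}
```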

@eisber
Collaborator

eisber commented Apr 24, 2020

@jverbus should be able to help with the dependency. He built the artifacts for Spark 2.3.0 and 2.4.3.

https://search.maven.org/search?q=g:com.linkedin.isolation-forest%20AND%20a:isolation-forest_2.4.3_2.10

@nightscape nightscape force-pushed the scala_2.12 branch 2 times, most recently from 96dea67 to 269912f Compare June 2, 2020 07:46
@nightscape nightscape marked this pull request as ready for review June 2, 2020 07:50
@nightscape
Contributor Author

@imatiach-msft @eisber I updated to the newly released isolation-forest version for Scala 2.12 (thanks @jverbus !) and rebased on master.
Unfortunately, I don't see any results from the CI run. Does that have to be triggered manually?

@eisber
Collaborator

eisber commented Jun 4, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@nightscape
Contributor Author

@eisber unfortunately I still don't see the exact error in the Azure pipeline.
Would you be so kind as to paste the relevant parts here?
Thank you!

@eisber
Collaborator

eisber commented Jun 15, 2020

@nightscape sorry for the long delay

Info Provided - Suite ClassBalancerSuite took 4.357s
[info] - yield proper weights
[info] + Test yield proper weights took 0.844s
Info Provided - Shutting down spark session
[info] CacherSuite:
20/06/04 12:10:09 WARN CacheManager: Asked to cache already cached data.
20/06/04 12:10:10 WARN CacheManager: Asked to cache already cached data.
20/06/04 12:10:10 WARN CacheManager: Asked to cache already cached data.
[info] - Serialization Fuzzing
[info] + Creating a spark session for suite CacherSuite
[info] + Test Serialization Fuzzing took 0.416s
20/06/04 12:10:10 WARN CacheManager: Asked to cache already cached data.
[info] - Experiment Fuzzing
[info] + Test Experiment Fuzzing took 0.002s
20/06/04 12:10:10 WARN CacheManager: Asked to cache already cached data.
[info] - Be the identity operation
[info] + Test Be the identity operation took 0.027s
Info Provided - Suite CacherSuite took 0.445s
Info Provided - Shutting down spark session
[info] SummarizeDataSuite:
20/06/04 12:10:10 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
[info] - Serialization Fuzzing
[info] + Creating a spark session for suite SummarizeDataSuite
[info] + Test Serialization Fuzzing took 7.856s
[info] - Experiment Fuzzing
[info] + Test Experiment Fuzzing took 1.413s
[info] - Smoke test for summarizing basic DF - schema transform
[info] + Test Smoke test for summarizing basic DF - schema transform took 0.017s
[info] - Smoke test for summary params
[info] + Test Smoke test for summary params took 0.007s
[info] - Smoke test for summarizing basic DF
[info] + Test Smoke test for summarizing basic DF took 1.611s
[info] - Smoke test for summarizing missings DF
[info] + Test Smoke test for summarizing missings DF took 1.552s
[info] - Smoke test for subset summarizing missings DF
[info] + Test Smoke test for subset summarizing missings DF took 0.47s
Alert Provided - Suite SummarizeDataSuite took 12.926s
Info Provided - Shutting down spark session
[info] RepartitionSuite:
[info] - Serialization Fuzzing
[info] + Creating a spark session for suite RepartitionSuite
[info] + Test Serialization Fuzzing took 0.389s
[info] - Experiment Fuzzing
[info] + Test Experiment Fuzzing took 0.002s
[info] - Work for several values of n
[info] + Test Work for several values of n took 0.058s
[info] - Should allow a user to set the partitions specifically in pipeline transform
[info] + Test Should allow a user to set the partitions specifically in pipeline transform took 0.058s
Info Provided - Suite RepartitionSuite took 0.507s
Info Provided - Shutting down spark session
[info] BatchIteratorSuite:
ratio: 1.0039155890709441, Batched: 2338ms, normal: 2329ms

[error] Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
[error] OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/mml-natives8324893102437100607/libopencv_java320.so which might have disabled stack guard. The VM will try to fix the stack guard now.
[error] It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
[info] Could not generate getters and setters for class com.microsoft.ml.spark.automl.BestModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.automl.TuneHyperparametersModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.featurize.AssembleFeaturesModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.featurize.CleanMissingDataModel due to no default constructor
[info] Could not generate getters and setters for class org.apache.spark.ml.PipelineModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.featurize.text.TextFeaturizerModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.isolationforest.IsolationForestModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.lightgbm.LightGBMClassificationModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.lightgbm.LightGBMRankerModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.lightgbm.LightGBMRegressionModel due to no default constructor
[info] Could not generate getters and setters for class org.apache.spark.ml.PipelineModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.stages.TimerModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.train.TrainedClassifierModel due to no default constructor
[info] Could not generate getters and setters for class com.microsoft.ml.spark.train.TrainedRegressorModel due to no default constructor

@nightscape
Contributor Author

nightscape commented Jun 15, 2020

Hmm, the only errors I see are the ones about OpenCV:

[error] Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
[error] OpenJDK 64-Bit Server VM warning: You have loaded library /tmp/mml-natives8324893102437100607/libopencv_java320.so which might have disabled stack guard. The VM will try to fix the stack guard now.
[error] It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.

I honestly don't have any clue what to do with them.
Could anyone from the Azure team take it from here
@mhamilton723 @imatiach-msft @eisber ?

@utsavavlino

@mhamilton723 any update on the status? Will there be Scala 2.12 support for mmlspark soon?

@yosit

yosit commented Sep 22, 2020

Any chance to push this forward?

@yosit

yosit commented Oct 8, 2020

@nightscape a small fix to your branch would be to:
sed -i 's/scala-2.11/scala-2.12/' src/it/scala/com/microsoft/ml/spark/codegen/CodegenConfig.scala

@nightscape
Contributor Author

Overall I think this needs to be pushed over the line by someone from the Azure team like @eisber, @imatiach-msft or @mhamilton723.
If someone from the Azure team tells me "Look, make these 3 changes and we'll take it from there", I'd be happy to.
But right now I won't invest more time until there is a commitment to actually get this merged.

@yosit

yosit commented Oct 10, 2020

@nightscape Thank you for your contribution!

@eisber , @imatiach-msft , @mhamilton723 - We have been using this successfully at ZipRecruiter.
Happy to see these small changes merged.

@yosit yosit mentioned this pull request Oct 13, 2020
@srowen

srowen commented Oct 30, 2020

Hi @imatiach-msft - greetings from Spark-land. I know a few of our customers love mmlspark and would love to use it on Spark 3. I see this and a related PR basically have that working. If I can help review, glad to - they'd love to get it released when you have a moment!

@hanbing1587

hanbing1587 commented Nov 9, 2020

@nightscape I tried using "sbt clean package" to build mmlspark from this branch, which generates a jar file, but I can't use the Python API for LightGBM. The error is "ModuleNotFoundError: No module named 'mmlspark.lightgbm._LightGBMClassifier'". I suspect my build command is incorrect. What is the correct command to build mmlspark with the Python API?

@nightscape
Contributor Author

@hanbing1587 unfortunately I don't know either...
I just changed the Scala version settings and fixed all compile errors.

@juanpaulo

related to #912?

@nightscape
Contributor Author

Definitely. I have no idea why the other PR got closed though, or why this one doesn't receive any love...
The Open Source part of this repo leaves room for improvement 😝

@imatiach-msft
Contributor

closing as we already support scala 2.12
