Highlights
| <img width="800" src="https://mmlspark.blob.core.windows.net/graphics/emails/vw-blue-dark-orange.svg"> |<img width="800" src="https://mmlspark.blob.core.windows.net/graphics/emails/devops_recolor_2.svg"> | <img width="800" src="https://mmlspark.blob.core.windows.net/graphics/emails/lightgbm_on_spark.svg"> | <img width="800" src="https://mmlspark.blob.core.windows.net/graphics/emails/speech_to_text_2.svg"> |
|:--:|:--:|:--:|:--:|
| **Vowpal Wabbit on Spark** | **Quality and Build Refactor** | **LightGBM Ranking and More** | **Anomaly Detection and Speech To Text** |
| Fast, Sparse, and Scalable Text Analytics | New Azure Pipelines build with Code Coverage, CICD, and an organized package structure. | Barrier Execution mode, performance improvements, increased parameter coverage | New cognitive services on Spark |
New Features
Vowpal Wabbit on Spark: Fast and Sparse Text Analytics
- VW on Spark is a new collaboration between the [Vowpal Wabbit library](https://github.com/VowpalWabbit/vowpal_wabbit) and the Apache Spark community
- For full documentation, see the [VW on Spark Docs](https://github.com/Azure/mmlspark/blob/master/docs/vw.md)
- Added `VowpalWabbitClassifier` and `VowpalWabbitRegressor`
- Added [Vowpal Wabbit - Quantile Regression for Drug Discovery.ipynb](https://github.com/Azure/mmlspark/blob/master/notebooks/samples/Vowpal%20Wabbit%20-%20Quantile%20Regression%20for%20Drug%20Discovery.ipynb)
LightGBM on Spark
- Now supports barrier execution mode
- Added the `LightGBMRanker`
- Added `is_provide_training_metric` to `LightGBMRanker`
- Enabled continued training with init score column
- Added batch training support
- Reduced memory usage
- Fixed issues with frozen jobs
- Fixed multiclass classification issues
- Fixed issue where multiclass classification hangs due to partitions without all classes
HTTP on Spark
- Added `AnomalyDetector` and `SimpleAnomalyDetector` APIs
- Added `SpeechToText` transformer
- Improved service concurrency
- Added robustness to socket timeouts
Miscellaneous
- Codegen support for wrapping `Ranker` classes
- Notebooks now leverage public blob for faster execution
- Fixed column handling in `SummarizeData`
- Improved error messages in `ComputeModelStatistics`
- Upgraded to Spark 2.4.3
- Added Spark on Kubernetes Helm Charts
- Added `StratifiedRepartition` transformer for ensuring partitions contain all classes
- Fixed issue where `ImageFeaturizer` could not be executed on Databricks 2.4.3
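
To illustrate why partitions that contain every class matter (the motivation behind `StratifiedRepartition` and the multiclass hang fix above), here is a minimal plain-Python sketch of the idea. It is not the transformer's implementation: the function name `stratified_partitions`, the round-robin assignment, and the choice to duplicate rows of rare classes are all assumptions made for illustration.

```python
from collections import defaultdict


def stratified_partitions(rows, label_of, num_partitions):
    """Spread rows over partitions so that every partition sees every class.

    Hypothetical helper for illustration only; it does not call MMLSpark.
    """
    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    partitions = [[] for _ in range(num_partitions)]
    for members in by_label.values():
        # Assumption of this sketch: when a class has fewer rows than there
        # are partitions, its rows are repeated so no partition misses it.
        needed = max(len(members), num_partitions)
        for i in range(needed):
            partitions[i % num_partitions].append(members[i % len(members)])
    return partitions


# Example data: class 0 is common, classes 1 and 2 have a single row each.
rows = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 2)]
parts = stratified_partitions(rows, lambda r: r[1], 3)
for index, part in enumerate(parts):
    print(index, sorted({label for _, label in part}))
```

With a naive split, a partition could end up holding only class 0, which is the situation described in the multiclass hang fix; the sketch avoids it by construction.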
Build, Quality, and Infrastructure Refactor
Azure Pipelines Integration
- Tests are parallelized on Azure Pipelines. Builds now take ~25 min instead of ~90 min!
- Serverless Builds: Queue as many builds as needed with no machine maintenance costs
- Test results, error messages, and timings are viewable from the GitHub PR section
- Individual tests can be re-queued from the GitHub PR page
- Builds can be queued using the pull request comment: `/azp run`.
- Full details can be seen by typing `/azp help`
- The CI pipeline is specified entirely in a small `.yaml` file in the Git repo
<img width="600" src="https://mmlspark.blob.core.windows.net/graphics/emails/build.jpg">
Local Developer Support
- Dramatically simpler developer setup (all through SBT)
- Local developer setup now works on any platform, including Windows!
- Local setup no longer needs a VM, Vagrant, or 30 minutes to import the library
- All build stages are SBT tasks and can be done locally for rapid testing
- This includes publishing Maven packages to local repositories and the MMLSpark Maven repo
- All secrets now managed by centralized Azure Key Vault
- IntelliJ will pick up on all scalastyle rules for editor-level style feedback while typing
Code Quality Gates
- Code coverage is now measured for every PR and reported in the PR comments and the README badge
- Coverage is now a check-in gate: a PR may not decrease it
- Test coverage increased and dead code removed from the library
- Custom and auto-generated Python tests now supported
- CODEOWNERS file for better code reviews and maintenance
- Codacy integration for automated PR reviews
<img width="600" src="https://mmlspark.blob.core.windows.net/graphics/emails/codecov_2.gif">
Streamlined Library Structure
- MMLSpark now supports a true Scala/Java idiomatic package hierarchy
- Namespace hierarchy also reflected in PySpark code
- **Note: This will require changes to existing MMLSpark programs. For support in migrating, please contact `mmlspark-support@microsoft.com`**
Maintainability and Community Management
- Issue and PR templates
- Gitter channel
- Welcome bot to greet new contributors
- Semantic Commits for autogenerating release notes
- Badges to display current and master versions in the README
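
As a rough illustration of how semantic commit prefixes (such as `fix:`, `docs:`, `build:`, and `chore:` in the change list below) can drive release-note generation, the following sketch groups commit subjects by prefix. It is an assumption-laden example, not the tooling this project uses: the function `group_commits`, the regular expression, and the `other` bucket for unprefixed subjects are all hypothetical.

```python
import re
from collections import defaultdict

# A lowercase type, a colon, then the subject, e.g. "docs: Add gitter badge".
# This pattern is an assumption of the sketch, not a formal specification.
SEMANTIC_SUBJECT = re.compile(r"^([a-z]+):\s*(.+)$")


def group_commits(subjects):
    """Group commit subjects by semantic type; unprefixed ones go to 'other'."""
    groups = defaultdict(list)
    for subject in subjects:
        subject = subject.strip()
        match = SEMANTIC_SUBJECT.match(subject)
        if match:
            groups[match.group(1)].append(match.group(2))
        else:
            groups["other"].append(subject)
    return dict(groups)


# Subjects copied from the change list in these notes.
subjects = [
    "chore: bump version number",
    "docs: Add gitter badge",
    "fix: fix socket timeout error (640)",
    "Vowpal Wabbit on Spark",
]
for commit_type, items in group_commits(subjects).items():
    print(commit_type, items)
```

Older commits without a prefix fall into the `other` bucket, which is why adopting the convention consistently makes the generated notes easier to organize.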
Migration Support:
- If you already have an MMLSpark developer setup, please read the new developer guide to reconfigure it.
- If you have standing PRs that need rebasing assistance, please reach out to `mmlspark-support@microsoft.com`
- Please report any bugs or feedback!
Acknowledgements
We would like to acknowledge the developers and contributors, both internal and external, who helped create this version of MMLSpark.
- Ilya Matiach, Markus Cozowicz, Scott Graham, Daniel Ciborowski, Christina Lee, Dalitso Banda, Shaochen Shi, Sudarshan Raghunathan, Anand Raman, Eli Barzilay, Nick Gonsalves, Tao Wu, Jeremy Reynolds, Miguel Fierro, Robert Alexander, AI CAT Team, Azure Search Team
Contributions, Collaborations, and Feedback Welcome!
|<img width="200" src="https://mmlspark.blob.core.windows.net/graphics/emails/spark.svg"> | <img width="500" src="https://mmlspark.blob.core.windows.net/graphics/emails/spacer.jpg"> | <img width="200" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/96/Microsoft_logo_%282012%29.svg/800px-Microsoft_logo_%282012%29.svg.png"> |
|:--:|:--:|:--:|
Changes:
* 3bb48b8400e92d660355c10c9c6770f5d37f681a chore: bump version number
* b0797b37929968063a860ff8bc16900732c624a9 docs: Improve cog services on spark docs
* 8e966b3c098e6a6170221620638479fb7ec561c3 docs: Docs for Cognitive Services (647)
* eb0a421c360835b22dfefced8a841d0d39c10db8 docs: Improve VW on Spark Docs
* 54dbcadb21a5b4bc5147f61803a975436d7126ba docs: add VowpalWabbit documentation
* fb5b79f460dd3c57a19c6b658cb60ee64db0c949 docs: fix vw on spark description
* c0d5786aee8d41dda3361a5e5111a88275592327 docs: update readme badges and icons
* 071b6b0ab0ada8f3c1720949a6f3f84a16c2da87 docs: Add gitter badge
* 5c343567003af3546e3b62183b901429889edf76 docs: Add VW on Spark to table
* 1bdcdbfb4314d1e464c566b27806dace14a7bc20 chore: ignore .github folder for CI
<details><summary><b>See more</b></summary>
* 01d498c2f7c18bb57a3ecd2327482fc9696acd46 build: add sonatype publishing
* 8fab72d2662ed933d5fe551b1394a711b6145797 build: make e2e cancellable
* ddc7a4f910d391cb7b1b2d500fe37c48f3ecbc87 build: remove broken codecov flags (will reinstate when codecov fixes their service_
* 188cbdbf5a6d74e00e2351dfe78b994708bb0270 chore: Update issue templates
* f67b16aba8133cffeda350cd7be37577e64175a8 chore: fix welcome bot indenting
* eeb7eba1e0b3eda3996ed7a47451d1aa24b2286f fix: Fix logistic regression error when passing "--link logistic" (644)
* b6a4f9320697c264bf73b19879ca15c1e59b75f3 fix: fix socket timeout error (640)
* 856db6d5619ad30368576b6ee55577d24e91e030 build: add mcr publishing
* c6e44f95d96d3adc403e21985404e8527cebd6bf fix: fix issue with socket timeout in advanced handler
* 2425b7adbb7cc5f5a0ae56b19c864ebcc7445dc4 fix: update detect anomaly suite to make anomaly more pronounced
* 07c7fecf78af53d56f66565dd9b5033019eb71b1 style: run markdown through markdown linter
* a0e85f5a98ce01c14a3cf3ffca856282a3029822 build: increase setup timeouts
* 5c190f8eecd158fe32a318325ddd9f8fb94eb15d style: Fix style issues
* 4bf6f712fa64d43af0efd759813faaae94cf37a5 build: Add build cancel timeouts
* 915d68334eaeac2ed2fa8022bb5b4b3a3dadb039 build: add release job to Azure Pipelines
* e48f9cbea3c446888cf2005c129f8ede9cf513db build: Add github version badges
* 73581cbf19558df899cc909cb7e1aee3d7e5c72e build: fix flaky codecov upload
* ce1e66d3b17ca035a71dae9148d3adce611e1c37 build: fix e2e notebook cluster check
* 19aeb8037e3589fb6dbd25fe5840b54b2378ed98 build: Add behavior bot
* 72ccae226876f57f71cb8ff8e388b34ce05b7031 build: Make task retry part of bash script
* 16dd7f4eb55d7fa740c83d776599fb94598e361c Update formatting
* 3fe4db5934552edd34cc9f025faec0c5b2526a64 adding vagrant doc and fixing indentation in vagrantfile
* d58d6f41909ecafa057a5327374c1825331f66ce Vowpal Wabbit on Spark
* 95dc73464714793997dffa8050451e1e50cae4dc adding vagrant file back in, updated for sbt (622)
* 605c98f914a51661eb868a9d83adeaac3b6e2e37 Add flaky test retry
* 4ebbb41a08e73f731d556d97cf76a2df52a75b42 remove brittle dataset downloading from demos
* e572a9aa584616d249652a23f8bc218e3b64ebe6 try to Fix codecov upload
* fac542e2f6f80e51d8c62b5886b5804cc7481873 Add codecov to python tests
* b6ba62f4c6ae6d2e9a1d0df7bd9c3bf4e1c4cc52 Add test publishing tobuild
* 5cada6f78fee649adf2e7c413684b431edc8be23 Increase coverage and remove dead code
* ae191a6cb777ee7dde9572ff1bdf80e366a29a70 Fix build summary
* e18ec2e9cdf2af07c40682b5c228fb876001e8d5 leverage codecov.io's coverage capabilities
* 8e7626332f5da8757a12d2614ffb27b87ff3746f Improve noisy neighbor problems for e2e tests
* 6ab8916cc236dfc81c2d9b4d912f2903248083b8 add codecov file
* 70881b2930321019c48b175e38ed9b7998bdf9d4 improve test coverage
* 41da2b7af2bace4ce0715b50a1db050cd67207e3 improve flakiness
* aa3c98f22f26ea6f02eebaeea2ffa5a8d8e42cfe improve coverage
* 237d38821e9dbf23d6d187aa33b0de106066a724 Add Code Coverage badge
* 7146b9bc2af6da655b2c3061d9cf7edfcfdc517d Add unit test timeout
* fa87e427996ac270a9763b844d62411c610d48e6 Fix noisy neighbor search index tests
* 0f98f7df3169e4e648c5d01ecc54173baf8d8f10 add codeowners file
* 43218097e2b787b4b9009074b20a042e20367292 add codeowners file
* 80aecab8321423fb20c2d5bbc23362d514180472 Add upload to codecov.io
* 66db39fbed3e9660b9cdbf90afb065db9ce581d5 Split LGBM tests for speed
* a6998ec6b0fe068f064ad9600fa204c349b932b0 Update README.md
* 027e6d72f5473b8d570ca40385aad4019b39d15c Remove unused code
* 0205b7e692b70433775617e8013f665642df791e Squash with partition fix
* dc1554f00e0ed2829e65d0414da847ad59094e45 Add r package upload
* 2fbd81cacfcf5eaf526ca4f9f7332446c88836fe Fix pipeline retry
* 0fde5941b96e2993576a2453748fdca6bb6cb878 attempt to fix partition consolidator flakiness
* 7940967acb21c6fc77a05537c6cbdeb9db55da42 Add codecov
* 7e8225f7e34f7efa5bc44aa0e6731ab087424725 fix retry logic
* d8c0eb49080193aaa5ca36d0b39c9e65b9a4056e Increase timeout for e2e notebook tests
* ff059a310ef48aa408d1c01909526880376947d8 Add ability to retry pipeline
* 8cf91cabb166796726de86e81f64f0734a23c25a Simplify build pipeline
* 5c8c9032986138964f0d9d0acb6533ce3b8b8004 Delete runme
* 210b522324e93824bcf6e81897c81eb31d87a9b4 Update CNTK code in README
* da6e4977c1a1eb93495ec23ca97de18e34e6369a Update pipeline.yaml for Azure Pipelines
* e94631885c63de61b33dda7229902469e7d6bc12 Add build status bar
* 37d36af2acf66a46a1c44eec4ae403543061064f Enable PR builds
* 6c56326c1a5d78460052f51150ccaf70fd3b1f4c transition to new build system
* fb3e99e53d46ef5536dd2fa765e25b3d7ded07d8 Update dockerfile
* 637df9d34f508cd1c83542a69e922bc342b1fe0d Update documentation for new build
* e9ef538cdf75de1e243a21fb4a46e473d5f138a0 Improve test robustness
* d34f9d173d6f5cb0fbaa93a078bc339c28618549 Remove unused build scripts
* 4034a4fc9eeef54fac4f3710fdc738a904e026a7 Add doc publishing to build
* 36d8c3bd53686e94a8a054faf3f2efd161aa85eb Fixup after rebase
* 7c5e7b676974c21486704e71a3fa793d08f25d1c Get e2e tests working
* 07316a8c7db982f7f7b9cf9bc6793001c8cf9dbd Fix serialization fuzzing error
* f6df90771e93a209c4a846c462141de494c379ed Make recomendation tests faster
* dd99937b6eb3c023d2955a91f58e7133ca4bf248 Add python tests
* 02a8ac6c46acd0261c5b6bafa8a7ab4a05b14949 Add publish task
* 3a526c8c6ac0720e15ca22a7e0faeb24cac08bb6 Fix Test Errors and Improve Reliability
* 4a696c5548be2e505411b39a64af2bc669640a96 Parallelize Tests
* 2b75b62b8bd50239564ff5d1f50a94b003881bd2 Make build windows compatible
* 94e9b218a4bc1d6fc9134987d583924f4a83b983 Add developer-readme.md
* 5659287842bc09710076efe5fc5af2dcc82229a4 Fix python testing
* 987c7c49b9e10f9c3aa20f47c69fb133067387c9 Get python codegen to work
* 90089fa36a41260f8366d7ecce0cc24c06081f47 Add scalastyle and unidoc
* 79d41102fae2dd6e20f4aeafd77bdc9336ad1a24 Add secrets
* 5742c0e164d54f3b87e2e9007c249d45944f61ec Refactor build
* 77d7cb4f3c7f0c5eaf46883980754f9149d5d851 Move library into a single package
* 29c15cb52055d2598f25bd2249a738d0f2261c3d add barrier execution mode
* aac05361c454e4a4d383ca4f551f3a4051f1b35c fix default value for double array param in codegen
* 2bd2faf1295c8ffae43c9f528e676ddb2f0909ba fix wrapper generator for ranker models
* 6885ef5ea42942b6e134a341cd9f6f008e20e156 added lightgbm ranker model pyspark api
* 08b308585eefeebffb48df5857be1579bc6c5364 fix summarize data columns
* 044d0b5698fd99d30c874e3328a6b24cbda55acc reduce memory usage, fix frozen jobs, add more debug logging
* 45c91f98c7ed425beefec23bcd436690e1540dd7 defer lightgbm probability calculation to native core to fix multiclass bug in some scenarios (578)
* 44735200184151e180a3188fa315fa15a7fd18fa squish runs together
* 00ebf64bb34148d1cdc17f6108f31d471ec279c4 use right python version
* 216abea6317115d4a168cd533c1212ac2063bff3 updated readme. more mini images
* 3232d848d8de65a23a77908213ee9667f2c3a7a5 Fix flakey test
* e9a612bb803a346e8b3d3cbfdd18cc8f36653d39 Fix Entity Detector Suite
* ba3dbd0ea6eb654beb130bc79b9527ac62c2ef0e Improve service concurrency
* 75819a51fe88a16126e71bcb8f3376a8d8c4837e Add simple Anamoly Detector
* 17a765e6747dca6ab0f28cce047c7068bd3c31f2 Add `is_provide_training_metric` to LightGBMRanker.
* ceb52918c125ad844cf27fb812f30e9bcb5077ac Print metrics of validation data as well.
* b54363c9f78308505a25d0826c989326312b2c9a Implement `is_provide_training_metric` in Scala codes through JNI.
* c7e31e61fb93f198128a5777a5c786cdb9d8458f fix query column to support long type
* 6a6d57f40ecd25a23efae29b2d18671647dbdb3f Poke Build System
* 11fe799a3e6142c0788ec5a314d83e2c4f8cb1ee Fixing Cog Service Test
* 6eba0b6f4d612a35e4464bd955859efdf45eb803 ignore flaky test
* 53c4b9e0fd917b91cd7fb195ebe44822cdd212ee adding LightGBMRanker
* fa7785734a54c5e45c98c66196846be3e4682dbf add init score column for continued training
* 32ac35348312e57599c9275fcdba800765efc638 Add anomaly detection and speech to text services
* 06273b252d753be61c353a15a2a20455c92e3af2 improved compute model statistics error message
* e7a309c3d9ea0462cfd055e2d794cae7dfbe5fca pass through slot names to native structure
* b295dae1a53c7fe127a498e974554f854b316075 add batch training support in lightgbm classifier and regressor
This list of changes was [auto generated](https://msazure.visualstudio.com/Cognitive%20Services/_build/results?buildId=24338255&view=logs).</details>