* Improve document extension list
* Add a few more video extensions
* Implement relative links (thanks Sebastian Nagel)
* Add filename and url metadata (thanks marianna13)
1.4.0
* Add text and video document types
1.3.1
* Rename to cc2dataset
1.3.0
* Support audio document type
* Restart Spark session for each part
* Improve error handling and logging
* Implement resume + speed up by reading file from S3 all at once
1.2.0
* Add try/catch on archive for broken WAT
* Implement multipart
* Shuffle + use date as output path + write WAT index files + shuffle input WAT