- Rewrote `Scraper` to take the URL in its constructor, rather than passing it to each method. Pages are now fully scraped once, rather than once per function: previously, a website would be opened for every method called on it (once to scrape images, again to scrape tables, etc.), whereas now it is opened a single time and all contents are scraped from that one session. This reduces the number of "hits" each webpage bears when being scraped and improves runtime.
- For example, instead of doing the following:
```python
scraper = Scraper()
scraper.parse_tables("www.example.com")
scraper.parse_images("www.example.com")
scraper.parse_lists("www.example.com")
```
Do this instead:
```python
scraper = Scraper("www.example.com")
scraper.parse_tables()
scraper.parse_images()
scraper.parse_lists()
```
- The rewrite also replaces the private method `_soup_url` with a class attribute, `soup_url`, that stores its parsed output (see the sketch below).
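As a rough illustration only, here is a minimal sketch of what the new constructor might look like, assuming `requests` and `BeautifulSoup` under the hood (the changelog does not show the real internals):

```python
import requests
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, url: str):
        # Open one session and fetch the page a single time; every
        # parse_* method reuses the cached soup instead of re-fetching.
        self.url = url
        self._session = requests.Session()
        response = self._session.get(url)
        response.raise_for_status()
        # Formerly recomputed by the private method _soup_url on every
        # call; now parsed once and stored as an attribute.
        self.soup_url = BeautifulSoup(response.text, "html.parser")
```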
- Rewrote `print_tree` so it no longer leans on `tree_gen` at the call site. `tree_gen` behaves the same as before (it generates an `anytree` tree and returns the head node), but `print_tree` now takes the desired depth of the tree as a parameter rather than the head `Node` itself; a sketch of how the two might fit together follows the example below.
- For example, instead of doing the following:
```python
scraper = Scraper()
scraper.print_tree(scraper.tree_gen("www.example.com", 2))
```
Do this instead:
```python
scraper = Scraper("www.example.com")
scraper.print_tree(2)
```
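For context, a hedged sketch of how `print_tree` might now wrap `tree_gen` internally, using `anytree`'s `RenderTree`; the exact `tree_gen` signature here is an assumption:

```python
from anytree import Node, RenderTree


class Scraper:
    def tree_gen(self, depth: int) -> Node:
        # Unchanged behavior: builds an anytree tree of linked pages up
        # to `depth` levels and returns the head node (body elided).
        ...

    def print_tree(self, depth: int) -> None:
        # Builds the tree itself rather than requiring the caller to
        # pass the head Node in, then pretty-prints it.
        head = self.tree_gen(depth)
        for prefix, _, node in RenderTree(head):
            print(f"{prefix}{node.name}")
```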
- Updated docs and demo recording to match the rewrite.
- Added a destructor to the `Scraper` class.
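Presumably the destructor closes the open HTTP session; a minimal sketch under that assumption (the actual cleanup logic is not shown in the changelog):

```python
class Scraper:
    def __del__(self):
        # Close the HTTP session when the Scraper is garbage-collected.
        # getattr guards against __init__ having failed before the
        # session attribute was set.
        session = getattr(self, "_session", None)
        if session is not None:
            session.close()
```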
- Added an xlsx output option to `parse_tables` (usage sketched below).
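A sketch of how the xlsx option might look, assuming it is exposed as an output-path parameter and implemented with `pandas` (the parameter name `output` and the one-sheet-per-table layout are hypothetical; writing xlsx via pandas also requires `openpyxl`):

```python
from io import StringIO

import pandas as pd


class Scraper:
    def parse_tables(self, output: str | None = None):
        # Hypothetical signature: scrape every table on the cached page,
        # and if `output` names an .xlsx path, also write them out with
        # one sheet per table.
        tables = pd.read_html(StringIO(str(self.soup_url)))
        if output and output.endswith(".xlsx"):
            with pd.ExcelWriter(output) as writer:
                for i, table in enumerate(tables):
                    table.to_excel(writer, sheet_name=f"table_{i}", index=False)
        return tables
```

For example: `scraper.parse_tables(output="tables.xlsx")`.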