Pyspider

Latest version: v0.3.10


0.3.4

Global
- New message queue support: [beanstalkd](http://kr.github.io/beanstalkd/) by tiancheng91
- New global argument: `--logging-config` to specify a custom logging config (to disable werkzeug logs, for instance). You can get a sample config from pyspider/logging.conf.
- Project `group` info is now added to the task package.
- Change the Docker base image to cmfatih/phantomjs; you can now use PhantomJS with the same Docker image.
- Automatically restart PhantomJS if it crashes; enabled by default only in `all` mode.

WebUI
- Show the next `exetime` of a task on the task page.
- Show fetch time and process time on the tasks page.
- Show average fetch time and process time over the last 5 minutes on the dashboard page.
- Show message queue status on the dashboard page.
- `limit` and `offset` parameter support in result dumps.
- Fix a frontend bug when crawling pages with data URLs.

Other
- Fix support for PhantomJS 2.0.
- Fix the scheduler not being informed of project updates; use the md5sum of the script as an additional signal.
- Scheduler: periodic counter reports in the log.
- Fetcher: fix for legacy versions of pycurl.

0.3.3

API
- `self.crawl` raises a TypeError when it gets unexpected arguments.
- `self.crawl` now accepts a cURL command as its first argument, see http://docs.pyspider.org/en/latest/apis/self.crawl/curl-command.
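
A minimal sketch of the cURL form, following the linked documentation; the URL and header here are placeholders:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # A cURL command line (e.g. copied via the browser's "Copy as cURL")
        # is passed as the first argument; pyspider extracts the URL, headers
        # and body from it. The URL and header are placeholders.
        self.crawl(
            "curl 'http://example.com/search?q=pyspider' -H 'User-Agent: Mozilla/5.0'",
            callback=self.index_page,
        )

    def index_page(self, response):
        return {"title": response.doc("title").text()}
```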

WEBUI
- A new CSS selector toolbar is added; the pre-generated CSS selector pattern can be modified and added/copied to the script.

Benchmarking
- The database table for the bench test is cleared before and after the bench test.
- insert/update/get bench tests for the databases and a put/get test for the message queue are added.

Other
- The default message queue is switched to AMQP.
- Documentation fixes.

0.3.2

Scheduler
- The size of the task queue is more accurate now; you can use it to determine the all-done status of the scheduler.

Fetcher
- Fix Tornado losing cookies during 30x redirects.
- You can now use `cookies` and a Cookie header at the same time (see the sketch after this list).
- Fix a bug where the proxy was not working.
- Enable proxy by default.
- Proxies now support username and password authorization, by soloradish.
- The ETag and Last-Modified headers are disabled when the last crawl failed.
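
A minimal sketch combining these fetcher options; the proxy address, credentials, URL and cookie values are placeholders:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
        # Authenticated proxy in username:password@host:port form (placeholders).
        'proxy': 'username:password@proxy.example.com:8080',
    }

    def on_start(self):
        self.crawl(
            'http://example.com/account',
            # `cookies` and an explicit Cookie header can now be used together.
            cookies={'session': 'placeholder'},
            headers={'Cookie': 'tracking=placeholder'},
            callback=self.detail_page,
        )

    def detail_page(self, response):
        return {'url': response.url, 'title': response.doc('title').text()}
```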

Databases
- MySQL: default engine changed to InnoDB, by laapsaap.
- MySQL: larger result column size, changed to MEDIUMBLOB (up to 16 MB), by laapsaap.

WebUI
- The WebUI now uses the same arguments as the fetcher, fixing a bug where the proxy did not work for the WebUI.
- Results are sorted in order of updatetime.

One Mode
- Script exception logs are printed to the screen.

New Command `send_message`

You can use the command `pyspider send_message [project] [message]` to send a message to a project from the command line.
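
On the receiving side, messages sent this way are handled by the project's `on_message` callback; a minimal sketch (the project name and payload are placeholders):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_message(self, project, msg):
        # Called when a message arrives, e.g. one sent with
        #   pyspider send_message my_project '{"hello": "world"}'
        # (project name and payload are placeholders).
        return {'from_project': project, 'message': msg}
```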

Other
- Use locally hosted test web pages.
- Remove the version pin on lxml; you can use apt-get to install any version of lxml.

0.3.1

One Mode

One mode not only means all-in-one, it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use `--interactive` to choose a task to test.

With `one` mode you can use `pyspider.libs.utils.python_console()` to open an interactive shell in your script context to test your code.
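
For example, a minimal sketch of dropping into the console from a callback while running under `one` mode; the URL is a placeholder:

```python
from pyspider.libs.base_handler import *
from pyspider.libs.utils import python_console


class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        # Opens an interactive shell in this frame so that `self` and
        # `response` can be inspected while running under `pyspider one`.
        python_console()
        return {'title': response.doc('title').text()}
```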

Full documentation: http://docs.pyspider.org/en/latest/Command-Line/one
- Bug fixes

0.3.0

- A lot of bugs fixed.
- Make pyspider a single top-level package. (thanks to zbb, iamtew and fmueller from HN)
- Python 3 support!
- Use [click](http://click.pocoo.org/) to create a better command line interface.
- PostgreSQL supported via SQLAlchemy (with the power of SQLAlchemy, pyspider also supports Oracle, SQL Server, etc.).
- Benchmark test.
- Documentation & tutorial: [http://docs.pyspider.org/](http://docs.pyspider.org/)
- Flake8 cleanup (thanks to jtwaleson)

Base
- Use MessagePack instead of pickle in the message queue.
- JSON data is encoded as a base64 string when the content is binary.
- RabbitMQ lazy limit for better performance.

Scheduler
- Never re-crawl a task with a negative age.

Fetcher
- The `proxy` parameter supports the `ip:port` format.
- Increase the default fetcher pool size to 100.
- PhantomJS returns the JS script result in [`Response.js_script_result`](http://docs.pyspider.org/en/latest/apis/Response/responsejs_script_result) (see the sketch after this list).
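
A hedged sketch of reading that result; the URL and the injected JavaScript are placeholders:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        self.crawl(
            'http://example.com/',
            fetch_type='js',  # fetch through PhantomJS
            js_script='function() { return document.title; }',  # placeholder script
            callback=self.index_page,
        )

    def index_page(self, response):
        # Whatever the injected function returned is available here.
        return {'js_result': response.js_script_result}
```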

Processor
- Put multiple new tasks in one package for better RabbitMQ performance.
- Do not store all of the headers when the fetch succeeds.

Script
- Add an interface to generate the taskid from the task object: [`get_taskid`](http://docs.pyspider.org/en/latest/apis/self.crawl/other) (see the sketch after this list).
- Tasks are de-duplicated by project and taskid.
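
A sketch of overriding `get_taskid` so that POST data contributes to the taskid, following the pattern in the linked docs (the `task['fetch']['data']` layout is assumed from that page):

```python
import json

from pyspider.libs.base_handler import *
from pyspider.libs.utils import md5string


class Handler(BaseHandler):
    def get_taskid(self, task):
        # The default taskid is derived from the URL only; mixing in the POST
        # body makes requests to the same URL with different payloads distinct.
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
```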

Webui
- Project list is sortable.
- Return a 404 page when dumping a project that does not exist.
- Web preview supports images.

0.2.0

Base
- MySQL and MongoDB backend support; you can use a database URI to set them up.
- RabbitMQ as the queue for distributed deployment.
- Docker supported.
- Support for Windows.
- Support for Python 2.6.
- A resultdb, result_worker and WEBUI are added.

Scheduler
- Cronjob tasks supported (see the sketch after this list).
- Deleting projects supported.
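
On the script side, a cronjob task is typically expressed with the `@every` decorator; a minimal sketch (the interval and URL are placeholders):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run on_start roughly once a day (placeholder interval)
    def on_start(self):
        self.crawl('http://example.com/news', callback=self.index_page)

    def index_page(self, response):
        return {'title': response.doc('title').text()}
```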

Fetcher
- A PhantomJS fetcher is added. Now you can fetch pages built with JavaScript/AJAX!

Processor
- `send_message` API to send messages to other projects.
- You can now import another project as a module via `from projects import xxxx`.
- `config` helper for setting configs for a callback (see the sketch after this list).
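
A sketch combining the three: `config` on callbacks, `send_message` to another project, and importing another project as a module (the project names, URL and `clean_title` helper are placeholders):

```python
from pyspider.libs.base_handler import *

# Another project can be imported as a module (placeholder project name).
from projects import helper_project


class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # pages from this callback stay valid for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        title = helper_project.clean_title(response.doc('title').text())  # placeholder helper
        # Forward the result to another project's on_message handler.
        self.send_message('archive_project', {'url': response.url, 'title': title})
        return {'url': response.url, 'title': title}
```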

WEBUI
- A CSS selector helper is added to the debugger.
- An option to switch the JS/CSS CDN.
- A page of task history/config.
- A page of recent active tasks.
- Pages of results.
- A demo mode is added for http://demo.pyspider.org/.

Others
- Bug fixes.
- More tests; coverage is measured.
