Adds new Anthropic Claude 3 models.
* Backend now uses the `messages` API for Claude 2.1+ models.
* Adds a `system` message parameter to the Claude settings.
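For reference, here is a minimal sketch of the request shape the `messages` API expects, with `system` as a top-level parameter rather than a message role. The model id and prompts are placeholders; with the official `anthropic` Python client, a dict like this would be passed to `client.messages.create(**request)`.

```python
# Sketch of a Messages API request body (model id and prompts are
# placeholders; check Anthropic's docs for current model names).
request = {
    "model": "claude-3-opus-20240229",
    "max_tokens": 256,
    # Unlike completions-style prompting, the system prompt is a
    # top-level field here, not a message with role "system":
    "system": "You are a concise assistant.",
    "messages": [
        {"role": "user", "content": "Summarize ChainForge in one sentence."},
    ],
}
```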
Adds browser-sandboxed Python with [Pyodide](https://pyodide.org/en/stable/).
You can now run Python in a safe sandbox entirely in the browser, provided you do not need to import third-party libraries.
The **web-hosted version** at [chainforge.ai/play](https://chainforge.ai/play/) now has Python evaluators unlocked:
<img width="1661" alt="Screen Shot 2024-03-05 at 11 08 46 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/a05ec44e-c99c-426e-a9e7-23b42017b359">
The **local version** of ChainForge includes a toggle to turn sandboxing on or off:
<img width="402" alt="Screen Shot 2024-03-03 at 9 23 18 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/1e2f6be3-2b63-4f57-9c0f-690a7fd62a4b">
If you turn sandboxing off, you get the previous Python evaluator, which executes on your local machine through the Flask backend. In the non-sandboxed evaluator node, you can import any libraries available in your local Python environment.
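As a sketch of what runs in the sandbox: an evaluator that sticks to the standard library works in either mode. The `evaluate(response)` signature and `response.text` field follow ChainForge's evaluator-node convention; `FakeResponse` below is just a stand-in for trying the function outside of ChainForge.

```python
import json

def evaluate(response):
    """Score 1 if the LLM's output is valid JSON, else 0.
    Uses only the standard library, so it also runs under Pyodide."""
    try:
        json.loads(response.text)
        return 1
    except ValueError:
        return 0

class FakeResponse:
    """Stand-in for ChainForge's response object, for testing outside the app."""
    def __init__(self, text):
        self.text = text

print(evaluate(FakeResponse('{"ok": true}')))  # 1
print(evaluate(FakeResponse("not json")))      # 0
```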
**Why sandboxing?**
The benefit of sandboxing is that ChainForge can now safely execute Python code generated by LLMs, via `eval()` or `exec()` inside your evaluation function. This was possible before, but dangerous. Benchmarks that do not rely on third-party libraries, such as HumanEval at pass@1, could be run within ChainForge entirely in the web browser (if anyone wants to set this up, let me know!).
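A hypothetical sketch of what such an evaluator could look like: `exec()` the LLM-generated function definition, then run assertion-style tests against it, all inside the sandbox. The function and variable names here are illustrative, not part of ChainForge.

```python
def passes_tests(candidate_code, test_code):
    """Exec LLM-generated code, then its unit tests, in one namespace.
    Returns True iff the code defines cleanly and all assertions pass."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # define the generated function(s)
        exec(test_code, namespace)       # run assertions against them
        return True
    except Exception:
        return False

generated = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(passes_tests(generated, tests))  # True
```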
add-prettier
Hi folks,
Thanks to PRs https://github.com/ianarawjo/ChainForge/pull/223 and https://github.com/ianarawjo/ChainForge/pull/222 by massi-ang, we have added Prettier and ESLint to ChainForge's `main` branch.
`prettier` and `eslint` now run on `npm run build`, and you are encouraged to run them before submitting any PRs to the ChainForge `main` branch.
We know this is somewhat annoying for anyone building on top of ChainForge, because it may make rebasing on top of the latest `main` changes a chore. This includes myself: the changes in the `multi-eval` branch, which I have been working on for a while now, are even harder to merge. However, consistent formatting and linting set better standards for developer contributions than the ad-hoc approach to writing code we had before.
Recently, I have had less time for code hygiene tasks on this project. However, I think **converting the entire front-end code to TypeScript** is the next step. This would provide more guarantees on dev contributions, may catch existing bugs, and would let us enforce a standardized, extensible `ResponseObject` format across ChainForge. The latter:
- would give people adding their own widgets guarantees about the format and type of responses
- would be easily extensible to additional data formats, such as images as input for GPT-4 Vision or images as responses from DALL-E
Additionally, I envision:
- Better encapsulation of how responses are displayed in Inspectors, i.e. a dedicated React component like `ResponseBox` that can then be extended to handle image outputs, if present.
- Better storage for internal responses (i.e., the format with “query”) that minimizes repeated info for LLM settings. Duplicated LLM settings are inflating file sizes fast; each LLM-at-particular-settings combination should be a uid pointing into a lookup table.
- Better / updated example flows, e.g. comparing prompts, testing JSON format, multiple evaluations
- Dev docs for how to create a new node
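To illustrate the settings-deduplication idea above (a sketch, not ChainForge's actual schema): derive a short uid from each unique settings dict, store the settings once in a lookup table, and have each response record reference only the uid.

```python
import hashlib
import json

settings_table = {}  # uid -> full LLM settings dict, stored once

def settings_uid(settings):
    """Derive a stable uid from an LLM-settings dict and register it."""
    canonical = json.dumps(settings, sort_keys=True)
    uid = hashlib.sha1(canonical.encode()).hexdigest()[:8]
    settings_table[uid] = settings
    return uid

uid = settings_uid({"model": "claude-3-opus-20240229", "temperature": 0.7})
# Each stored response now carries a short reference, not the full settings:
record = {"text": "example output", "llm": uid}
```

Because the uid is derived deterministically from the canonicalized settings, identical settings always map to the same table entry, no matter how many responses reference them.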
It doesn't seem like LLMs are going anywhere, and evaluating their output quality still suffers from the same issues. If we work together, we can make ChainForge a go-to graphical interface for "testing stuff out": rapid prototyping of prompt and chain ideas and rapid testing of LLM behavior, beyond ad-hoc chatting, CLIs, or having to write code.
ChainForge is built on transparency and complete control. We always intend to [show the prompts](https://hamel.dev/blog/posts/prompt/) to developers. Developers should have access to the exact settings used for the model, too. If ChainForge adds, say, prompt optimization, it will be important to always show the prompts.
Let us know what you think of these changes, or what you'd like to see in the future. If you are a developer, **please consider contributing!** :)