ChainForge


0.2

> **Note**
> This release includes a breaking change to how responses are cached. If you are working on a current flow, export your ChainForge flow to a `cforge` file before installing the new version.

We're closer than ever to hosting ChainForge on [chainforge.ai](http://chainforge.ai), so that no installation is required to try it out. Latest changes below.

The entire backend has been rewritten in TypeScript πŸ₯·πŸ§‘β€πŸ’»οΈ

Thousands of lines of Python code, comprising nearly the entire backend, have been rewritten in TypeScript. Generating prompt permutations, querying LLMs, and caching responses are now performed in the front end (entirely in the browser). Tests were added in `jest` to ensure the TypeScript functions produce the same outputs as their original Python versions. Static type checking brings additional performance and maintainability benefits. We've also added ample docstrings, which should help devs looking to get involved.

Functionally, you should not experience any difference (except maybe a slight speed boost).

JavaScript Evaluator Nodes 🧩

Because the application logic has moved to the browser, we've added JavaScript evaluator nodes. These let you write evaluation functions in JavaScript, and they function the same as Python evaluators.

Here is a side-by-side comparison of JavaScript and Python evaluator nodes, showing semantically equivalent code and the in-node support for displaying `console.log` and `print` output:

<img width="678" alt="Screen Shot 2023-06-30 at 12 08 27 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/09da964e-fd07-4cf2-a4c7-04fc0080b722">

When you are running ChainForge on `localhost`, you can still use Python evaluator nodes, which will execute on your local Flask server (the Python backend) as before. JavaScript evaluators run entirely in the browser (specifically, `eval` sandboxed inside an `iframe`).

HuggingFace Models πŸ€—

We added support for querying text generation models hosted on the [HuggingFace Inference API](https://huggingface.co/inference-api). For instance, here is [falcon.7b.instruct](https://huggingface.co/tiiuae/falcon-7b-instruct), an open-source model:

<img width="1107" alt="Screen Shot 2023-06-30 at 2 15 46 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/344fbc65-f4a4-4b9f-9496-3ddb427db34c">

For HF models, there is a 250-token limit per request. This can sometimes be rather limiting, so we've added a "number of continuations" setting to help with that. Set it to a value > 0 to feed the response back into the API (for text completion models), generating longer completions of up to 1500 tokens.
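
Conceptually, each continuation just feeds the text generated so far back in as the next prompt. Here is a rough sketch of that idea against the HF Inference API (illustrative only, not ChainForge's internals; the model URL and token are placeholders):

```python
# Illustrative sketch of the "number of continuations" idea: re-query the HF
# Inference API with the text generated so far appended to the prompt.
import requests

API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
HEADERS = {"Authorization": "Bearer hf_..."}  # your HF API token

def generate_with_continuations(prompt: str, num_continuations: int = 2) -> str:
    text = prompt
    for _ in range(1 + num_continuations):
        r = requests.post(API_URL, headers=HEADERS, json={
            "inputs": text,
            "parameters": {"max_new_tokens": 250, "return_full_text": False},
        })
        r.raise_for_status()
        text += r.json()[0]["generated_text"]  # append and continue where it left off
    return text[len(prompt):]  # return the completion alone
```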

We also support [HF Inference Endpoints](https://huggingface.co/inference-endpoints) for text generation models. Simply put the API call URL in the `custom_model` field of the settings window.

Comment Nodes ✏️

You can write comments about your evaluation using a comment node:

<img width="306" alt="Screen Shot 2023-06-30 at 2 18 03 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/e96df294-4b47-4575-9559-61883973d238">

'Browser unsupported' error πŸ’’

If you load ChainForge on a mobile device or unsupported browser, it will now display an error message:

<img width="500" alt="Screen Shot 2023-06-30 at 2 28 32 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/ecfc0b79-9859-4612-8ad2-f8f9bc459469">

This helps with our public release. If you'd like ChainForge to support more browsers, open an Issue or (better yet) make a Pull Request.

Fun example

Finally, I wanted to share a fun practical example: an evaluation to **check if the LLM reveals a secret key**. This evaluation, including all API calls and JavaScript evaluation code, was run entirely in the browser:

<img width="1788" alt="Screen Shot 2023-06-30 at 2 47 39 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/36cab316-419b-4257-980b-f6f6a3c82571">

Questions, concerns?

Open an Issue or start a Discussion!

This was a major, serious change to ChainForge. Although we've written tests, it's possible we have missed something, and there's a bug somewhere. **Note that unfortunately, Azure OpenAI πŸ”· support is again untested following the rewrite, as we don’t have access to it. Someone in the community, let me know if it works for you! (Also, if you work at Microsoft and can give us access, let us know!)**

A browser-based, hosted version of ChainForge will be publicly available July 5th (next Wednesday) on chainforge.ai πŸŒπŸŽ‰

0.1.7.2

This minor release includes two features:

Autosaving

Now, ChainForge autosaves your work to `localStorage` every 60 seconds.
This helps tremendously in case you accidentally close the window without exporting the flow, your system crashes, or you encounter a bug.

To create a new flow now, just click the New Flow button to get a new canvas.

Plots now have clear y-axis, x-axis, and group-by selectors on the Vis Node

We've added a header bar to the Vis Node, clarifying what is plotted on each axis / dimension:

<img width="588" alt="Screen Shot 2023-06-23 at 9 32 58 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/74fc0f47-9390-4937-836d-77d24daad380">

In addition, as you can see above, y-labels can now span up to two lines (~40 chars), making them easier to read.

Finally, when the number of generations per prompt is 1, we now output bar charts by default:

<img width="729" alt="Screen Shot 2023-06-23 at 9 35 06 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/7a1266b2-622a-480a-938d-889950c6e90e">

Box-and-whisker plots are still used whenever the number of generations n > 1.

Note that improving the Vis Nodes is a work-in-progress, and functionally, everything is the same as before.

0.1.7

We've made a number of improvements to the inspector UI and beyond.

Side-by-side comparison across LLM responses
Responses now appear side-by-side for up to five queried LLMs:

<img width="1387" alt="Screen Shot 2023-06-21 at 9 27 45 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/e739845c-cee5-422a-8567-505a195331dc">

Collapsible response groups
You can also collapse LLM responses grouped by their prompt template variable, for easier selective inspection. Just click on a response group header to show/hide:

https://github.com/ianarawjo/ChainForge/assets/5251713/452ab3ae-7a74-4b6c-a568-a6f14351b93d

Accuracy plots by default

Boolean (true/false) evaluation metrics now use accuracy plots by default. For instance, for ChainForge's prompt injection example:

<img width="602" alt="Screen Shot 2023-06-21 at 9 27 58 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/2509ca98-3b88-4b36-9e8c-8078b854871a">

This makes it extremely easy to see differences across models for the specified evaluation. Stacked bar charts are still used when a prompt variable is selected. For instance, here we plot a meta-variable, 'Domain', across two LLMs, testing whether the code outputs had an `import` statement (another new feature):

<img width="487" alt="Screen Shot 2023-06-21 at 10 22 51 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/41158437-ad54-4ba2-a989-a5fe071d6408">

Added 'Inspect results' footer to both Prompt and Eval nodes

The tiny response previews footer in the Prompt Node has been changed to an 'Inspect responses' button that brings up a fullscreen response inspector. In addition, evaluation results can easily be inspected by clicking 'Inspect results':

<img width="1560" alt="Screen Shot 2023-06-21 at 10 12 34 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/a3b642a7-ca34-475b-b8e7-a42b3f51d03c">

Evaluation scores appear in bold at the top of each response block:

<img width="1392" alt="Screen Shot 2023-06-21 at 10 13 54 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/af4c9e00-576c-4dfd-9308-42f985d46471">

In addition, both Prompt and Eval nodes now load cached results upon initialization. Simply load an example flow and click the respective Inspect button.

Added `asMarkdownAST` to `response` object in Evaluator node

Given how often developers wish to parse markdown, we've added a function `asMarkdownAST()` to the `ResponseInfo` class that uses the [`mistune` library](https://mistune.lepture.com/en/latest/) to parse markdown as an abstract syntax tree (AST).

For instance, here's code that detects whether an `import` statement appeared anywhere in the code blocks of a chat response:

<img width="510" alt="Screen Shot 2023-06-21 at 10 19 51 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/c12c46e3-3371-415b-8ae5-c5819a24fd6a">

0.1.6

Added 188 OpenAI Evals to Example Flows

We've added **188** example flows generated directly from [OpenAI evals](https://github.com/openai/evals) benchmarks.
In Example Flows, navigate to the "OpenAI Evals" tab, and click the benchmark you wish to load:

https://github.com/ianarawjo/ChainForge/assets/5251713/7a498255-3f44-411a-ae9c-dfdb4b789a7b

The code in each Evaluator is the appropriate code for each evaluation, as referenced from the [OpenAI eval-templates doc](https://github.com/openai/evals/blob/main/docs/eval-templates.md).
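
For instance, an 'includes'-style evaluation boils down to checking whether the ideal answer appears somewhere in the model's response. A hedged sketch (the `Ideal` column name is illustrative; each compiled flow uses whatever column its table defines):

```python
# Sketch of an 'includes'-style check, in the spirit of the OpenAI eval templates:
# pass if the expected answer appears anywhere in the model's response text.
def evaluate(response):
    return response.meta['Ideal'] in response.text
```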

Example: Tetris problems
For example, I was able to compare GPT-4's performance on `tetris` problems against GPT-3.5's, simply by loading the eval, adding GPT-4, and pressing run:

<img width="1691" alt="Screen Shot 2023-06-15 at 4 10 36 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/8ea3b4e9-8fbd-44e2-88b1-f9c930717916">

I was curious whether the custom system message had any effect on GPT-3.5's performance, so I added a version without it, and in 5 seconds found out that the system message had no effect:

<img width="1684" alt="Screen Shot 2023-06-15 at 4 13 38 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/fbed5c5f-b0a9-4fe8-a910-8b1a1b01385b">

Supported OpenAI evals

A large subset of OpenAI evals is supported. We currently display OpenAI evals with:
- a common system message
- a single 'turn' (prompt)
- evaluation types of 'includes', 'match', and 'fuzzy match',
- and a reasonable number of prompts (e.g., spanish-lexicon, which is not included, has 53,000 prompts)

We hope to add those with model evaluations (e.g., Chain-of-thought prompting) in the near future.

The `cforge` flows were precompiled from the `oaievals` registry. To save space, the files are not included in the PyPI chainforge package, but rather fetched from GitHub on an as-needed basis. We precompiled the evals to avoid forcing users to install OpenAI evals, as it requires Git LFS, Python 3.9+, and a large number of dependencies.

Note finally that, unlike the other examples, responses are not cached for these flows; you will need to query OpenAI models yourself to run them.

-----------------
Minor Notes
This release also:
- Changed `Textareas` to contenteditable `p` tags inside Tabular Data Nodes. Though this compromises usability _slightly_, there is a huge gain in performance when loading large tables (e.g., 1000 rows or more), which is required for some OpenAI evals in the examples package.
- Fixed a bug in `VisNode` where a plot was not displaying when a single LLM was present, the number of prompt variables was >= 1, and no variables were selected

If you run into any problems using OpenAI evals examples, or with any other part of CF, please let us know.
We could not manually test all of the new example flows, due to how many API calls would be required. Happy ChainForging!

0.1.5.3

This is an emergency release to add basic support for the new OpenAI models and 'function call' ability. It also includes support for Azure OpenAI endpoints, closing Issue #53.

OpenAI function calls
You can now specify the newest ChatGPT models (0613):

<img width="545" alt="Screen Shot 2023-06-13 at 5 36 50 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/a47cbb12-2744-4566-9781-09fe4eeb5ce2">

In addition, you can set the value of `functions` by passing a valid JSON schema object. This will be passed as the `functions` parameter of the OpenAI chat completions call:

<img width="646" alt="Screen Shot 2023-06-13 at 5 36 45 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/6106aab6-88f6-488e-a6ad-eff6b4108870">

I've created a basic example flow to **detect when a given prompt triggers a function call**, using OpenAI's `get_current_weather` example in their press release:

<img width="1432" alt="Screen Shot 2023-06-13 at 5 39 50 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/a4d9bd0f-0f51-4a48-9784-c5e3ac342f2b">

In the coming weeks, we will think about making this user experience more streamlined, but for now, enjoy being able to mess around!

Azure OpenAI API support

Thanks to community members chuanqisun, bhctest123, and levidehaan, we have now added Azure OpenAI support:

![245616817-23e0fcb3-5cee-4d76-8eeb-eb83f5b5fabc](https://github.com/ianarawjo/ChainForge/assets/5251713/df25a6c7-5a52-4274-8a46-d1c9c39a2cc2)

To use Azure OpenAI, you just need to set your keys in ChainForge Settings:

<img width="478" alt="Screen Shot 2023-06-13 at 5 57 24 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/34db6db5-a447-4745-a425-5cc9077e6af7">

And then make sure you set the right Deployment Name in the individual model settings. The settings also include OpenAI function calls (not sure if you can deploy 0613 models on Azure yet, but it's there).

As always, let us know if you run into any issues.

Collapsing duplicate responses

As part of this release, duplicate LLM responses (when the number of generations `n > 1`) are now detected and automatically collapsed in Inspectors. The number of duplicates is indicated in the top-right corner:

<img width="386" alt="Screen Shot 2023-06-13 at 12 03 54 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/6f489ff0-6f33-438f-be25-102ce69deb15">

0.1.5

We've added Tabular Data to ChainForge, to help conduct ground truth evaluations. Full release notes below.

Tabular Data Nodes πŸ—‚οΈ

You can now input and import tabular data (spreadsheets) into ChainForge. Accepted formats are `jsonl`, `xlsx`, and `csv`. Excel and CSV files must have a header row with column names.

Tabular data provides an easy way to enter associated prompt parameters or import existing datasets and benchmarks. A typical use case is **ground truth evaluation**, where we have some inputs to a prompt, and an "ideal" or expected answer:

<img width="1377" alt="Screen Shot 2023-06-10 at 2 23 13 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/e3dd6941-47d4-4eee-b8b1-d9007f7aae15">

Here, we see **variables `{first}`, `{last}`, and `{invention}` "carry together" when filling the prompt template**: ChainForge knows they are all associated with one another, connected via the row. Thus, it constructs 4 prompts from the input parameters.
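
Conceptually, the row is the unit of substitution, not the individual column. Here is a rough sketch of the idea (illustrative template and values, not ChainForge's implementation):

```python
# Sketch: variables from the same table row "carry together" when filling a
# prompt template, i.e. one prompt per row rather than a cross product of columns.
template = "Did {first} {last} really invent {invention}? Answer Yes or No."
rows = [
    {"first": "Thomas", "last": "Edison", "invention": "the phonograph"},
    {"first": "Grace", "last": "Hopper", "invention": "the compiler"},
]
prompts = [template.format(**row) for row in rows]  # len(prompts) == len(rows)
```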

Accessing tabular data, even if it's not input into the prompt directly
Alongside tabular data is a new property of `response` objects in Evaluation nodes: the `meta` dict. This gives you access to column data that is associated with the inputs to a prompt template, _but was not itself directly input into the prompt template_. For instance, in the new example flow for ground truth evaluation of math problems:

<img width="1770" alt="Screen Shot 2023-06-11 at 11 51 28 AM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/1611a9e4-c7d8-4f3f-92ff-a7c41bb230cf">

Notice the evaluator uses `meta` to get "Expected", which is _associated_ with the prompt input variable `question` by virtue of it being on the same row of the table.

```python
def evaluate(response):
  return response.text[:4] == \
         response.meta['Expected']
```


Example flows

Tabular data allows us to run many more types of LLM evaluations. For instance, here is the ground truth evaluation `multistep-word-problems` from [OpenAI evals](https://github.com/openai/evals), loaded into ChainForge:

<img width="1465" alt="Screen Shot 2023-06-10 at 9 08 05 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/12609e9f-e23e-4028-9b7e-ee2dc7d31147">

We've added an Example Flow for ground truth evaluation that provides a good starting point.

--------------------------
Evaluation Node output πŸ“Ÿ

Curious what the format of a `response` object is like? You can now `print` inside `evaluate` functions to print output directly to the browser:

<img width="375" alt="Screen Shot 2023-06-10 at 8 26 48 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/264a5661-4ae9-4468-9fd6-607ab95aa1f5">

In addition, exceptions raised inside your evaluation function will also print to the node output:

<img width="377" alt="Screen Shot 2023-06-10 at 8 29 38 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/70a96161-6fce-451b-9219-4a4aa31948bd">

--------------------------
Slight styling improvements in Response Inspectors

We removed the blue Badges used to display unselected prompt variables and replaced them with text that blends into the background:

<img width="327" alt="Screen Shot 2023-06-11 at 12 52 51 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/8f42ea57-7de0-4d1e-8a11-bb68e689baae">

The fullscreen inspector also uses a slightly larger font size for readability:

<img width="1461" alt="Screen Shot 2023-06-11 at 12 51 49 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/3c2ab466-09d8-4edd-a5a3-bcc6c32e34cd">

--------------------------
Final thoughts / comments
- Tabular Data was a major feature, as it enables many types of LLM evaluation. Our goal now is to illustrate what people can currently do in ChainForge through better documentation and connecting to existing datasets (e.g., OpenAI evals). We will also focus on quality-of-life improvements to the UI and adding more models/extensibility.
- We know there is a minor layout issue with the table not autosizing to best fit the width of cell content. This happens because some browsers do not appear to autofit column widths properly when a `<textarea>` is an element of a table cell. We are working on a fix so columns are automatically sized based on their content.

Want to see a feature / have a comment? Start a [Discussion](https://github.com/ianarawjo/ChainForge/discussions) or submit an [Issue](https://github.com/ianarawjo/ChainForge/issues)!
