Tarsier

Latest version: v0.5.94

0.5.0

What's Changed
* Tag interfering with Xpath fix by KhoomeiK in https://github.com/reworkd/tarsier/pull/14
* Bump mypy from 1.7.0 to 1.7.1 by dependabot in https://github.com/reworkd/tarsier/pull/13
* fixed leaf text tagging by KhoomeiK in https://github.com/reworkd/tarsier/pull/16
* Tagging improvements by KhoomeiK in https://github.com/reworkd/tarsier/pull/18

New Contributors
* KhoomeiK made their first contribution in https://github.com/reworkd/tarsier/pull/14
* dependabot made their first contribution in https://github.com/reworkd/tarsier/pull/13

**Full Changelog**: https://github.com/reworkd/tarsier/compare/v0.4.0...v0.5.0

0.4.0

🎉 What's Changed
* ✍️ Fix readme citation link by Krupskis in https://github.com/reworkd/tarsier/pull/3
* ✍️Fix Citation Repository URL in Readme by debanjum in https://github.com/reworkd/tarsier/pull/4
* 🚀 Remove Annotations and Tag All text elements (optionally) by awtkns in https://github.com/reworkd/tarsier/pull/8
* 🆑 Make spans have red background with white text by awtkns in https://github.com/reworkd/tarsier/pull/9

👀 New Contributors
* Krupskis made their first contribution in https://github.com/reworkd/tarsier/pull/3
* debanjum made their first contribution in https://github.com/reworkd/tarsier/pull/4
* awtkns made their first contribution in https://github.com/reworkd/tarsier/pull/8

**Full Changelog**: https://github.com/reworkd/tarsier/compare/v0.3.1...v0.4.0

0.3.1

<img src="https://raw.githubusercontent.com/reworkd/Tarsier/main/.github/assets/tarsier.png" height="300" alt="Tarsier Monkey" />


🙈 Vision utilities for web interaction agents 🙈


<img alt="Python" src="https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54" />


<a href="https://reworkd.ai/">🔗 Main site</a>
  •  
<a href="https://twitter.com/reworkdai">🐦 Twitter</a>
  •  
<a href="https://discord.gg/gcmNyAAFfV">📢 Discord</a>


Announcing Tarsier
If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:
- How do you map LLM responses back into web elements?
- How can you mark up a page so that an LLM better understands its action space?
- How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects.
Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier!
The video below demonstrates Tarsier usage by feeding a page snapshot into a langchain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?
Tarsier works by visually "tagging" interactable elements on a page via brackets + an id such as `[1]`.
In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon.
We define interactable elements as buttons, links, or input fields that are visible on the page.
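
As a rough illustration of how such a mapping can be used, the sketch below pulls the first bracketed id out of a model reply and looks it up in a tag-to-XPath dictionary. The helper name `resolve_tag`, the reply wording, and the assumption that the mapping is keyed by integers are all hypothetical and not part of Tarsier's API.

```python
import re
from typing import Dict, Optional

# Matches a bracketed numeric tag such as "[12]".
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(llm_response: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first bracketed id in the reply, or None.

    Hypothetical helper: assumes the mapping is keyed by integer tag ids.
    """
    match = TAG_PATTERN.search(llm_response)
    if match is None:
        return None  # the reply did not mention any tag
    return tag_to_xpath.get(int(match.group(1)))


# Made-up mapping, for illustration only.
example_mapping = {1: "//a[@id='login']", 2: "//input[@name='q']"}
print(resolve_tag("Click [2] to focus the search box", example_mapping))
```

Returning `None` for an unknown or missing tag lets the calling agent re-prompt the model instead of acting on a stale id.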

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal.
This matters given the performance issues of existing vision-language models.
Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
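
To make the idea of a "whitespace-structured string" concrete, here is a minimal sketch that places OCR words on a character grid according to their pixel positions. It is not Tarsier's implementation: the `words_to_text` function, the `(text, x, y)` input shape, and the `char_width` and `line_height` defaults are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, int, int]  # (text, left x in pixels, top y in pixels)


def words_to_text(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Lay words out on a character grid so spacing mirrors the page layout."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        # Bucket each word into a text row and a starting column.
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    last_row = max(rows) if rows else -1
    for row in range(last_row + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                line += " "  # words would collide: separate with one space
            else:
                line = line.ljust(col)  # pad with spaces up to the column
            line += text
        lines.append(line)  # rows with no words become blank lines
    return "\n".join(lines)


# Made-up word boxes, for illustration only.
print(words_to_text([("Login", 0, 0), ("Search", 200, 0), ("Submit", 50, 40)]))
```

The horizontal padding and blank lines preserve a coarse sense of where things sit on the page, which is the property a text-only model relies on.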

Usage
Visit our [cookbook](https://github.com/reworkd/Tarsier/tree/main/cookbook) for agent examples using Tarsier:
- [An autonomous LangChain web agent](https://github.com/reworkd/tarsier/blob/main/cookbook/langchain-web-agent.ipynb) 🦜⛓️
- [An autonomous LlamaIndex web agent](https://github.com/reworkd/tarsier/blob/main/cookbook/llama-index-web-agent.ipynb) 🦙

Otherwise, basic Tarsier usage might look like the following:
```python
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
```
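
Once `page_to_text` has returned `tag_to_xpath`, an agent still needs to act on the tag it picked. The sketch below shows one possible way to do that with Playwright's `page.locator(...)` and the `xpath=` selector prefix. The `click_tag` helper is hypothetical and not part of Tarsier, and it assumes the mapping is keyed by integer tag ids.

```python
from typing import Any, Dict


async def click_tag(page: Any, tag_to_xpath: Dict[int, str], tag: int) -> bool:
    """Click the element tagged `tag`; return False if the tag is unknown.

    Hypothetical helper: `page` is expected to be a Playwright async Page.
    """
    xpath = tag_to_xpath.get(tag)
    if xpath is None:
        return False  # unknown tag: let the caller decide how to recover
    # Playwright accepts XPath selectors with an explicit "xpath=" prefix.
    await page.locator(f"xpath={xpath}").click()
    return True
```

Inside `main()`, a call such as `await click_tag(page, tag_to_xpath, 3)` would click whichever element was tagged `[3]`, assuming that tag exists on the page.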

Supported OCR Services
- [x] [Google Cloud Vision](https://cloud.google.com/vision)
- [ ] [Amazon Textract](https://aws.amazon.com/textract/) (Coming Soon)
- [ ] [Microsoft Azure Computer Vision](https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/) (Coming Soon)

Special shoutout to KhoomeiK for making this happen! ❤️
