Skip to main content

Claude 3.5 Sonnet: Computer use

LizAbout 1 minLLMClaudeComputer Use

Claude 3.5 Sonnet: Computer use

    1. About
    1. Industry Benchmarks
    1. API Support
    1. Features of Computer Use
    1. Future directions

1. About

Company:Anthropic

Official Articles:

1. Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haikuopen in new window

2. Developing a computer use modelopen in new window

Official Demo Videos:

https://www.youtube.com/playlist?list=PLf2m23nhTg1NcDu3_eZavbTJ3Dow9QQRB

Code Implementation:

https://github.com/anthropics/anthropic-quickstarts/tree/main/computer-use-demo

2. Industry Benchmarks

  • SWE-bench Verified(coding)
  • TAU-bench(tool use)
  • OSWorld (computer control)

Claude 3.5 Performance Evaluation Results:

For OSWorld, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category.

Claude3.5 Performance Evaluation Results
Claude3.5 Performance Evaluation Results

3. API Support

Support for calls via Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI

  • Anthropic API
  • Amazon Bedrock
  • Google Cloud’s Vertex AI

4. Features of Computer Use

Use computers the way people do — by looking at a screen, moving a cursor, clicking buttons, and typing text.

    1. translate instructions into computer commands (e.g. fill out a form with the data from those pages; and so on)
    • user commands
      • use data from my computer and online to fill out this form
    • translate into a series of computer commands
      • check a spreadsheet;
      • move the cursor to open a web browser;
      • navigate to the relevant web pages;
      • fill out a form with the data from those pages.
    1. to interpret what’s happening on a screen and then use the software tools available to carry out tasks
    1. Position the cursor to a specific location on the screen
    • calculate how many pixels the cursor needs to move vertically or horizontally to click on the right spot
    • Training Claude to count pixels accurately was critical.
    • Claude trained on this.
    1. self-correct and retry tasks when it encountered obstacles

5. Future directions

  • Improve safety
  • Reduce costs
  • Increase speed
  • Reduce Error Rate and increase Reliability
  • Support more operations
    • Operations such as drag and drop, zoom, etc. are not yet supported.
  • Performance Improvement of Visual Large Language Models (VLMs)
    • Current VLMs are generally ‘myopic’ in the sense that they can only extract approximate and abstract visual information from an image, and do not really see the picture as a human would.
  • Changing viewing screenshots to finer-grained video streams
    • Claude observes the screen by taking screenshots that are ‘flip-book’ style, i.e. taking screenshots and piecing them together, rather than observing a finer-grained video stream, which means that it may miss brief actions or notifications.
  • Combination of visual and code manipulation
    • Relying on vision alone doesn't improve speed or accuracy enough