ICML 2024:
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu1, Ziyu Zhao1, Shuang Wei2, Ziwei Chai1, Qianli Ma3,
Guoyin Wang3, Xuwu Wang3, Jing Su3, Jingjing Xu3, Ming Zhu3, Yao Cheng3, Jianbo Yuan3
Jiwei Li1, Kun Kuang1, Yang Yang1, Hongxia Yang3, Fei Wu1
1Zhejiang University 2Rochester Institute of Technology 3ByteDance Inc.

InfiAgent-DABench is a project to build and evaluate agents for advanced data analysis. Agent evaluation remains an open and challenging problem.

Leaderboard

In this section, we provide a comprehensive evaluation of closed-source LLMs such as GPT-4 and GPT-3.5, as well as widely used open-source LLMs. In addition, we evaluate DAAgent, our instruction-tuned agent for data analysis.

Notice: We set temperature=0.2, top_p=1.0 and frequency_penalty=0.0 for all the models.


| Rank | Model Name | # Params. (in B) | Proportional Accuracy by Subquestions | Accuracy by Questions | Uniform Accuracy by Subquestions |
|------|------------|------------------|---------------------------------------|-----------------------|----------------------------------|
| 1 | gpt-4-0613 | / | 74.60% | 78.72% | 79.01% |
| 2 | daagent-34b | 34 | 60.77% | 67.50% | 62.98% |
| 3 | abab5.5-chat | / | 60.13% | 65.94% | 64.27% |
| 4 | gpt-3.5-turbo-0613 | / | 58.20% | 65.70% | 61.88% |
| 5 | qwen-72b-chat | 72 | 57.56% | 62.46% | 56.17% |
| 6 | daagent-13b | 13 | 54.52% | 60.51% | 55.72% |
| 7 | gemini-pro | / | 53.38% | 58.32% | 51.93% |
| 8 | mixtral-8x7b-instruct-v0.1 | 46.7 (12.9 active) | 49.20% | 54.02% | 51.38% |
| 9 | daagent-7b | 7 | 49.02% | 57.63% | 54.19% |
| 10 | deepseek-coder-33b-instruct | 33 | 44.84% | 48.90% | 46.31% |
| 11 | claude-2.1 | / | 43.41% | 49.65% | 46.96% |
| 12 | phind-codellama-34b-v2 | 34 | 42.02% | 47.41% | 44.40% |
| 13 | xwincoder-34b | 34 | 39.87% | 45.46% | 42.73% |
| 14 | mistral-7b-instruct-v0.2 | 7 | 36.45% | 41.61% | 36.90% |
| 15 | qwen-14b-chat | 14 | 36.45% | 41.36% | 34.51% |
| 16 | qwen-7b-chat | 7 | 27.27% | 27.27% | 16.00% |
| 17 | vicuna-13b-v1.5 | 13 | 26.26% | 30.88% | 26.31% |
| 18 | internlm-chat-20b | 20 | 23.79% | 26.82% | 25.05% |
| 19 | wizardcoder-python-34b-v1.0 | 34 | 23.13% | 26.45% | 22.40% |
| 20 | agentlm-7b | 7 | 16.99% | 20.71% | 17.89% |
| 21 | chatglm3-6b | 6 | 16.67% | 20.59% | 19.27% |
| 22 | codellama-34b-instruct | 34 | 14.56% | 17.39% | 13.94% |

Overview

The advent of Large Language Models (LLMs) has spurred the development of LLM-augmented Autonomous Agents (LAAs). These agents are capable of generating and executing code through ongoing interactions between their core LLM and the code execution environment. In this project, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. The benchmark consists of DAEval, a dataset of data analysis questions derived from CSV files, and an agent framework for evaluating LLMs as data analysis agents. This page describes the details of the InfiAgent-DABench framework, including dataset construction, evaluation metrics, analytical assessment, and procedural details for pipeline onboarding.

Dataset Construction

We construct data analysis questions and responses from existing CSV files; the construction pipeline is illustrated below. We split the dataset into a validation set and a test set. The validation set contains 311 questions over 55 CSV files. We only release the validation set publicly to avoid data leakage.

We categorize the CSV files in the dataset into nine distinct categories according to their domains:
  • Finance and Economics
  • Health and Medical
  • Demographics and Social Science
  • Marketing and Consumer Behavior
  • Energy and Environmental Monitoring
  • Transportation, Logistics, and Tourism
  • Culture, Entertainment, and Media
  • Scientific Research and Technology
  • Other Categories

Below is the pie chart depicting the categorical distribution:

We also conduct statistical analyses of the individual concepts associated with each question, accounting for questions that cover multiple concepts.
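As a rough illustration of this counting, a question contributes one count to each concept it covers. The sketch below assumes hypothetical per-question concept annotations; the concept names are placeholders, not the dataset's actual taxonomy:

    from collections import Counter

    # Hypothetical per-question concept annotations; a single question may
    # cover several concepts, and each of them is counted once per question.
    question_concepts = [
        ["summary statistics", "correlation analysis"],
        ["distribution analysis"],
        ["correlation analysis", "machine learning"],
    ]

    concept_counts = Counter(
        concept for concepts in question_concepts for concept in concepts
    )
    print(concept_counts.most_common())
    # [('correlation analysis', 2), ('summary statistics', 1), ...]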

Evaluation

For closed-form questions, we prompt LLMs with the question description. Since most models hardly follow the format requirements exactly, we add a reformatting step for all models, using gpt-3.5-turbo-16k to rewrite responses according to the format requirements. Here's a figure illustrating this process:
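As a rough illustration, the reformatting step could look like the sketch below, assuming the OpenAI Python SDK (v1); the function name and prompt are illustrative and not taken from the actual pipeline:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def reformat_answer(question: str, format_requirements: str, raw_response: str) -> str:
        """Ask gpt-3.5-turbo-16k to rewrite a model's free-form answer so that
        it matches the required answer format (illustrative prompt)."""
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo-16k",
            temperature=0.2,
            top_p=1.0,
            frequency_penalty=0.0,
            messages=[
                {"role": "system",
                 "content": "Reformat the answer so it satisfies the format requirements. Do not change its content."},
                {"role": "user",
                 "content": f"Question: {question}\nFormat requirements: {format_requirements}\nAnswer: {raw_response}"},
            ],
        )
        return completion.choices[0].message.content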

Metrics

For closed-form questions, we define the following metrics.

Proportional Accuracy by Subquestions (PASQ):

$$ \text{PASQ} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{M_i} \sum_{j=1}^{M_i} I_{ij} \right) $$

Here, $N$ is the total number of questions, $M_i$ is the number of subquestions of the $i$-th question, and $I_{ij}$ is the indicator that the $j$-th subquestion of the $i$-th question is answered correctly.

Accuracy by Questions (ABQ):

$$ \text{ABQ} = \frac{1}{N} \sum_{i=1}^{N} \left( \prod_{j=1}^{M_i} I_{ij} \right) $$

In this expression, the product $\prod_{j=1}^{M_i} I_{ij}$ equals 1 if all subquestions of the $i$-th question are answered correctly, and 0 otherwise.

Uniform Accuracy by Subquestions (UASQ):

$$ \text{UASQ} = \frac{1}{\sum_{i=1}^{N} M_i} \sum_{i=1}^{N} \sum_{j=1}^{M_i} I_{ij} $$

Here, the sum of the indicator values over all subquestions is normalized by the total number of subquestions in the dataset.
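All three metrics follow directly from the per-subquestion indicators $I_{ij}$. A minimal sketch (function and variable names are illustrative, not taken from the evaluation code):

    from typing import List

    def compute_metrics(indicators: List[List[bool]]) -> dict:
        """indicators[i][j] is True iff the j-th subquestion of the i-th
        question is answered correctly (the indicator I_ij above)."""
        n = len(indicators)
        total_sub = sum(len(q) for q in indicators)

        # PASQ: average over questions of each question's subquestion accuracy.
        pasq = sum(sum(q) / len(q) for q in indicators) / n
        # ABQ: a question counts only if all of its subquestions are correct.
        abq = sum(all(q) for q in indicators) / n
        # UASQ: every subquestion in the dataset is weighted equally.
        uasq = sum(sum(q) for q in indicators) / total_sub

        return {"PASQ": pasq, "ABQ": abq, "UASQ": uasq}

    # Example: two questions, the first with two subquestions, the second with one.
    print(compute_metrics([[True, False], [True]]))
    # {'PASQ': 0.75, 'ABQ': 0.5, 'UASQ': 0.666...}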

Pipeline Onboarding

Setup

  1. Initialize the environment using the following commands:

     conda create -n agent python==3.9.12
     conda activate agent
     pip3 install -r requirements.txt

  2. We also build a Python code sandbox in our pipeline based on Docker. You can build the Docker image with the following command:

     docker build -t myimg .

Example 1: Demo Usage

  1. You can easily start a demo using APIs with the following commands:

     # initialize poetry
     poetry init
     # Supported LLMs: OPEN_AI, AZURE_OPEN_AI
     # api_key is required for API-based models
     bash run_demo.sh --llm AZURE_OPEN_AI --api_key 123

  2. After running the above commands, an interactive front-end interface will be displayed.
  3. You can enter prompts in the dialogue box; if you need to upload a file, select it in the "browse files" section.
  4. Click the "run code interpreter" button, and the backend will execute our pipeline. The agent will generate code and execute it in the sandbox, and the results will be returned on the interactive page upon completion.

Example 2: Running With Local Models

  1. Our local LLM service is built on vLLM. First, start a vLLM model serving by running this command (at present, only Linux environments are supported):

     # Take llama-2-7b as an example
     python3 src/activities/vllm_api_server.py --model "meta-llama/Llama-2-7b-hf"  --served_model_name "meta-llama/Llama-2-7b-hf"

  2. You can check whether the serving has started successfully with the following command (a Python equivalent is sketched after this list):

     curl http://localhost:8000/v1/completions \
         -H "Content-Type: application/json" \
         -d '{
             "model": "meta-llama/Llama-2-7b-hf",
             "prompt": "San Francisco is a",
             "max_tokens": 7,
             "temperature": 0
         }'

  3. Then run the demo with the following command:

     bash run_demo.sh --llm "meta-llama/Llama-2-7b-hf"
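If you prefer Python over curl for the health check in step 2, here is a minimal sketch using requests, assuming the serving exposes the OpenAI-style /v1/completions endpoint shown above:

    import requests

    # Query the locally served model (mirrors the curl example above).
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": "meta-llama/Llama-2-7b-hf",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes an OpenAI-style response payload with a "choices" list.
    print(resp.json()["choices"][0]["text"])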

Example 3: Running Without Front-end

    By default, our demo uses the front-end. If you prefer command-line operation, or need to process large amounts of data, you can refer to the following commands:

    # Run with API.
    python3 ./src/activities/eval.py --llm AZURE_OPEN_AI --api_key 123
    # Run with a local model. Take llama-2-7b as an example.
    python3 ./src/activities/eval.py --llm "meta-llama/Llama-2-7b-hf"

BibTeX

@misc{hu2024infiagentdabench,
      title={InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks}, 
      author={Xueyu Hu and Ziyu Zhao and Shuang Wei and Ziwei Chai and Qianli Ma and Guoyin Wang and Xuwu Wang and Jing Su and Jingjing Xu and Ming Zhu and Yao Cheng and Jianbo Yuan and Jiwei Li and Kun Kuang and Yang Yang and Hongxia Yang and Fei Wu},
      year={2024},
      eprint={2401.05507},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}