ICML 2024:
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu1, Ziyu Zhao1, Shuang Wei2, Ziwei Chai1, Qianli Ma3,
Guoyin Wang3, Xuwu Wang3, Jing Su3, Jingjing Xu3, Ming Zhu3, Yao Cheng3, Jianbo Yuan3
Jiwei Li1, Kun Kuang1, Yang Yang1, Hongxia Yang3, Fei Wu1
1Zhejiang University 2Rochester Institute of Technology 3ByteDance Inc.

InfiAgent-DABench is a project to build and evaluate agents for advanced data analysis. Agent evaluation remains an open and challenging problem.

Leaderboard

In this section, we provide a comprehensive evaluation of closed-source LLMs such as GPT-4 and GPT-3.5, as well as widely used open-source LLMs. In addition, we evaluate DAAgent, our instruction-tuned agent for data analysis.

Notice: We set temperature=0.2, top_p=1.0 and frequency_penalty=0.0 for all the models.


| Rank | Model Name | # Params. (in B) | Proportional Accuracy by Subquestions | Accuracy by Questions | Uniform Accuracy by Subquestions |
|------|------------|------------------|---------------------------------------|-----------------------|----------------------------------|
| 1 | gpt-4-0613 | / | 74.60% | 78.72% | 79.01% |
| 2 | daagent-34b | 34 | 60.77% | 67.50% | 62.98% |
| 3 | abab5.5-chat | / | 60.13% | 65.94% | 64.27% |
| 4 | gpt-3.5-turbo-0613 | / | 58.20% | 65.70% | 61.88% |
| 5 | qwen-72b-chat | 72 | 57.56% | 62.46% | 56.17% |
| 6 | daagent-13b | 13 | 54.52% | 60.51% | 55.72% |
| 7 | gemini-pro | / | 53.38% | 58.32% | 51.93% |
| 8 | mixtral-8x7b-instruct-v0.1 | 46.7 (12.9 active) | 49.20% | 54.02% | 51.38% |
| 9 | daagent-7b | 7 | 49.02% | 57.63% | 54.19% |
| 10 | deepseek-coder-33b-instruct | 33 | 44.84% | 48.90% | 46.31% |
| 11 | claude-2.1 | / | 43.41% | 49.65% | 46.96% |
| 12 | phind-codellama-34b-v2 | 34 | 42.02% | 47.41% | 44.40% |
| 13 | xwincoder-34b | 34 | 39.87% | 45.46% | 42.73% |
| 14 | mistral-7b-instruct-v0.2 | 7 | 36.45% | 41.61% | 36.90% |
| 15 | qwen-14b-chat | 14 | 36.45% | 41.36% | 34.51% |
| 16 | qwen-7b-chat | 7 | 27.27% | 27.27% | 16.00% |
| 17 | vicuna-13b-v1.5 | 13 | 26.26% | 30.88% | 26.31% |
| 18 | internlm-chat-20b | 20 | 23.79% | 26.82% | 25.05% |
| 19 | wizardcoder-python-34b-v1.0 | 34 | 23.13% | 26.45% | 22.40% |
| 20 | agentlm-7b | 7 | 16.99% | 20.71% | 17.89% |
| 21 | chatglm3-6b | 6 | 16.67% | 20.59% | 19.27% |
| 22 | codellama-34b-instruct | 34 | 14.56% | 17.39% | 13.94% |

Overview

The advent of Large Language Models (LLMs) has spurred the development of LLM-augmented Autonomous Agents (LAAs). These agents are capable of generating and executing code through ongoing interactions between their core LLM and the code execution environment. In this project, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. The benchmark consists of DAEval, a dataset of data analysis questions derived from CSV files, and an agent framework for evaluating LLMs as data analysis agents. This page describes the details of the InfiAgent-DABench framework, including dataset construction, evaluation metrics, analytical assessment, and procedural details for pipeline onboarding.

Dataset Construction

We construct data analysis questions and responses from existing CSV files; the construction pipeline is illustrated below. We split the dataset into a validation set and a test set. The validation set contains 311 questions over 55 CSV files. We only release the validation set publicly to avoid data leakage.

We categorize the CSV files in the dataset into nine distinct categories according to their domains:
  • Finance and Economics
  • Health and Medical
  • Demographics and Social Science
  • Marketing and Consumer Behavior
  • Energy and Environmental Monitoring
  • Transportation, Logistics, and Tourism
  • Culture, Entertainment, and Media
  • Scientific Research and Technology
  • Other Categories

Below is the pie chart depicting the categorical distribution:

We also conduct statistical analyses of the individual concepts associated with each question, accounting for questions that cover multiple concepts.
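As a rough illustration of this counting, a question contributes one count to each concept it covers. The sketch below assumes hypothetical per-question concept annotations; the concept names are placeholders, not the dataset's actual taxonomy:

    from collections import Counter

    # Hypothetical per-question concept annotations; a single question may
    # cover several concepts, and each of them is counted once per question.
    question_concepts = [
        ["summary statistics", "correlation analysis"],
        ["distribution analysis"],
        ["correlation analysis", "machine learning"],
    ]

    concept_counts = Counter(
        concept for concepts in question_concepts for concept in concepts
    )
    print(concept_counts.most_common())
    # [('correlation analysis', 2), ('summary statistics', 1), ...]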

Evaluation

For closed-form questions, we prompt LLMs with the question description. Since most models hardly follow the format requirements exactly, we add a reformatting step for all models, using gpt-3.5-turbo-16k to rewrite responses according to the format requirements. Here's a figure illustrating this process:
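As a rough illustration, the reformatting step could look like the sketch below, assuming the OpenAI Python SDK (v1); the function name and prompt are illustrative and not taken from the actual pipeline:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def reformat_answer(question: str, format_requirements: str, raw_response: str) -> str:
        """Ask gpt-3.5-turbo-16k to rewrite a model's free-form answer so that
        it matches the required answer format (illustrative prompt)."""
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo-16k",
            temperature=0.2,
            top_p=1.0,
            frequency_penalty=0.0,
            messages=[
                {"role": "system",
                 "content": "Reformat the answer so it satisfies the format requirements. Do not change its content."},
                {"role": "user",
                 "content": f"Question: {question}\nFormat requirements: {format_requirements}\nAnswer: {raw_response}"},
            ],
        )
        return completion.choices[0].message.content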

Metrics

For closed-form questions, we define the following metrics.

Proportional Accuracy by Subquestions (PASQ):

$$ \text{PASQ} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{M_i} \sum_{j=1}^{M_i} I_{ij} \right) $$

Here, $N$ is the total number of questions, $M_i$ is the number of subquestions of the $i$-th question, and $I_{ij}$ is the indicator that the $j$-th subquestion of the $i$-th question is answered correctly.

Accuracy by Questions (ABQ):

$$ \text{ABQ} = \frac{1}{N} \sum_{i=1}^{N} \left( \prod_{j=1}^{M_i} I_{ij} \right) $$

In this expression, the product $\prod_{j=1}^{M_i} I_{ij}$ equals 1 if all subquestions of the $i$-th question are answered correctly, and 0 otherwise.

Uniform Accuracy by Subquestions (UASQ):

$$ \text{UASQ} = \frac{1}{\sum_{i=1}^{N} M_i} \sum_{i=1}^{N} \sum_{j=1}^{M_i} I_{ij} $$

Here, the sum of the indicator values over all subquestions is normalized by the total number of subquestions in the dataset.
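All three metrics follow directly from the per-subquestion indicators $I_{ij}$. A minimal sketch (function and variable names are illustrative, not taken from the evaluation code):

    from typing import List

    def compute_metrics(indicators: List[List[bool]]) -> dict:
        """indicators[i][j] is True iff the j-th subquestion of the i-th
        question is answered correctly (the indicator I_ij above)."""
        n = len(indicators)
        total_sub = sum(len(q) for q in indicators)

        # PASQ: average over questions of each question's subquestion accuracy.
        pasq = sum(sum(q) / len(q) for q in indicators) / n
        # ABQ: a question counts only if all of its subquestions are correct.
        abq = sum(all(q) for q in indicators) / n
        # UASQ: every subquestion in the dataset is weighted equally.
        uasq = sum(sum(q) for q in indicators) / total_sub

        return {"PASQ": pasq, "ABQ": abq, "UASQ": uasq}

    # Example: two questions, the first with two subquestions, the second with one.
    print(compute_metrics([[True, False], [True]]))
    # {'PASQ': 0.75, 'ABQ': 0.5, 'UASQ': 0.666...}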

Pipeline Onboarding

Setup

  1. Initialize the environment using the following commands:

     conda create -n agent python==3.9.12
     conda activate agent
     pip3 install -r requirements.txt

  2. We also build a Python code sandbox in our pipeline based on Docker. You can build the Docker image with the following command:

     docker build -t myimg .

Example 1: Demo Usage

  1. You can easily start a demo using APIs with the following commands:

     # initialize poetry
     poetry init
     # Supported LLMs: OPEN_AI, AZURE_OPEN_AI
     # api_key is required for API-based models
     bash run_demo.sh --llm AZURE_OPEN_AI --api_key 123

  2. After running the above commands, an interactive front-end interface will be displayed.
  3. You can enter prompts in the dialogue box; if you need to upload a file, select it in the "browse files" section.
  4. Click the "run code interpreter" button, and the backend will execute our pipeline. The agent will generate code and execute it in the sandbox, and the results will be returned on the interactive page upon completion.

Example 2: Running With Local Models

  1. Our local LLM service is built on vLLM. First, start a vLLM model serving by running this command (at present, only Linux environments are supported):

     # Take llama-2-7b as an example
     python3 src/activities/vllm_api_server.py --model "meta-llama/Llama-2-7b-hf"  --served_model_name "meta-llama/Llama-2-7b-hf"

  2. You can check whether the serving has started successfully with the following command (a Python equivalent is sketched after this list):

     curl http://localhost:8000/v1/completions \
         -H "Content-Type: application/json" \
         -d '{
             "model": "meta-llama/Llama-2-7b-hf",
             "prompt": "San Francisco is a",
             "max_tokens": 7,
             "temperature": 0
         }'

  3. Then run the demo with the following command:

     bash run_demo.sh --llm "meta-llama/Llama-2-7b-hf"
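If you prefer Python over curl for the health check in step 2, here is a minimal sketch using requests, assuming the serving exposes the OpenAI-style /v1/completions endpoint shown above:

    import requests

    # Query the locally served model (mirrors the curl example above).
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": "meta-llama/Llama-2-7b-hf",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0,
        },
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes an OpenAI-style response payload with a "choices" list.
    print(resp.json()["choices"][0]["text"])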

Example 3: Running Without Front-end

    By default, our demo uses the front-end. If you prefer command-line operation, or need to process large amounts of data, you can refer to the following commands:

    # Run with API.
    python3 ./src/activities/eval.py --llm AZURE_OPEN_AI --api_key 123
    # Run with a local model. Take llama-2-7b as an example.
    python3 ./src/activities/eval.py --llm "meta-llama/Llama-2-7b-hf"

BibTeX

@misc{hu2024infiagentdabench,
      title={InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks}, 
      author={Xueyu Hu and Ziyu Zhao and Shuang Wei and Ziwei Chai and Qianli Ma and Guoyin Wang and Xuwu Wang and Jing Su and Jingjing Xu and Ming Zhu and Yao Cheng and Jianbo Yuan and Jiwei Li and Kun Kuang and Yang Yang and Hongxia Yang and Fei Wu},
      year={2024},
      eprint={2401.05507},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}