LLM Eval: Benchmarking Suite for Large Language Models

A Python package for benchmarking and evaluating large language models (LLMs) on standard benchmark datasets. Benchmark parameters are set through a JSON configuration file. The package supports multiple datasets and evaluation metrics, and is designed to be easily extended with new ones.
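
To make the configuration-driven workflow concrete, here is a minimal sketch of what a JSON config and a benchmark run might look like. The config schema, the function names (`run_benchmark`, `evaluate`), and the dataset/metric names are illustrative assumptions for this sketch, not the package's actual API.

```python
import json
from pathlib import Path

# Illustrative config: the keys and the dataset/metric/model names are
# placeholders, not the package's actual schema.
config = {
    "model": "gpt2",
    "datasets": ["hellaswag", "truthfulqa"],
    "metrics": ["accuracy", "f1"],
    "max_samples": 200,
}
Path("benchmark_config.json").write_text(json.dumps(config, indent=2))


def evaluate(model: str, dataset: str, metric: str, max_samples: int) -> float:
    """Stub scorer: a real implementation would load the dataset,
    query the model, and compute the requested metric."""
    return 0.0


def run_benchmark(config_path: str) -> dict:
    """Read the JSON config and score every dataset/metric pair."""
    cfg = json.loads(Path(config_path).read_text())
    return {
        (ds, m): evaluate(cfg["model"], ds, m, cfg["max_samples"])
        for ds in cfg["datasets"]
        for m in cfg["metrics"]
    }


print(run_benchmark("benchmark_config.json"))
```

Keeping datasets and metrics as lists in the JSON file means adding a new benchmark is a config change rather than a code change, which is one way the extensibility goal could be realized.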

GitHub