Toolkit & Dataset

MetaTool
Dataset | ICLR 2024 | A benchmark and dataset designed to evaluate whether LLMs are aware of when to use tools and can choose the correct ones.
DataGen
Toolkit | ICLR 2025 | DataGen is an LLM-powered framework designed to generate diverse, accurate, and highly controllable text datasets.
TrustLLM
Toolkit | ICML 2024 | TrustLLM (Python package) helps you assess the trustworthiness of your LLM more quickly.
TrustEval
Toolkit | NAACL 2025 Demo | TrustEval is a modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs). This toolkit enables you to evaluate models across various dimensions such as safety, fairness, robustness, privacy, and more.
ValueLence
Dashboard | ValueLence is the first unified platform for dynamic, fine-grained value probing of LLMs. It offers a seamless, end-to-end workflow for value curation, diverse probe generation, scalable response collection, and rigorous multi-dimensional evaluation and visualization.
ChemOrch
Toolkit | ChemOrch supports chemical research through intelligent task orchestration, automated workflows, and AI-powered insights. It transforms a target chemical task into high-quality instruction-response pairs.
SDE-Harness
Toolkit | SDE-Harness (Scientific Discovery Evaluation) is a comprehensive, extensible framework designed to accelerate AI-powered scientific discovery.