Yue Huang | 黄跃 — Toolkit&Dataset

MetaTool

Dataset | ICLR 2024 | A benchmark/dataset designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.

Paper Data Map Dataset

DataGen

Toolkit | ICLR 2025 | DataGen is an LLM-powered framework designed to generate diverse, accurate, and highly controllable text datasets.

Paper Toolkit

TrustLLM

Toolkit | ICML 2024 | TrustLLM (a Python package) helps you assess the trustworthiness of your LLM more quickly.

Project Page Paper Toolkit Docs. Dataset

TrustEval

Toolkit | NAACL 2025 Demo | TrustEval is a modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs). This toolkit enables you to evaluate models across various dimensions such as safety, fairness, robustness, privacy, and more.

Project Page Paper Toolkit Docs.

ChemOrch

Toolkit | NeurIPS 2025 | Revolutionizing chemical research through intelligent task orchestration, automated workflows, and AI-powered insights. Transform your target chemical task into high-quality instruction-response pairs.

Toolkit

EmoNest

Software | NeurIPS 2025 Creative AI | A generative AI framework for creating interactive, emotionally adaptive storytelling games.

Demo (to be released in December at the conference)

IntraAI

Software | IntraAI aims to achieve three goals: Accessible AI for All, Adaptive Learning and Growth, and Trust and Collective Understanding.

Software (to be released soon)

SDE-Harness

Toolkit | SDE-Harness (Scientific Discovery Evaluation) is a comprehensive, extensible framework designed to accelerate AI-powered scientific discovery.

Toolkit

ValueLence

Dashboard | ValueLence is the first unified platform for dynamic, fine-grained value probing of LLMs. It offers a seamless, end-to-end workflow for value curation, diverse probe generation, scalable response collection, and rigorous multi-dimensional evaluation and visualization.

Dashboard

ProbeLLM

Toolkit | An automated probing framework that discovers structured failure modes of LLMs via hierarchical Monte Carlo Tree Search. It combines UCB-guided Macro/Micro search, tool-augmented test generation (web retrieval, code execution, perturbation), and failure-aware clustering to surface recurring, interpretable weaknesses.

Toolkit
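The UCB-guided selection mentioned in the ProbeLLM description can be illustrated with the standard UCB1 rule used in Monte Carlo Tree Search. This is a minimal sketch, not ProbeLLM's actual code: the function names, the dictionary layout of a node, and the exploration constant are all assumptions made for illustration.

```python
import math

def ucb1_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCB1: mean reward plus an exploration bonus.

    Unvisited nodes get an infinite score so they are tried first.
    """
    if visits == 0:
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, c=math.sqrt(2)):
    """Pick the child with the highest UCB1 score.

    `children` is a list of dicts with hypothetical keys
    "name", "reward" (cumulative) and "visits".
    """
    # Parent visit count approximated as the sum of child visits (at least 1).
    parent_visits = max(1, sum(ch["visits"] for ch in children))
    return max(
        children,
        key=lambda ch: ucb1_score(ch["reward"], ch["visits"], parent_visits, c),
    )

if __name__ == "__main__":
    # Hypothetical search branches, e.g. candidate failure-mode topics.
    branches = [
        {"name": "arithmetic", "reward": 5.0, "visits": 10},
        {"name": "date reasoning", "reward": 1.0, "visits": 1},
    ]
    print(select_child(branches)["name"])
```

In a failure-probing setting, the reward would presumably reflect how often tests in a branch expose model errors, so the rule balances revisiting productive branches against exploring rarely tested ones.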

RiskLab

Toolkit | A controlled multi-agent interaction framework for instantiating, probing, and measuring emergent social risks in LLM-based agent collectives. Each risk is specified via a topology–environment–protocol–agent–task quintuple and evaluated by explicit risk indicators.

Toolkit Paper
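The topology–environment–protocol–agent–task quintuple in the RiskLab description could be represented as a small record type. The sketch below is purely illustrative and is not RiskLab's API: the class name, field types, validation rule, and the example indicator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class RiskSpec:
    """Hypothetical container for one risk scenario (five components)."""
    topology: Dict[str, List[str]]   # who can message whom (adjacency list)
    environment: Dict[str, object]   # shared world settings
    protocol: str                    # interaction rule, e.g. "round_robin"
    agents: Tuple[str, ...]          # agent identifiers
    task: str                        # what the collective is asked to do

    def validate(self) -> bool:
        """Check that every node in the topology is a declared agent."""
        known = set(self.agents)
        for src, dsts in self.topology.items():
            if src not in known or any(d not in known for d in dsts):
                return False
        return True

def flagged_fraction(transcript: List[str], keyword: str) -> float:
    """Toy risk indicator: share of messages that contain a keyword."""
    if not transcript:
        return 0.0
    return sum(keyword in msg for msg in transcript) / len(transcript)

if __name__ == "__main__":
    spec = RiskSpec(
        topology={"seller_1": ["seller_2"], "seller_2": ["seller_1"]},
        environment={"market": "duopoly"},
        protocol="round_robin",
        agents=("seller_1", "seller_2"),
        task="set prices each round",
    )
    print(spec.validate())
    print(flagged_fraction(["let us match prices", "ok"], "match prices"))
```

Keeping the specification separate from the indicator function mirrors the description's split between how a risk is instantiated and how it is measured.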

Research Artifacts