Tech · AADS · Data Quality · 2024-12-02

Building QA Datasets for LLM Fine-Tuning
in Regulation & Governance (EU AI Act):
A Data Quality Perspective

2024-12-02
Pebblous Data Communication Team
한국어 | EN

20 QA samples built from 5 Regulation & Governance (EU AI Act) datasets

Executive Summary

The EU AI Act is the European Union's comprehensive regulatory framework for ensuring the safety of AI systems and protecting fundamental rights. This dataset systematically constructs QA pairs to enable LLMs to effectively learn the core aspects of this regulatory framework, supporting regulatory compliance automation and the development of domain-specific models in AI governance.

We built 20 QA pairs from 5 datasets, classified into 4 types (domain definition, data structure, AI model, quality management). Each type is designed for multi-layered understanding of regulatory text, and has been validated by domain experts to ensure accuracy and practical relevance.

With the full enforcement of the EU AI Act in 2024, demand for regulatory compliance among high-risk AI system operators is surging. This QA dataset can serve as foundational data for developing practical AI solutions such as RAG systems, regulatory advisory chatbots, and CoT-based multi-step reasoning models.

Introduction

This post introduces the process of building high-quality QA datasets for LLM fine-tuning in the Regulation & Governance (EU AI Act) domain.

A total of 20 QA pairs were constructed from 5 datasets, with each dataset providing 4 types (A, B, C, D) of question-answer samples.

As part of the AADS (Agentic AI Data Scientist) project, we present a systematic QA data construction methodology from a data quality perspective.

Dataset Composition

5
Datasets
20
QA Pairs
4
QA/Dataset

QA Type Descriptions

Type A: Domain Definition/Purpose

Questions about dataset purpose, background, scope of application, and domain definitions

Type B: Data Structure/Composition

Questions about data schemas, field composition, data relationships, and structural aspects

Type C: AI Model/Tasks

Questions about AI model application, machine learning tasks, algorithm selection, and AI-related topics

Type D: Quality/Process Management

Questions about data quality verification, process management, and standards compliance

5 Datasets in Detail

1️⃣ General Objective: Fostering the Single Market and Trustworthy AI

The general objective of the EU AI Act is to ensure the proper functioning of the single market and to create conditions for the development and use of trustworthy AI.

QA Sample (Type A):

Q: What are the general objectives of the EU AI Act, and what core values of the European Union are they based on?

A: The general objectives of the intervention are to ensure the proper functioning of the **single market** and to create conditions ensuring that AI systems are safe and comply with existing laws and **Union values**.

2️⃣ Specific Objective 1: Safety and Respect for Fundamental Rights

This objective ensures safety for high-risk AI systems while respecting fundamental rights and Union values.

QA Sample (Type B):

Q: What types of risks must providers of high-risk AI systems document in their technical documentation to mitigate potential impacts on fundamental rights?

A: Providers must document the **potential side effects** and **fundamental rights risks** posed by the AI system, covering all foreseeable situations that may affect the system's accuracy, fairness, robustness, and safety.

3️⃣ Specific Objective 2: Ensuring Legal Certainty

Provides a clear and harmonized legal framework to promote AI investment and innovation.

QA Sample (Type C):

Q: Regarding technical solutions for complying with the AI Act's regulatory requirements, what critical decisions must providers document?

A: Providers must document all decisions regarding **possible trade-offs** made in connection with the technical solutions adopted to comply with the requirements set out in Chapter III, Section 2.

4️⃣ Specific Objective 3: Strengthening Governance and Effective Enforcement

Strengthens the effective enforcement of fundamental rights and safety requirements applicable to AI systems.

QA Sample (Type D):

Q: As part of the governance system, how should the activities of the **Advisory Forum** be disclosed to the public?

A: The Advisory Forum must prepare an **annual report** on its activities, and this report must be **made publicly available**.

5️⃣ Specific Objective 4: Facilitating Single Market Development and Preventing Market Fragmentation

Facilitates the development of a single market for lawful, safe, and trustworthy AI applications while preventing market fragmentation.

QA Sample (Type A):

Q: What economic benefits does the fourth specific objective aim to achieve, and what negative outcomes does it seek to prevent from a regulatory perspective?

A: It aims to facilitate the **single market development** of lawful, safe, and trustworthy AI applications, and to prevent **market fragmentation** caused by divergent regulations among member states.

QA Statistics

Distribution by Type

Type A
5 (25.0%)
Type B
5 (25.0%)
Type C
5 (25.0%)
Type D
5 (25.0%)

Pebblous Perspective

1. The Importance of Domain Knowledge in Regulation & Governance (EU AI Act)

The Regulation & Governance (EU AI Act) domain requires specialized domain knowledge. This QA dataset is designed to enable LLMs to effectively learn the core concepts and practical knowledge of this field.

2. LLM Fine-Tuning Utilization Strategy

The 20 QA pairs constructed can be utilized for LLM fine-tuning in the following ways:

  • Domain-Specific Model Development: Building specialized chatbots and assistants for Regulation & Governance (EU AI Act)
  • Retrieval-Augmented Generation (RAG): Question-answering systems combined with vector databases
  • Multi-Step Reasoning: Chain-of-Thought (CoT) learning for solving complex domain problems
  • Quality Evaluation Benchmark: Gold standard for evaluating LLM response quality

Frequently Asked Questions (FAQ)

What can this QA dataset be used for?

It can be used for LLM fine-tuning in the Regulation & Governance (EU AI Act) domain, building RAG systems, developing domain-specific chatbots, and more.

How is the quality of the QA dataset ensured?

Each QA pair has been reviewed by domain experts and is systematically organized across 4 types (A, B, C, D).

Can the dataset be used commercially?

The specific license follows the original license of each dataset source. Please verify the source before commercial use.