What are the guidelines and requirements for data collection and storage used for AI model training?

Gist 1

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. (Article 10 (1))

The AI Act requires that high-risk AI systems be developed using training, validation, and testing data that meet specific quality criteria, emphasizing the importance of data quality.

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system. Those measures shall concern in particular (…) the relevant design choices; (…) transparency as regards the original purpose of data collection; (…) data collection processes; (…) data preparation processing operations, such as annotation, labelling, cleaning, updating enrichment and aggregation; (…) the formulation of assumptions (…) an assessment of the availability, quantity and suitability of the data sets that are needed; (…) examination in view of possible biases that are likely to affect the health and safety of persons, negatively impact fundamental rights or lead to discrimination; (…) the identification of relevant data gaps or shortcomings that prevent compliance with this Regulation, and how those gaps and shortcomings can be addressed. (Article 10 (2))

Article 10(2) sets out specific data governance requirements: transparency as regards the original purpose of data collection, documented data collection processes, data preparation operations (such as annotation, labelling, and cleaning), the identification and mitigation of potential biases, and the identification and remediation of any data gaps or shortcomings.
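
Purely as an illustration of what such governance measures might look like in practice, the sketch below logs hypothetical data preparation operations of the kind Article 10(2) names (cleaning, labelling, the formulation of assumptions). The schema, step names, and example data are our own assumptions, not anything the Act prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class GovernanceLog:
    """Hypothetical record of Article 10(2)-style data preparation steps."""
    steps: list = field(default_factory=list)

    def record(self, operation: str, detail: str) -> None:
        self.steps.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "operation": operation,   # e.g. "cleaning", "labelling", "assumption"
            "detail": detail,
        })

# Illustrative raw records; the schema is an assumption, not from the Act.
raw = [
    {"text": "loan approved", "label": "positive"},
    {"text": "", "label": "negative"},          # empty text: to be cleaned
    {"text": "loan denied", "label": None},     # missing label: to be annotated
]

log = GovernanceLog()

# Cleaning: drop records with empty inputs.
cleaned = [r for r in raw if r["text"]]
log.record("cleaning", f"dropped {len(raw) - len(cleaned)} empty-text records")

# Labelling: flag records that still need human annotation.
unlabelled = [r for r in cleaned if r["label"] is None]
log.record("labelling", f"{len(unlabelled)} records routed to human annotators")

# Formulation of assumptions: state what the data is taken to represent.
log.record("assumption", "records assumed representative of 2023 EU loan applicants")

for step in log.steps:
    print(step["operation"], "-", step["detail"])
```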

(aa) the nature of data likely or intended to be processed by the system and, in the case of personal data, the categories of natural persons and groups likely or intended to be affected; (Annex IV)

As per Annex IV, the technical documentation must describe the nature of data the system is likely or intended to process and, where personal data is involved, the categories of natural persons and groups likely or intended to be affected, underscoring the importance of transparency.

(d) where relevant, the data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection); (Annex IV)

This further highlights the need to provide detailed information about data collection, cleaning, selection and labelling methods for AI model training, promoting transparency and accountability.
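
As a hypothetical illustration of how the Annex IV datasheet information might be structured, the sketch below records provenance, scope, acquisition, labelling, and cleaning details for a fictitious data set. The field names and values are assumptions for illustration, not terms mandated by the Act.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Datasheet:
    """Hypothetical structure mirroring Annex IV's datasheet items;
    field names are illustrative, not prescribed by the Act."""
    dataset_name: str
    provenance: str            # where the data came from
    scope: str                 # coverage and main characteristics
    acquisition: str           # how the data was obtained and selected
    labelling_procedure: str   # e.g. annotation guidelines for supervised learning
    cleaning_methodology: str  # e.g. outlier-detection approach

sheet = Datasheet(
    dataset_name="credit-applications-v1",
    provenance="internal CRM export, 2020-2023",
    scope="EU retail loan applications; 1.2M rows",
    acquisition="random sample stratified by member state",
    labelling_procedure="dual annotation with adjudication of disagreements",
    cleaning_methodology="IQR-based outlier removal on income fields",
)

print(json.dumps(asdict(sheet), indent=2))
```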

Access to data of high quality plays a vital role in providing structure and in ensuring the performance of many AI systems, especially when techniques involving the training of models are used, with a view to ensure that the high-risk AI system performs as intended and safely and it does not become a source of discrimination prohibited by Union law. High quality training, validation and testing data sets require the implementation of appropriate data governance and management practices. (Recital 44)

Recital 44 re-emphasizes the importance of not only high-quality data in model training but also effective data governance and management practices.

In summary, the AI Act requires meticulous data governance for high-risk AI systems: transparency about the purposes and methods of data collection, examination of data for errors, biases, and gaps, documentation of the types of data processed, and disclosure of the provenance of training data sets. High-quality data is a recurring requirement, and strong emphasis is placed on the role of data in AI model training.

Gist 2

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria… Techniques that do not require labelled input data such as unsupervised learning and reinforcement learning shall be developed on the basis of data sets such as for testing and verification that meet the quality criteria. (Article 10)

Quality criteria for training, validation, and testing data sets are essential for the development of high-risk AI systems, irrespective of the learning technique used. This regulation aims to ensure that the data used to train these systems is reliable and of high quality to minimize errors and inaccuracies.
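
As a minimal, purely illustrative sketch of maintaining separate training, validation, and testing sets, the following shows a deterministic split; the fractions and seed are arbitrary choices, not values the Act specifies.

```python
import random

def split_dataset(records, val_frac=0.1, test_frac=0.1, seed=42):
    """Deterministically split records into train/validation/test sets.
    Fractions and seed are illustrative choices, not set by the Act."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```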

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system…Those measures shall concern in particular… data collection processes; data preparation processing operations, such as annotation, labelling, cleaning, updating enrichment and aggregation. (Article 10)

This highlights the requirement for rigorous data governance protocols addressing aspects such as design choices, original purpose of data collection, data collection processes, and more. This is to ensure that data used in AI model training is accurately interpreted and represented, thereby making the AI model reliable.

To the extent that it is strictly necessary for the purposes of ensuring negative bias detection and correction in relation to the high-risk AI systems, the providers of such systems may exceptionally process special categories of personal data… subject to appropriate safeguards for the fundamental rights and freedoms of natural persons, including technical limitations on the re-use and use of state-of-the-art security and privacy-preserving [measures]. (Article 10)

This means that special categories of personal data may be processed when training AI models, but only under stringent conditions: such use is permissible solely where strictly necessary for bias detection and correction, and appropriate technical and organizational safeguards must be in place to protect data security and the fundamental rights of the persons concerned.
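
As a hedged illustration of what a bias detection step might look like, the sketch below computes per-group selection rates (a demographic parity check) over already-pseudonymised group labels. The metric choice and data layout are assumptions; the Act does not prescribe a particular technique.

```python
from collections import defaultdict

def selection_rates(outcomes):
    """Per-group positive-outcome rates: a minimal bias-detection check.
    Group labels stand in for special category data processed under
    Article 10's safeguards; all names here are purely illustrative."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, outcome in outcomes:
        totals[group] += 1
        positives[group] += outcome
    return {g: positives[g] / totals[g] for g in totals}

# Hypothetical (group, approved?) pairs; group values already pseudonymised.
data = [("group_a", 1), ("group_a", 1), ("group_a", 0),
        ("group_b", 1), ("group_b", 0), ("group_b", 0)]

rates = selection_rates(data)
gap = max(rates.values()) - min(rates.values())
print(rates, f"demographic parity gap: {gap:.2f}")
```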

Prior to putting a high-risk AI system…into use…deployers shall conduct an assessment of the systems’ impact in the specific context of use. This assessment shall include, at a minimum… the reasonably foreseeable impact on fundamental rights of putting the high-risk AI system into use; specific risks of harm likely to impact marginalized persons or vulnerable groups. (Article 29a)

This underscores the need for a comprehensive impact assessment before using a high-risk AI system. This should consider potential effects on fundamental rights and potential risks to marginalized and vulnerable groups. Hence, data used to train these AI systems should also reflect these considerations.

The Commission, in consultation with the AI office, shall adopt a delegated act detailing the modalities for the establishment, development, implementation, functioning and supervision of the AI regulatory sandboxes, including the eligibility criteria and the procedure for the application, selection, participation and exiting from the sandbox, and the rights and obligations of the participants based on the provisions set out in this Article (Article 53a)

This indicates the use of AI regulatory sandboxes, which provide a controlled environment for developers to experiment and innovate. They are crucial for testing AI prototypes and ensuring that the systems meet all regulatory requirements, including data governance and management practices, before they are deployed.

Access to data of high quality plays a vital role in providing structure and in ensuring the performance of many AI systems, especially when techniques involving the training of models are used, with a view to ensure that the high-risk AI system performs as intended and safely and it does not become a source of discrimination prohibited by Union law. High quality training, validation and testing data sets require the implementation of appropriate data governance and management practices. (Recital 44)

This affirms the importance of high-quality data for the successful functioning of AI systems. It necessitates good data governance and management practices for maintaining the quality of training, validation, and testing data.

In order to protect the right of others from the discrimination that might result from the bias in AI systems, the providers should, exceptionally and following the application of all applicable conditions laid down under this Regulation and in Regulation (EU) 2016/679, Directive (EU) 2016/680 and Regulation (EU) 2018/1725, be able to process also special categories of personal data… (Recital 44)

This means that providers, having satisfied all the associated conditions, may process special categories of personal data under very specific circumstances. The provision is meant to prevent discrimination that might result from bias in AI systems.

The nature of data likely or intended to be processed by the system and, in the case of personal data, the categories of natural persons and groups likely or intended to be affected. (Annex IV - Point 1)

This clause of the Annex focuses particularly on data transparency and aims to ensure that the processing of data by AI systems does not unintentionally infringe on the privacy of individuals. AI systems should disclose the types of data they’re designed to process, with special emphasis on personal data.

Where relevant, the data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection). (Annex IV - Point 2)

This segment underlines the need for transparency in the AI system training process. It mandates detailed documentation of the AI system’s data requirements, including the origin, scope, and main features of the data.

Each of these provisions contributes to a robust and comprehensive framework for data collection and storage used in AI model training, meant to ensure high-quality and ethical AI outputs while preserving fundamental rights and freedoms.

Gist 3

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. (Article 10)

This provision introduces the requirement that high-risk AI systems trained with data be developed using training, validation, and testing data sets that conform to specific quality criteria. In practice, particular attention must be paid during data collection to data quality, which significantly affects the development and functioning of AI models.

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system. (Article 10)

This excerpt highlights the need for suitable data governance practices for the collection of training, validation, and testing data sets. It means that the data sets should be managed and controlled in a manner befitting their purpose and usage context, stressing the importance of effective data governance in the development of AI systems.

Providers having recourse to this provision shall draw up documentation explaining why the processing of special categories of personal data was necessary to detect and correct biases. (Article 10)

This provision requires documentation justifying why the processing of special categories of personal data was necessary to detect and correct biases. It seeks assurance that any use of such sensitive data aligns with data protection legislation.

In the AI regulatory sandbox personal data lawfully collected for other purposes may be processed solely for the purposes of developing and testing certain AI systems in the sandbox when all of the following conditions are met. (Article 54)

This provision from Article 54 permits the usage of previously collected personal data for developing and testing AI systems within the context of a regulatory sandbox. It implies that the data used within these contained environments must be lawfully obtained and used strictly for sandbox testing purposes, adhering to specified conditions for data processing.

The Commission shall develop, in consultation with the AI office, guidelines on the practical implementation of this Regulation, and in particular on: the application of the requirements referred to in Articles 8 - 15 and Article 28 to 28b. (Article 82b)

Finally, this quote from Article 82b mentions that the Commission, along with the AI office, will provide guidelines for practically implementing the regulations, including those connected to data collection and data governance for AI systems. This shows that practical guidelines will be provided, aiming to make the implementation of the laws more manageable and comprehensive.

(d) where relevant, the data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection). (Annex IV, section 2(d))

This section outlines the data requirements for AI model development. The technical documentation is expected to contain a dedicated data requirements section in which the training methodologies, techniques, and training data sets used are described in detail. This description should cover the provenance of those data sets, their scope, and their main characteristics. Moreover, the procedures for how the data was obtained and selected, the labelling procedures for supervised learning, and the data cleaning methodologies, including outlier detection, should also be stated clearly.

(b) a description of the architecture, design specifications, algorithms and the data structures including a decomposition of its components and interfaces, how they relate to one another and how they provide for the overall processing or logic of the AI system; the key design choices including the rationale and assumptions made, also with regard to persons or groups of persons on which the system is intended to be used; the main classification choices; what the system is designed to optimise for and the relevance of the different parameters; the decisions about any possible trade-off made regarding the technical solutions adopted to comply with the requirements set out in Title III, Chapter 2. (Annex IV, section 2(b))

This provision calls for complete transparency about how the AI system is constructed for data handling. Beyond describing the architecture, design specifications, and algorithms, the data structures must be well defined, including a decomposition of the system's components and how they relate to one another. It also requires the rationale and assumptions behind the key design choices, including assumptions about the persons or groups on which the system is intended to be used. Significant emphasis is placed on clarifying what the AI system is optimized for and on disclosing any trade-offs made in the technical solutions adopted to meet the regulatory requirements.

The right to privacy and to protection of personal data must be guaranteed throughout the entire lifecycle of the AI system. In this regard, the principles of data minimisation and data protection by design and by default, as set out in Union data protection law, are essential when the processing of data involves significant risks to the fundamental rights of individuals. (Recital 45a)

This emphasizes the importance of ensuring the privacy and security of personal data at all stages of an AI system's lifecycle. In essence, where the processing of data for AI involves significant risks to the fundamental rights of individuals, adherence to the principles of Union data protection law is mandatory, including data minimisation and data protection by design and by default.
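
To illustrate one possible "data protection by design" measure, the sketch below pseudonymises a direct identifier with a keyed hash and applies data minimisation by keeping only the fields needed for training. This is a hypothetical example, not a measure the Act mandates verbatim.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-and-store-securely"  # illustrative; manage via a KMS in practice

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (pseudonymisation).
    One possible 'data protection by design' measure; purely illustrative."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"customer_id": "alice@example.com", "income": 52_000}

# Data minimisation: keep only the fields the model actually needs,
# and pseudonymise the identifier before it enters the training set.
minimised = {
    "customer_ref": pseudonymise(record["customer_id"]),
    "income": record["income"],
}
print(minimised)
```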

Providers and users of AI systems should implement state-of-the-art technical and organisational measures in order to protect those rights. Such measures should include not only anonymisation and encryption, but also the use of increasingly available technology that permits algorithms to be brought to the data and allows valuable insights to be derived without the transmission between parties or unnecessary copying of the raw or structured data themselves. (Recital 45a)

This part discusses the responsibility of AI system providers and users to implement advanced technical and organizational measures to protect those rights. The recital names anonymisation and encryption as techniques for protecting personal data, and makes special mention of technologies that allow algorithms to be "brought to the data", rather than moving the data between parties. This approach minimizes the risk of data leakage during transmission and avoids unnecessary copying of the raw data.
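
The recital does not name a specific technique, but federated learning is one widely cited way of bringing the algorithm to the data. The deliberately simplified sketch below averages locally computed updates so that only parameters, never raw records, leave each party; the single-parameter "model" is an illustrative toy.

```python
# Minimal sketch of 'bringing the algorithm to the data': each party computes
# a local update and only the aggregated parameters travel, never the raw data.

def local_update(weight: float, local_data: list[float], lr: float = 0.1) -> float:
    """One gradient step fitting y = weight towards the local mean."""
    grad = sum(weight - y for y in local_data) / len(local_data)
    return weight - lr * grad

def federated_round(weight: float, parties: list[list[float]]) -> float:
    """Each party trains locally; only the updated weights are averaged."""
    updates = [local_update(weight, data) for data in parties]
    return sum(updates) / len(updates)

# Hypothetical data held by three parties; it never leaves this scope.
parties = [[1.0, 1.2], [0.8, 0.9], [1.1, 1.3]]
w = 0.0
for _ in range(50):
    w = federated_round(w, parties)
print(f"learned weight: {w:.3f}")  # approaches the mean of the parties' data
```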

Gist 4

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. (Article 10)

This quote means that high-risk AI systems employing models trained with data must be developed using training, validation, and testing data sets. These data sets should adhere to the quality criteria set out in the same Article, insofar as this is technically feasible for the specific market segment or application scope.

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system. (Article 10)

This provision makes it clear that data sets used for training, validation, and testing must be managed appropriately, taking into account their context of use and the intended purpose of the AI system. This points to the necessity of effective data collection and management protocols.

Training datasets, and where they are used, validation and testing datasets, including the labels, shall be relevant, sufficiently representative, appropriately vetted for errors and be as complete as possible in view of the intended purpose. (Article 10)

This requirement mandates that data sets used for training, validation, and testing, including their labels, must be relevant, sufficiently representative, appropriately vetted for errors, and as complete as possible in view of their intended purpose. This reinforces the need for carefully curated datasets for training AI systems.
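
As a minimal, hypothetical illustration of vetting a data set for errors and completeness before training, the sketch below checks for missing required fields and labels outside the agreed schema; the fields and label set are assumptions for illustration.

```python
def vet_dataset(records, required_fields, allowed_labels):
    """Minimal checks in the spirit of Article 10: completeness of
    required fields and validity of labels. Fields are illustrative."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in required_fields if rec.get(f) in (None, "")]
        if missing:
            problems.append((i, f"missing fields: {missing}"))
        if rec.get("label") not in allowed_labels:
            problems.append((i, f"invalid label: {rec.get('label')!r}"))
    return problems

records = [
    {"text": "approve", "label": "positive"},
    {"text": "", "label": "positive"},       # incomplete record
    {"text": "deny", "label": "maybe"},      # label outside the schema
]

for index, issue in vet_dataset(records, ["text", "label"], {"positive", "negative"}):
    print(f"record {index}: {issue}")
```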

To the extent that it is strictly necessary for the purposes of ensuring negative bias detection and correction in relation to the high-risk AI systems, the providers of such systems may exceptionally process special categories of personal data referred to in Article 9(1) of Regulation (EU) 2016/679, Article 10 of Directive (EU) 2016/680 and Article 10(1) of Regulation (EU) 2018/1725, subject to appropriate safeguards for the fundamental rights and freedoms of natural persons. (Article 10)

This highlights that special categories of personal data may exceptionally be processed, but only where strictly necessary to ensure bias detection and correction in high-risk AI systems, signifying the importance of careful handling of sensitive personal data in AI training.

The technical documentation of a high-risk AI system shall be drawn up before that system is placed on the market or put into service and shall be kept up-to-date. (Article 11)

This indicates that technical documentation covering how data are gathered and stored in compliance with given requirements should be prepared prior to the deployment of these high-risk AI systems and should be kept current.

Providers of foundation models shall, for a period ending 10 years after their foundation models have been placed on the market or put into service, keep the technical documentation referred to in paragraph 2(e) at the disposal of the national competent authorities. (Article 28b)

This clarifies the period during which the technical documentation must remain readily available: up to 10 years after the foundation model is placed on the market or put into service.

(aa) the nature of data likely or intended to be processed by the system and, in the case of personal data, the categories of natural persons and groups likely or intended to be affected; (Annex IV, item 1aa)

This emphasizes the need to document the kind of data intended to be processed by the AI system. Where personal data is involved, the categories of natural persons and groups likely or intended to be affected should be recorded.

Detailed information about the monitoring, functioning and control of the AI system, in particular with regard to: its capabilities and limitations in performance, including the degrees of accuracy for specific persons or groups of persons on which the system is intended to be used and the overall expected level of accuracy in relation to its intended purpose; the foreseeable unintended outcomes and sources of risks to health and safety, fundamental rights and discrimination in view of the intended purpose of the AI system; the human oversight measures needed in accordance with Article 14, including the technical measures put in place to facilitate the interpretation of the outputs of AI systems by the deployers; specifications on input data, as appropriate; (Annex IV, item 3)

This item sets out the requirement to provide detailed information about how the AI system is monitored and controlled, including its performance limitations and accuracy levels, especially in relation to specific groups of people. It also requires addressing foreseeable unintended outcomes and sources of risk, and clarifying the measures for human oversight.
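
To illustrate how the "degrees of accuracy for specific persons or groups" called for here might be computed, the following sketch reports overall and per-group accuracy; the (group, prediction, truth) layout is an assumed schema, not one the Act defines.

```python
from collections import defaultdict

def accuracy_by_group(examples):
    """Accuracy overall and per group, as Annex IV item 3 asks the
    documentation to report. The tuple layout is an assumed schema."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, pred, truth in examples:
        total[group] += 1
        correct[group] += int(pred == truth)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

examples = [("group_a", 1, 1), ("group_a", 0, 1),
            ("group_b", 1, 1), ("group_b", 0, 0)]
overall, per_group = accuracy_by_group(examples)
print(f"overall: {overall:.2f}", per_group)
```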

The data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection); (Annex IV, item 2d)

This clause specifies the requirement for datasheets describing the training methodologies and techniques and the training data sets used, encompassing specific information such as the data sets' provenance, scope, and main characteristics, as well as the data collection, selection, labelling, and cleaning methodologies. This ensures clarity of data usage and promotes integrity and transparency in AI systems.

A detailed description of the system in place to evaluate the AI system performance in the post-market phase in accordance with Article 61, including the post-market monitoring plan referred to in Article 61(3). (Annex IV, item 8)

This provision sets out the requirement for a comprehensive description of the system used to evaluate the AI system's post-market performance, including a detailed post-market monitoring plan in accordance with Article 61 of the Act.
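
As a hedged sketch of one element a post-market monitoring plan might include, the example below tracks rolling accuracy over a window and flags degradation below a threshold; the window size and threshold are illustrative choices, not values Article 61 prescribes.

```python
from collections import deque

class PostMarketMonitor:
    """Hypothetical post-market check: alert when rolling accuracy drops
    below a threshold. Window size and threshold are illustrative
    choices a provider's monitoring plan might specify."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def observe(self, prediction, truth) -> None:
        self.outcomes.append(int(prediction == truth))

    def degraded(self) -> bool:
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = PostMarketMonitor(window=5, threshold=0.8)
for pred, truth in [(1, 1), (1, 1), (0, 1), (0, 1), (1, 1)]:
    monitor.observe(pred, truth)
print("degradation detected:", monitor.degraded())  # 3/5 = 0.6 < 0.8 -> True
```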

These requirements collectively ensure that data collection, storage, and usage related to training of AI systems are transparent, accountable, and clearly defined. The need for thorough documentation promotes ethical standards, data security, and reinforces the integrity of the AI system being developed.