I want to develop a simple AI system that would help employers monitor and manage the workload in their organisation. What kind of data can I use to train, validate and test the system?

Gist 1

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. (Article 10)

Data used to train, validate, and test AI systems deemed high-risk (which may include a workload management system, depending on its functionality and potential impact) must meet specific quality criteria. In practice, this means selecting and processing data carefully, with the AI system's specific purpose and market context in mind.

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system. Those measures shall concern in particular… data collection processes… data preparation processing operations, such as annotation, labelling, cleaning, updating, enrichment and aggregation… an assessment of the availability, quantity and suitability of the data sets that are needed. (Article 10)

Data governance measures must fit the AI system's context of use and intended purpose. This means planning data collection and preparation processes carefully, and thoroughly assessing whether the available data sets are suitable and sufficient in quantity for the system's purpose and application.
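The preparation operations the Article lists (cleaning, annotation, suitability assessment) can be made auditable in code. The sketch below is purely illustrative: the record shape and the names `WorkloadRecord` and `prepare_records` are assumptions for this example, not anything prescribed by the Act.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkloadRecord:
    """One hypothetical workload observation for an employee."""
    employee_id: str
    hours_logged: float
    tasks_completed: int
    label: Optional[str] = None  # e.g. "overloaded" / "balanced", added during annotation

def prepare_records(raw: list) -> tuple:
    """Drop obviously invalid rows and report basic suitability figures
    (how much data was collected, retained, and labelled)."""
    cleaned = [r for r in raw if r.hours_logged >= 0 and r.tasks_completed >= 0]
    report = {
        "collected": len(raw),
        "retained_after_cleaning": len(cleaned),
        "labelled": sum(1 for r in cleaned if r.label is not None),
    }
    return cleaned, report
```

Keeping a report like this for every preparation run gives you the "assessment of availability, quantity and suitability" evidence the Article asks for.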

AI systems intended to be used for recruitment or selection of natural persons, notably for placing targeted job advertisements, screening or filtering applications, evaluating candidates in the course of interviews or tests; AI systems intended to be used to make or materially influence decisions affecting the initiation, promotion and termination of work-related contractual relationships, task allocation based on individual behavior or personal traits or characteristics, or for monitoring and evaluating performance and behavior of persons in such relationships. (Annex III, Point 4: Employment, workers management and access to self-employment)

The system in question, aimed at workload management, is likely to fall within the high-risk bracket defined in Annex III, which covers AI systems used in employment and worker-management contexts. More stringent data governance obligations would therefore apply to its development and training.

(aa) the nature of data likely or intended to be processed by the system and, in the case of personal data, the categories of natural persons and groups likely or intended to be affected; (d) where relevant, the data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection). (Annex IV, 1.aa and 2.d)

Finally, your technical approach to data must be well documented: training methodologies and techniques, the provenance of the data sets used, their scope and main characteristics, how they were obtained and selected, labelling procedures for supervised learning, and data cleaning methodologies such as outlier detection. This requires careful documentation practices throughout the development of your AI system.

Remember that in AI system development, data governance, data relevance, and security measures are essential. Consider enlisting expert help to ensure you meet all data handling and governance requirements.

Gist 2

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. Techniques that do not require labelled input data such as unsupervised learning and reinforcement learning shall be developed on the basis of data sets such as for testing and verification that meet the quality criteria referred to in paragraphs 2 to 5. (Article 10)

This suggests that, regardless of the learning technique employed, the AI system you’re developing must utilize data sets that adhere to established quality criteria, provided these are technically feasible in the context of your sector or application.

Training datasets, and where they are used, validation and testing datasets, including the labels, shall be relevant, sufficiently representative, appropriately vetted for errors and be as complete as possible in view of the intended purpose. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used. (Article 10)

The data sets you plan to use for training, validation, and testing must fit the purpose of the AI system and be as complete as possible. They should be vetted for errors, avoid sampling bias, and be statistically representative of the persons and groups your AI system will be used on.
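One concrete way to check the "appropriate statistical properties" requirement is to compare each group's share in the training data against its share in the workforce the system will actually be used on. The sketch below is an assumption-laden illustration (the function name, the 5% tolerance, and the group labels are all invented for this example), not a method the Act prescribes.

```python
from collections import Counter

def underrepresented_groups(train_groups, workforce_shares, tolerance=0.05):
    """Return groups whose share in the training data falls short of their
    workforce share by more than `tolerance` (absolute difference).

    train_groups: list of group labels, one per training record.
    workforce_shares: mapping of group label -> expected share (0..1).
    """
    counts = Counter(train_groups)
    total = len(train_groups)
    flagged = []
    for group, expected in workforce_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if expected - observed > tolerance:
            flagged.append(group)
    return flagged
```

If, say, sales staff make up 40% of the workforce but only 20% of the training data, this check would flag the group so you can collect more data or reweight before training.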

(4. Employment, workers management and access to self-employment: (a) AI systems intended to be used for recruitment or selection of natural persons, notably for placing targeted job advertisements, screening or filtering applications, evaluating candidates in the course of interviews or tests; (b) AI systems intended to be used to make or materially influence decisions affecting the initiation, promotion and termination of work-related contractual relationships, task allocation based on individual behaviour or personal traits or characteristics, or for monitoring and evaluating performance and behavior of persons in such relationships;) (Annex III: 4a,b)

Your AI system, which aims to help employers monitor and manage workload, might be considered "high-risk" under the classification of systems used in employment and workers management. In particular, it appears to fit the stipulated use cases of influencing decisions on task allocation based on behavior or personal traits, and of monitoring and evaluating the performance and behavior of individuals.

Deployers shall monitor the operation of the high-risk AI system on the basis of the instructions of use and when relevant, inform providers in accordance with Article 61. (Article 29)

As a deployer of the AI system, you hold the responsibility of monitoring its operation based on its usage instructions. If relevant, you need to relay this information to the system’s providers.

Providers shall ensure that AI systems intended to interact with natural persons are designed and developed in such a way that the AI system, the provider itself or the user informs the natural person exposed to an AI system that they are interacting with an AI system in a timely, clear and intelligible manner, unless this is obvious from the circumstances and the context of use. (Article 52)

Lastly, be aware that if your AI system interacts with humans (such as employees), you must ensure transparent communication. It should be clear and apparent that they are interacting with an AI system, unless it’s obvious in the context.

To sum up, while the EU AI Act doesn’t precisely mention what kinds of data can be used for the development of such systems, it emphasizes quality, representativeness, and the appropriate statistical properties of the training data. Furthermore, your AI system appears to fall under the classification of “high-risk” systems, and thus must adhere to additional requirements, like transparency and operational monitoring. Your choice of data should also respect other relevant EU data protection and privacy laws.

Gist 3

Access to data of high quality plays a vital role in providing structure and in ensuring the performance of many AI systems, especially when techniques involving the training of models are used, with a view to ensure that the high-risk AI system performs as intended and safely and it does not become a source of discrimination prohibited by Union law. (Recital 44)

The EU AI Act emphasizes the importance of high quality data for the effective functioning of AI systems. This means that the data you choose to train, validate, and test your AI system intended for workload management should meet a certain standard of quality, relevance, and fairness.

High quality training, validation and testing data sets require the implementation of appropriate data governance and management practices. Training, and where applicable, validation and testing data sets, including the labels, should be sufficiently relevant, representative, appropriately vetted for errors and as complete as possible in view of the intended purpose of the system. (Recital 44)

Considering this, the data sets you choose for training, validation, and testing should not only be of high quality, but also relevant and representative of the workload and tasks in the organization. In addition, the data must be appropriately vetted for errors and as complete as possible.

They should also have the appropriate statistical properties, including as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used, with specific attention to the mitigation of possible biases in the datasets, that might lead to risks to fundamental rights or discriminatory outcomes for the persons affected by the high-risk AI system. (Recital 44)

The EU AI Act also requires that the data represent the diversity of the individuals that might use or be affected by your AI system to avoid any biases that might lead to discrimination. Since your AI system is meant for an organization, it is necessary to consider the diversity of roles, seniority levels, departments, and possibly demographic factors of the employees.

In particular, training, validation and testing data sets should take into account, to the extent required in the light of their intended purpose, the features, characteristics or elements that are particular to the specific geographical, contextual, behavioural or functional setting or context within which the AI system is intended to be used. (Recital 44)

This implies that the data should be context-specific to the usage scenario. If your AI system is designed for a specific organization or type of organization, the data should capture the peculiarities of this workplace context.

the nature of data likely or intended to be processed by the system and, in the case of personal data, the categories of natural persons and groups likely or intended to be affected; (Annex IV)

Annex IV does not grant permission to process any particular data; it requires you to document the nature of the data your system is likely or intended to process. If personal data is involved, you must identify the categories of natural persons or groups likely or intended to be affected.

where relevant, the data requirements in terms of datasheets describing the training methodologies and techniques and the training data sets used, including information about the provenance of those data sets, their scope and main characteristics; how the data was obtained and selected; labelling procedures (e.g. for supervised learning), data cleaning methodologies (e.g. outliers detection); (Annex IV)

Any data you use for training, validating, and testing your AI system must be well-documented. You must describe the datasets, including details about their origin, characteristics, and how they were selected, labeled and cleaned.
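The documentation items Annex IV lists can be collected into a structured "datasheet" record. The field names below are one possible shape, chosen to mirror the items the Annex enumerates; they are illustrative assumptions, not a format the Act prescribes.

```python
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    """Illustrative per-dataset documentation record (names are assumptions)."""
    dataset_name: str
    provenance: str            # where the data came from and on what basis
    scope: str                 # what the data covers (period, teams, systems)
    selection_criteria: str    # how records were chosen for inclusion
    labelling_procedure: str   # e.g. annotation guidelines for supervised learning
    cleaning_methodology: str  # e.g. outlier detection method and thresholds

def to_record(sheet: Datasheet) -> dict:
    """Serialise the datasheet for inclusion in the technical documentation."""
    return asdict(sheet)
```

Maintaining one such record per training, validation, and testing set makes the provenance, scope, labelling, and cleaning information available on request.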

the validation and testing procedures used, including information about the validation and testing data used and their main characteristics; metrics used to measure accuracy, robustness and compliance with other relevant requirements set out in Title III, Chapter 2 as well as potentially discriminatory impacts; test logs and all test reports dated and signed by the responsible persons, including with regard to pre-determined changes as referred to under point (f). (Annex IV)

Test logs and reports detailing the validation and testing procedures, the metrics used to measure the system’s accuracy and robustness, and any potential discriminatory impacts must also be included. These documents should be signed and dated by those responsible.

In conclusion, the AI Act emphasizes the need for high quality, relevant, unbiased, and context-specific data for developing an AI system. There is also an emphasis on transparent handling of data, rigorous documentation processes, and ensuring the protection of individuals that may be affected by the system.

Gist 4

High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5 as far as this is technically feasible according to the specific market segment or scope of application. (Article 10)

It’s important for your AI system, designed to monitor and manage workload in an organisation, to use training, validation, and testing datasets of high quality. Considering that your AI system could be classified as high-risk (like AI systems used for recruitment or for making decisions affecting work-related contractual relationships mentioned in Annex III), it’s critical to comply with these quality criteria.

Training, validation and testing data sets shall be subject to data governance appropriate for the context of use as well as the intended purpose of the AI system. Those measures shall concern in particular, relevant design choices, transparency as regards the original purpose of data collection, data collection processes, data preparation processing operations, the formulation of assumptions and examination in view of possible biases. (Article 10)

The data governance procedures for your AI system should align with its context of use and intended purpose. You must be transparent about why the data was originally collected, how it was collected and processed, and the assumptions made along the way. Possible biases must be examined and mitigated.

Where the provider cannot comply with the obligations laid down in this Article because that provider does not have access to the data and the data is held exclusively by the deployer, the deployer may, on the basis of a contract, be made responsible for any infringement of this Article. (Article 10)

If you, as the AI system's provider, cannot access the needed data because it is held exclusively by the organisation deploying your system, a contract may make that deployer responsible for any infringement of Article 10.

In the AI regulatory sandbox personal data lawfully collected for other purposes may be processed solely for the purposes of developing and testing certain AI systems in the sandbox when all of the following conditions are met. (Article 54)

Article 54 introduces AI regulatory sandboxes, which allow for the lawful processing of personal data gathered for other purposes if it’s used exclusively for developing and testing AI systems and all prescribed conditions are met.

Member States shall promote research and development of AI solutions which support socially and environmentally beneficial outcomes, including but not limited to development of AI-based solutions to increase accessibility for persons with disabilities, tackle socio-economic inequalities, and meet sustainability and environmental targets, by: providing relevant projects with priority access to the AI regulatory sandboxes to the extent that they fulfil the eligibility conditions. (Article 54a)

Member States encourage the development of AI systems that pursue social and environmental benefits. AI systems addressing socio-economic disparities or contributing towards sustainability goals may qualify for preferential access to the AI regulatory sandboxes.

Please always keep in mind that the AI Act defines strict rules on data governance, enforcing high-quality datasets use, proper data handling, and robust security measures. Compliance with the principles of transparency and eradication of biases is essential. Furthermore, if your AI system processes personal data, do not overlook EU data protection laws, such as the General Data Protection Regulation (GDPR).