A taskforce of Data Protection Authorities (DPAs) from EU Member States, set up by the European Data Protection Board (EDPB) about a year and a half ago, published in May of this year a report summing up their common understanding of how ChatGPT, the popular sensation among Generative AI (GenAI) tools of the Large Language Model (LLM) variety, should meet the requirements of the General Data Protection Regulation (GDPR) when it processes personal data as part of its training and delivery of service.

The Taskforce was created in April last year, following the headline-grabbing order for the suspension of ChatGPT issued by the Italian DPA (the Garante). Its purpose is “to foster cooperation and exchange information on possible enforcement actions conducted by data protection authorities”, and the Report makes reference to several “ongoing investigations”, without mentioning how many or in which jurisdictions.

The DPAs left their points of disagreement out of the Report, and judging by its brevity, those disagreements may be significant. The short Report gives insight into how DPAs could, if they wanted to, push developers of GenAI tools to respect requirements related to lawfulness, transparency, fairness, accuracy and the rights of the “data subject” (the five key GDPR requirements tackled in the Report) in relation to how they scrape or otherwise collect, use, re-purpose, apply transformers to, retain, modify and create personal data.

While the Report does not represent an agreed-upon opinion of the EDPB, and it is merely an interim document (especially since none of the national investigations into ChatGPT initiated in early 2023 has yet been completed!), it does include some valuable nuggets showing how DPAs are assessing the intrinsic tensions between GenAI (in particular LLMs) and data protection law.

For instance, it includes a list of the stages in the lifecycle of LLMs where the DPAs agree processing of personal data occurs. It also suggests what type of measures could allow the use of “legitimate interests” as a lawful ground for processing publicly available personal data for training purposes. And it positions the data protection by design obligation in the GDPR as the keystone of respecting the rights of people with regard to the processing of their personal data occurring in the context of the “ChatGPT software infrastructure” (the revealing name used for the service by the regulators).

As DPAs are expected to start publishing their individual decisions in the upcoming months, this analysis breaks down what to expect from the clash between the GDPR and LLMs.

  1. The Report summarizes commonly agreed views among the DPAs members of the ChatGPT Taskforce

It is relevant to highlight from the outset that this Report is not an EDPB Opinion, Guidelines or Decision. This means its content has not gone through the adoption procedure that official EDPB documents go through. It has no binding nature, and the other DPAs members of the EDPB are not bound by its findings. In fact, not even the DPAs members of the Taskforce are bound by its findings, as the disclaimer at p. 3 points out that “the positions presented in this document do not prejudge the analysis that will have to be made by the Supervisory Authorities in each investigation respectively.”

The same disclaimer also clarifies that the positions expressed in “this document” represent “the common denominator agreed” by DPAs in their interpretation of the GDPR “in relation to the matters that are in the scope of their investigations.” This means there were likely points of disagreement which did not make their way into this Report, but which will be visible in the enforcement decisions of the individual DPAs currently handling complaints against ChatGPT. Interestingly enough, the EDPB did not publicly communicate which DPAs are participating in this taskforce (neither in the original press release about setting it up, nor in this Report).

The Report is nonetheless significant, particularly because it articulates the points that participating DPAs agree on when it comes to how the GDPR applies to LLMs. 

1.1. While each DPA can reach its own decision independently, they may also have to forward proceedings to the Irish DPC

Since OpenAI, the company which launched ChatGPT, did not have an establishment in the EU until February 15, 2024, when it opened an office in Ireland, all EU DPAs can claim competence over whether the processing of personal data of individuals living in their jurisdiction, as carried out by ChatGPT up until that date, complies with the GDPR.

Judging from public communications elsewhere, there are at least three ongoing investigations into how ChatGPT complies with the GDPR: in Italy (whose DPA already issued a preliminary finding in January 2024 that the GDPR has been breached, without yet publishing its final decision), Austria and Poland. Acting with independence, the DPAs in these countries can all reach their individual decisions without any obligation to coordinate their assessments. However, a footnote of the Report citing EDPB Opinion 8/2019 indicates that if infringements of a continuous nature are identified, meaning infringements beginning before the creation of a main establishment in the EU and continuing after its creation, OpenAI can benefit from the One Stop Shop, “and every pending proceeding should be transferred” to the lead DPA (footnote 8). In this case, the lead DPA would be the Irish DPC.

  2. The Taskforce identifies five stages when LLMs process personal data

The Report contains little to no detail on what exactly is under investigation, or on the number and type of investigations – for instance, whether they follow complaints or are a result of own-motion activity. It merely mentions that “several supervisory authorities have initiated data protection investigations” (para 2) and only specifies that the investigations concern OpenAI as “controller for processing operations carried out in the context of the ChatGPT service” (para 2).

The only technical exploration of the object of the investigations is the definition of LLMs hidden under Footnote 1: the DPAs agreed that LLMs are “deep learning models (a subset of machine learning models) that are pre-trained using vast amounts of data. Analyzing these massive datasets enables the LLM to learn probability relationships and become proficient in the grammar and syntax of one or more languages. LLMs generate coherent and context-relevant language. To put it simply, LLMs respond to human language by producing coherent text that appears human-like. Most recent LLMs such as OpenAI’s GPT models are based on a neural network architecture called a transformer model”. 

The existence of “processing of personal data” is paramount – in its absence, the GDPR and its suite of safeguards, obligations and rights are not applicable. If you have read a bit about how Large Language Models work, you may have wondered whether word vectors or word embeddings are personal data sometimes, always, or never. And if they are, under what conditions? The Taskforce merely states as a fact that “LLMs (Large Language Models – n.) are trained and enhanced using a huge amount of data, including personal data” (para 1). But this is indeed enough to bring LLMs within the GDPR realm.
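To make the embeddings question concrete, here is a minimal, hypothetical sketch of how a name ends up as a word embedding, the numerical form in which an LLM “sees” text. The vocabulary, vector size and values are all invented for illustration and correspond to no real model:

```python
import numpy as np

# Toy illustration only: real LLMs learn embeddings over tens of
# thousands of sub-word tokens; everything here is invented.
rng = np.random.default_rng(seed=42)
vocab = {"the": 0, "email": 1, "of": 2, "jane": 3, "doe": 4}
embedding_dim = 8
# One learned vector per token; training continually nudges these numbers.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(text: str) -> np.ndarray:
    """Map each known token of the input to its dense vector."""
    tokens = text.lower().split()
    return np.stack([embedding_table[vocab[t]] for t in tokens if t in vocab])

# The name "Jane Doe" is now two rows of floating-point numbers inside
# the model. Whether those rows remain "personal data" under the GDPR
# is exactly the open question the Report leaves untouched.
print(embed("the email of jane doe").shape)  # (5, 8)
```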

While there is no analysis of what constitutes “personal data” in the amalgamation of data required for the functioning of ChatGPT, and no analysis of what specifically constitutes “processing” of such data, the DPAs provide important nuance by recognizing that processing of personal data occurs at five different stages in the life cycle of LLMs (para 14). Two of these stages happen prior to training, and none, seemingly, covers what happens in between “prompts” and “output”:

(1) “collection of training data (including the use of web scraping data or reuse of datasets)”;

(2) “pre-processing of the data (including filtering)”;

(3) “training”;

(4) “prompts and ChatGPT output”;

(5) “training ChatGPT with prompts”.

Therefore, the DPAs members of the Taskforce see processing of personal data occurring at five stages within the lifecycle of an LLM. Clearly, collecting data for the purpose of training an LLM, including when this involves web scraping, amounts to at least some processing of personal data. Similarly, the DPAs agreed that pre-processing of the training data, including filtering it, also amounts to “processing of personal data” covered by data protection law, as does the training itself.

Interestingly, the DPAs bundle prompting an LLM and the output received by users pursuant to prompting into the same stage in the lifecycle of an LLM that can include “processing of personal data”. Notably, the DPAs do not specifically refer to what happens in between prompting and output as including processing of personal data. However, they do separate “training ChatGPT with prompts” as a distinct type of processing of personal data.

Training an LLM with prompts is seemingly different from the initial training, since it involves interacting with the model once it has been initially trained, and it is also different from the process occurring in between prompting and output in one instance of use of an LLM. So far, this latter stage seems to be outside the scope of “processing personal data” under the scrutiny of the DPAs members of the Taskforce.

Contrast this succinct approach in the Report with the detailed preliminary analysis published a couple of months later, in July, by the Hamburg DPA (one of the 16 German state-level DPAs), which reached the conclusion that “no personal data is stored in LLMs”; or with the Irish DPC’s guidance on Generative AI, also published in July, which, on the contrary, stated that “some AI models have inherent risks relating to the way in which they respond to inputs or prompts, such as memorisation, which can cause passages of (personal) training data to be unintentionally regurgitated by the product”.

The European Data Protection Supervisor added its own preliminary assessment of the matter in Guidelines published for EU institutions about their use of Generative AI in June, stating briefly that processing of personal data in a Generative AI system occurs “when creating datasets, at the training stage itself, by inferring new or additional information once the model is created and in use, or simply through the inputs and outputs of the system once it is running”. The EDPS also noted that “processing of personal data in a Generative AI system can occur on various levels and stages of its lifecycle, without necessarily being obvious at first sight” (page 7). 

No similar assessments made it into the EDPB taskforce Report, which could indicate either that whether personal data is processed within the model is one of the contentious points at EDPB level, or that the DPAs agreed to focus only on the five stages of processing personal data listed in the Report. This point will need to be resolved soon though, considering the Irish DPC submitted a request for an Article 64(2) GDPR Opinion at the beginning of September, inviting the EDPB to consider, among other things, “the extent to which personal data is processed at various stages of the training and operation of an AI model, including both first party data and the related question of what particular considerations arise, in relation to the assessment of the legal basis being relied upon by the data controller to ground that processing”.

  3. The Report enumerates measures that could allow Legitimate Interests to be used for processing personal data by LLMs’ providers

The Taskforce analyzes lawful grounds for processing as connected with the five stages of processing of personal data identified in the context of LLMs, as explored in Section 2. The DPAs agreed that the five stages enumerated above can be grouped into two different “buckets” of processing activities, each of them warranting its own analysis when it comes to lawful grounds for processing. The first group includes “collection of training data, pre-processing of the data and training”, while the second one includes “input, output and training”. 

Notably, the only lawful ground that the Taskforce analyzes in the short Report for both “buckets” is necessity for the legitimate interests of the controller (Article 6(1)(f) GDPR). This is mainly because OpenAI brought forward this lawful ground for the various processing operations of personal data it controls (paras 16 and 21) as part of the investigations conducted by the members of the Taskforce.

When it comes to “collection of training data, pre-processing of the data and training”, the DPAs focus on “web scraping”. They recall the three-pronged test that controllers need to pass in order to process personal data based on “legitimate interests”: (i) the existence of such an interest, (ii) the necessity of the processing, as the personal data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed, and (iii) the balancing between the legitimate interests of the controller and the fundamental rights and interests of the data subject. The Report recalls that the reasonable expectations of data subjects should be taken into account in this assessment (para 16).

Importantly, where the balancing test shows an imbalance between the two, the controller can put safeguards in place to limit the impact on the rights and interests of data subjects and re-establish an equilibrium that would allow it to process personal data on the basis of Article 6(1)(f) GDPR. Noting that “the assessment of the lawfulness is still subject to pending investigation”, the Taskforce points out some of the measures that can re-balance the scale in favor of the controller (paras 17 and 19; a hypothetical sketch of what these could look like in code follows the list):

  • Defining precise collection criteria;
  • Ensuring that certain data categories are not collected;
  • Ensuring that certain sources (such as public social media profiles) are excluded from data collection;
  • Deleting or anonymising personal data that has been collected via web scraping before the training stage;
  • Filtering sensitive personal data falling under Article 9 GDPR, at collection (for example, selecting criteria for what data is collected) and immediately after collection (deleting data).
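To illustrate what such safeguards could look like in practice, here is a minimal, hypothetical sketch of a scraping pipeline applying some of them. The excluded domains, the crude regular expressions standing in for an Article 9 classifier, and the record structure are all invented for the example; they are not drawn from the Report or from any real system:

```python
import re
from dataclasses import dataclass

# Hypothetical examples of "certain sources" a controller might exclude,
# such as public social media profiles.
EXCLUDED_DOMAINS = {"facebook.com", "instagram.com", "x.com"}

# Crude illustrative stand-ins for an Article 9 special-category filter;
# a real deployment would need far more robust classifiers.
SENSITIVE_PATTERNS = [
    re.compile(r"\b(diagnosed with|my religion is|trade union member)\b", re.I),
]

@dataclass
class ScrapedRecord:
    url: str
    domain: str
    text: str

def passes_collection_criteria(record: ScrapedRecord) -> bool:
    """Apply 'precise collection criteria' before anything enters the corpus."""
    if record.domain in EXCLUDED_DOMAINS:  # excluded sources
        return False
    if any(p.search(record.text) for p in SENSITIVE_PATTERNS):  # Art. 9 filter
        return False
    return True

def build_training_corpus(records: list[ScrapedRecord]) -> list[str]:
    # Records failing the criteria are deleted immediately after
    # collection, before the training stage ever sees them.
    return [r.text for r in records if passes_collection_criteria(r)]
```

The point of the sketch is structural: each agreed measure maps onto a concrete gate sitting between collection and training, which is also where the Report locates the first three stages of processing.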

As for sensitive data, the Taskforce agreed that scraped data can also contain sensitive data regulated by Article 9 GDPR, and that when this happens one of the exceptions of Article 9(2) GDPR must be applicable in addition to a lawful ground for processing under Article 6 GDPR. The Report only specifically refers to the exception under letter (e), concerning sensitive data manifestly made public. The Taskforce notes (para 18) that “the mere fact that personal data is publicly accessible does not imply that the data subject has manifestly made such data public”, and that “it is important to ascertain whether the data subject had intended, explicitly and by a clear affirmative action, to make the personal data in question accessible to the general public”.  

The second “bucket” of processing, consisting of “input, output and training”, is treated more succinctly in the Report. The Taskforce does note that, while relying on “legitimate interests” as lawful ground, OpenAI “provides the option to opt-out of the use of “Content” for training purposes” (para 21), with “Content” being designated by OpenAI as meaning the input of data subjects when interacting with LLMs, file uploads, and user feedback regarding the quality of output (or the responses) of ChatGPT (para 21). Without being explicit, the regulators seem to welcome the existence of the opt-out, but they note in the Report that “data subjects should, in any case, be clearly and demonstrably informed that such Content may be used for training purposes”. They also note that this fact will be considered as part of the “balancing test” for being able to rely on legitimate interests for this stage of the processing (para 22).
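As a rough illustration of what honouring such an opt-out could mean on the provider’s side, here is a minimal, hypothetical sketch. The flag, identifiers and data shapes are invented; how OpenAI actually implements its opt-out is not public:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    user_id: str
    prompt: str    # "Content": user input...
    response: str  # ...and model output / feedback on it

def eligible_for_training(conv: Conversation, opted_out: set[str]) -> bool:
    """'Content' of users who opted out must never reach the training set."""
    return conv.user_id not in opted_out

# Hypothetical per-account opt-out flag, as exposed in account settings.
opted_out_users = {"user-17"}
conversations = [
    Conversation("user-17", "draft my CV", "..."),
    Conversation("user-42", "explain GDPR Art. 6", "..."),
]

training_batch = [c for c in conversations if eligible_for_training(c, opted_out_users)]
assert all(c.user_id != "user-17" for c in training_batch)
```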

  4. “Fairness” in EU Data Protection Law finally meets AI

One of the key and, at the same time, least explored requirements in EU data protection law is that any processing of personal data must be done “fairly”. This requirement has roots not only in Article 5 of the GDPR, but also in Article 8 of the EU Charter of Fundamental Rights, thus having constitutional significance in the legal order of the EU. It will become increasingly pressing in the era of AI, where “fairness” has so far been treated either as a desideratum for machine learning in “ethical computer science” environments and given mathematical formulae, or as a slogan for ethical (and non-binding) “trustworthy AI” governance frameworks.

The Taskforce of regulators dealing with complaints against ChatGPT hints at the increased role that “fairness” will play in such data protection cases by singling out considerations around it in a dedicated section of the Report. They recall that the principle of fairness as laid out in Article 5(1)(a) GDPR is “an overarching principle which requires that personal data should not be processed in a way that is unjustifiably detrimental, unlawfully discriminatory, unexpected or misleading to the data subject” (para 23). Making a reference to the older EDPB Guidelines on Data Protection by Design and by Default, the Taskforce also highlights that “a crucial aspect of fairness is that there should be no risk transfer, meaning that controllers should not transfer the risks of the enterprise to data subjects” (para 23).

In the specific context of ChatGPT, the regulators agreed that “this means that the responsibility for ensuring compliance with GDPR should not be transferred to data subjects, for example by placing a clause in the Terms and Conditions that data subjects are responsible for their chat inputs” (para 24). The Taskforce specifically noted that “if ChatGPT is made available to the public” and if the inputs “become part of the data model and, for example, are shared with anyone asking a specific question, OpenAI remains responsible for complying with the GDPR and should not argue that the input of certain personal data was prohibited in the first place” (para 25). 

  5. Transparency, Accuracy and Data Subject Rights in the context of ChatGPT are all subject to investigation

The Report includes three distinct sections dedicated to transparency, accuracy and the rights of the data subject, each of them with brief notes.

When it comes to transparency obligations related to the personal data scraped from the web for training purposes, the DPAs members of the Taskforce agreed that “it is usually not practicable or possible to inform each data subject about the circumstances. Therefore, the exemption pursuant Article 14(5)(b) GDPR applies, as long as all requirements of this provision are fully met” (para 27). These conditions refer to taking appropriate measures to protect the data subject’s rights and freedoms and legitimate interests, including making publicly available the information that should be disclosed via individual notice. The situation is different in relation to the processing of personal data from prompts, which falls under the Article 13 GDPR transparency obligations, since the personal data is collected directly from users. The DPAs note that it is particularly important to disclose the fact that prompts and other interactions may be used for training purposes (para 28).

In relation to accuracy, the DPAs seem sympathetic to the fact that inaccuracies are in the nature of ChatGPT’s output, noting that “due to the probabilistic nature of the system, the current training approach leads to a model which may also produce biased or made up outputs” (para 30). However, the Taskforce states unequivocally that “in any case, the principle of data accuracy must be complied with” (para 30) and refers to CJEU case-law establishing that “all processing of personal data must comply with the principles related to data quality set out in Article 5 GDPR” (C-496/17, Deutsche Post). In this sense, the Taskforce notes that transparency efforts to inform users about the potential inaccuracy of outputs “are not sufficient to comply with the data accuracy principle” (para 31).
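Why the “probabilistic nature of the system” produces made-up outputs can be shown with a toy sketch: an LLM does not look facts up, it samples the next token from a learned probability distribution, so a plausible-but-false continuation can simply win the draw. The candidate tokens and probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng()

# Invented next-token distribution for a prompt ending in
# "The Italian DPA ordered the suspension of ChatGPT in ...":
candidates = ["2023", "2021", "2019"]
probs = np.array([0.6, 0.3, 0.1])  # the true answer is merely the most likely

# Sampling (rather than always picking the most probable token) buys
# fluency and variety, but in this toy model a "wrong year" is
# generated roughly 4 times out of 10.
for _ in range(5):
    print(rng.choice(candidates, p=probs))
```

Nothing in that mechanism checks outputs against a source of truth, which is why the Taskforce insists that warning users about inaccuracy cannot, by itself, satisfy the accuracy principle.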

Finally, when it comes to Data Subject Rights, such as access, deletion and rectification, the regulators from the Taskforce focus on the existence of various avenues to exercise such rights – through account settings or via email requests. The right to rectification is particularly called out when the Taskforce notes that the controller should “continue improving the modalities provided for facilitating” the exercise of data subject rights, noting that “at least for the time being, OpenAI suggests users to shift from rectification to erasure when rectification is not feasible due to the technical complexity of ChatGPT” (para 35). 

  6. The Data Protection by Design and by Default obligation shapes up as the Keystone in the GDPR analysis of LLMs

Interestingly, immediately after the point above about technical complexity being invoked against achieving rectification, the Report adds that Article 25(1) GDPR on Data Protection by Design and by Default requires controllers, both at the time of the determination of the means for processing and at the time of the processing itself, to “implement appropriate measures designed to implement data protection principles in an effective manner and to integrate the necessary safeguards into the processing in order to meet the requirements of the GDPR and protect the rights of the data subjects” (para 35).

This is the last paragraph of the Report, which closes a loop opened at the beginning of the whole analysis, back in the Introduction (para 7), where the Taskforce notes that controllers are responsible and accountable to ensure compliance with all requirements of the GDPR, and that “in particular, technical impossibility cannot be invoked to justify non-compliance with these requirements, especially considering that the principle of data protection by design set out in Article 25(1) GDPR shall be taken into account at the time of the determination of the means for processing and at the time of the processing itself”.

This may mean that the obligation of Data Protection by Design can play a significant role as a bridging element between the technical complexity of LLMs, the accountability principle and the imperatives of valuing the rights of individuals with regard to the processing of their personal data in the EU.

  7. The detailed Questionnaire sent to OpenAI shows just how relevant the whole GDPR is for providers of LLMs that want to provide services in the EU

Finally, the Report includes an Annex that provides a model questionnaire the Taskforce created for the benefit of other DPAs which receive complaints related to ChatGPT or other LLMs and wish to send a questionnaire to controllers as part of their fact-finding. A close look at the Questionnaire reveals just how relevant the elements of GDPR compliance are for processing of personal data in the context of LLMs.

The Questionnaire is divided into seven parts:

(I) A General section, inquiring about a description of the ChatGPT service, the existence of a DPO, and a copy of the record of processing activities;

(II) A section on Principles Relating to the Processing of Personal Data, which asks whether a data protection management system has been implemented and whether regular internal and/or external audits are carried out, and asks OpenAI to describe the different purposes for which they process personal data in the context of the “ChatGPT software infrastructure”, or to describe how they measure the accuracy of the personal data used for language model training, testing and validation purposes, among other questions.

(III) There is a detailed section on Data Protection Impact Assessment and Risk Management, with multiple questions, such as: whether a DPIA was conducted, or, if not, an explanation of why the conditions for conducting a DPIA do not apply to the processing of personal data carried out by ChatGPT; whether the DPO was involved in carrying out the DPIA, if one exists, and if so, what assessment the DPO provided for the DPIA; and the risk analysis and mitigation measures for the risks identified as part of the DPIA.

(IV) A section on Lawfulness of Processing, also very detailed, homing in on questions related to three possible lawful grounds – consent, necessity to enter a contract and legitimate interests – each of them with detailed sub-questions to clarify the reasoning behind why they were used, if they were used, in addition to describing all the different sources from which personal data are collected.

(V) Another section deals with the Rights of the Data Subject and Transparency, where one question among many is whether OpenAI can provide an internal policy dedicated to the handling of data subject requests, and another is whether automated individual decision-making, including profiling, takes place.

(VI) A section is dedicated to Transfers of personal data to third countries or international organizations, and drills into establishing the existence of data transfers from the EU and the transfer mechanisms used.

(VII) Finally, a last section is dedicated to Disclosure of personal data to other parties, and it includes questions about the existence of joint controllers, processors and third parties, with the request to take into consideration “the integration of the ChatGPT software infrastructure into other products, such as (but not limited to) search engines”.

Going through the whole questionnaire should remind providers of LLMs how close GDPR rules are to their data collection and use.  

  8. What now?

By the time of publication of this analysis of the ChatGPT Taskforce Report, which comes almost five months after the Report was published, there are still no final or public decisions of EU DPAs in any of the ChatGPT investigations that prompted the creation of the Taskforce a year and a half ago. It is impossible to tell when the decisions will start being published, just as it is impossible to tell how many of them will be transferred to the Irish DPC as lead authority, should the investigations lean towards finding infringements of a continuous nature (see Section 1.1 above). Regardless, the investigations cannot remain open forever, and the very close look at this Taskforce Report is the closest we can get to the current thinking of EU DPAs and how they intend to apply the GDPR to processing of personal data in the context of Generative AI, and LLMs in particular.

What is certain is that questions of EU data protection law go deep into how Large Language Models involve personal data at various stages in their life cycle, and such questions will not go away; they will only become more acute.
