Background Information
I. System Summary
ChatGPT is an Artificial Intelligence (AI) powered Large Language Model (LLM), which uses predictive text generation and reinforcement learning to respond to human input in a manner resembling a human. The technology utilizes parameters, which are "variables in an AI system whose values are adjusted during training to establish how input data get transformed into the desired output". ChatGPT is developed to be an Artificial General Intelligence (AGI) model, or a model designed to mimic human cognitive abilities in breadth of knowledge and communication abilities. This is as opposed to specialized AI, which "is designed to accomplish one relatively predictable task repeatedly".
While the free version of ChatGPT (3.5) is currently restricted to text inputs only, ChatGPT 4.0 allows users to submit queries through text, image, or voice prompts. ChatGPT is produced by OpenAI, a for-profit company based in San Francisco, California.
II. Scope of PIA
This Privacy Impact Assessment (PIA) analyzes ChatGPT version 4.0, which is, at time of writing, only offered to paid ChatGPT subscribers. ChatGPT 4.0 offers a range of features beyond what is available in ChatGPT 3.5, and opens the door to a multitude of additional privacy concerns as a result. With that understanding, many of the privacy concerns (specifically concerns revolving around text input) addressed in this document are applicable to version 3.5 as well. The Enterprise version of ChatGPT was not tested in this analysis.
The focus of this PIA is on privacy concerns of users within the United States, and any policy considerations addressed are based on U.S. policy. Again, many of the privacy concerns addressed are applicable to users in other countries. Additionally, this PIA provides a thorough analysis of ChatGPT under the General Data Protection Regulation (GDPR), the European Union's comprehensive data privacy rights act, which has shaped the adoption of comprehensive data privacy rights acts outside of European countries.
III. Methodology
The production of this PIA is based on:
- Direct testing of ChatGPT;
- The OpenAI Privacy Policy;
- An analysis of ChatGPT system architecture and data storage; and
- Secondary research sources.
It is important to state the limitations of conducting this research. Most notably, this PIA was conducted without any insider knowledge from ChatGPT, and was not based on direct access to any developmental or restricted version of ChatGPT.
Data Sources and Flow
I. Personally Identifiable Information Collected
Given the nature of ChatGPT, and its ability to collect broad user input, ChatGPT is capable of collecting and retaining a multitude of types of Personally Identifiable Information (PII). However, very little PII is required for the creation of an account.
When creating an account, a user can register with their email address, Google account, Microsoft account, or Apple account. Users must provide an 8-character password, verify their email address, provide their first/last name, and verify their phone number with a 6-digit 2-factor authentication code. To upgrade to ChatGPT 4.0, users will also be prompted to provide standard billing information. Information collected includes:
- Email address (required)
- 8-character password (required)
- Date of birth (required)
- First name (required)
- Last name (required)
- Phone number with verification (required)
- Billing address (required for version 4.0 only)
- Credit card information (required for version 4.0 only)
Users are also required to provide their date of birth as part of the sign-up process, which is used to ensure that users are old enough to have an account. OpenAI states that ChatGPT is not meant for those under the age of 13, and parental consent is required for users age 13 to 18.
II. Sources of Data Collection
Outside of the data collected in the account creation process, ChatGPT receives information from the following types of user input:
- Text prompts;
- Media files; and
- Voice prompts.
Additionally, ChatGPT collects information regarding usage data, including standard analytics and log files.
III. Sharing and Sale of Data
The OpenAI privacy policy states: We don't "sell" Personal Information or "share" Personal Information for cross-contextual behavioral advertising (as those terms are defined under applicable local law). We also don't process Personal Information for the purposes of inferring characteristics about a consumer.
Cross-contextual behavioral advertising is defined under the California Privacy Rights Act as "the targeting of advertising to a consumer based on the consumer's personal information obtained from a consumer's activity across businesses, distinctly-branded websites, applications, or services, other than the business, distinctly-branded website, application, or service with which the consumer intentionally interacts". This phrase is defined differently in other local laws, but generally refers to the utilization of PII to target consumer advertisements based on aggregated data from multiple businesses or web systems.
It is unclear what OpenAI constitutes as "Personal Information", a term which they failed to clearly define in their privacy policy. While OpenAI says they will not sell personal information, the privacy policy does not state that they won't sell or share information which they do not constitute as "Personal Information". This is a significant ambiguity for users attempting to understand what protections apply to their data.
It is also worthwhile to note that it is unclear what OpenAI constitutes as 'Personal Information', a term which they failed to clearly define in their privacy policy. They do outline that personal information broadly includes account information, user content, communication information, social media information, log data, usage data, device information, cookies and analytics. While OpenAI says they will not sell personal information, the privacy policy does not state that they won't sell or share information which they do not constitute as 'Personal Information'.
The privacy policy acknowledges that data may be disclosed to:
- Vendors and service providers;
- Those involved in business transfers;
- Government authorities, industry peers, or other third parties (as required by law); and
- Affiliates.
It is worth noting that ChatGPT is hosted on the Microsoft Azure cloud server, making Microsoft a vendor of ChatGPT. Additionally, Microsoft has invested $13 billion in OpenAI, and ChatGPT has already begun integration with various Microsoft products [4]. The current version of the OpenAI privacy policy allows ChatGPT to share data with vendors, which would include Microsoft. The privacy policy does state that "these parties will access, process, or store Personal Information only in the course of performing their duties to us". However, the policy does not outline what those duties are, or whether information which OpenAI does not constitute as 'Personal Information' can be accessed, processed, or stored in a manner outside of what is necessary for performing duties to OpenAI.
Additionally, OpenAI left the definition of affiliates very broad and inclusive, while also failing to acknowledge any specific affiliates. The official language from the privacy policy states: "We may disclose Personal Information to our affiliates, meaning an entity that controls, is controlled by, or is under common control with OpenAI. Our affiliates may use the Personal Information we share in a manner consistent with this Privacy Policy."
The privacy policy also discusses the possibility of data aggregation and de-identification of Personal Information; however, it does not provide information on the specific techniques or technologies used to do so. It is unclear how this data may be sold or shared once it is aggregated or de-identified.
IV. Use of Personal Information
Personal information is used by ChatGPT in two primary ways:
- Responding to user input; and
- Improving the product.
In regard to the first, there are cases in which personal information is necessary to provide accurate answers to the user input, which may take the form of text, a media file, or voice input. For example, if a user asks ChatGPT questions about a medical concern they have, ChatGPT may need certain demographic information to answer the question accurately.
In regard to the second, it is important to consider the role of reinforcement learning in the context of AI models. ChatGPT utilizes user feedback (positive or negative) to "teach" the model which of its responses are correct, and which ones are inaccurate or need revision. One example of this would be a user who asks the same question to ChatGPT in slightly different ways due to the failure of ChatGPT to answer the question the first time. ChatGPT can recognize that it failed to answer the question correctly the first time since the user repeated a variation of the question. In this case, the user's data is used for assisting in reinforcement learning (the improving of the ChatGPT product), even without explicit consent from the user. The user does have the option of excluding their data from being used for model training in their ChatGPT privacy settings.
V. Data Security and Storage
ChatGPT utilizes OpenAI servers for the storage of data, and uses end-to-end encryption for data in transit, in addition to storing sensitive information (such as passwords) in an encrypted format. ChatGPT has had one major breach, with approximately 1.2% of ChatGPT Plus subscribers having their information leaked; however, this was a result of a bug in an open-source library used by OpenAI (Redis), and not a result of poor data storage [6]. There have been concerns about how injection prompts could be utilized to inject malicious SQL code into the ChatGPT servers, but OpenAI has taken efforts to mitigate this risk. ChatGPT does face an additional risk resulting from reliance on a significant number of packages and open-source libraries, which could experience vulnerabilities or breaches.
Privacy Concerns
I. Concerns from Prompts
ChatGPT, like all LLMs, is vulnerable to injection-based prompts, which can pose serious privacy concerns. Through prompts, it is easy to get ChatGPT to ignore the safeguards it was programmed with, leading to possible breaches of privacy, as well as other concerns. Overriding these programmed safeguards is not difficult.
ChatGPT, like all LLMs, is vulnerable to injection-based prompts, which can pose serious privacy concerns. ChatGPT is also programmed not to answer certain prompts, which OpenAI deems unethical or concerning from a privacy or safety standpoint. However, ChatGPT is also relatively easy to trick, and overriding these safeguards is not difficult. Through prompts, it is easy to get ChatGPT to ignore the safeguards it was programmed with, leading to possible breaches of privacy, as well as other concerns.
An additional concern results from the uploading of images to ChatGPT, which could provide concerns about the ability of ChatGPT to infer sensitive information about you, such as location data. For example, ChatGPT can infer a user's location from a photo of a well-known landmark, and has demonstrated the ability to recognize city skylines with no context provided. The ability of ChatGPT to infer information from images may lead to privacy concerns that users are unaware of. For example, a user could upload an image to ChatGPT with a skyline in the background, having no idea that ChatGPT can infer that information from an analysis of the buildings and highway system.
II. Image/Media Input
As seen in the example provided above, image and media inputs can pose significant concerns. Additionally, images provide a privacy concern given that the metadata from the image can be easily stored and read by ChatGPT.
III. Underage Use
ChatGPT is not intended to be used by those under the age of 13, and requires parental consent for those between the age of 13 and 18. However, testing found that ChatGPT required no age verification (such as through a government-issued ID) and, more concerningly, knowingly chatted with a user it believed to be underage, even providing tailored book recommendations "suitable for young readers".
As noted previously in this document, ChatGPT is not intended to be used by those under the age of 13 at all, and requires parental consent for those between the age of 13 and 18. However, it is extremely easy for those under this age to use ChatGPT anyway. When creating a ChatGPT account, our testing found that ChatGPT required no age verification (such as through a government-issued ID). Additionally, not only did ChatGPT knowingly chat with a user it believed to be underage, but it provided book recommendations "suitable for young readers", providing tailored recommendations to an underage user.
IV. Voice Input
As of September 2023, ChatGPT 4.0 can now accept and respond to voice inputs [7], which can lead to various additional privacy concerns including the possibility of deepfakes and ChatGPT being able to infer information about you. For example, based on your voice input, ChatGPT could infer your gender, age, race, and other demographic information.
V. Links to Chat
ChatGPT 4.0 allows users to share a link to their chats, which could lead to their chat being indexed on the web. While there are no known examples of ChatGPT indexing private chats on the web, it is worth noting that this has happened with Google's Bard (a competitor of ChatGPT) [8].
VI. Human Review
According to their website, ChatGPT utilizes human reviewers for the following four reasons:
- Investigating abuse or a security incident;
- Providing support when a user reaches out for account-related questions;
- Handling legal matters; and
- Improving model performance.
It is worth noting that users do have the option of opting out of having their data used for improving model performance [9].
GDPR Analysis and Rights of Users
The General Data Protection Regulation (GDPR) is a piece of legislation from the European Union, which is designed to provide comprehensive data privacy. Some variation of the GDPR is used in many countries around the globe, and U.S.-based companies place a strong emphasis on maintaining compliance in order to send and receive data from the countries which do operate under GDPR (particularly European countries).
GDPR contains many key requirements, including limitations on the data collected and a requirement for lawful, fair, and transparent processing [10]. Additionally, there are eight data subject rights which are outlined in the GDPR, which will be the focus of this analysis.
The following eight rights are analyzed in turn below, with an assessment of how ChatGPT 4.0 handles each right under the GDPR framework.
1) The right to be informed: ChatGPT communicates information about the collection and sharing of data in the OpenAI privacy policy. ChatGPT also communicates information about data management on their blog. One requirement of the GDPR is that this information is concise and written in plain language. According to ChatGPT, the current OpenAI privacy policy is written at a college level and is fairly long, which likely does not meet this requirement of GDPR.
2) The right of access: GDPR outlines the right of individuals to request Data Subject Access Requests (DSARs), in which case the organization has one month to provide this data (with some exceptions). OpenAI does allow individuals to submit this form of request, but does not state how quickly they will respond to these requests.
3) The right to rectification: As in the right of access, users may email OpenAI requesting that their information is updated in the event that it is not accurate. Once again, OpenAI makes no promise regarding the timeline (or likelihood) of such a request being honored.
4) The right to erasure: Under GDPR, there are certain scenarios in which the user can request the erasure of their data. If a user of ChatGPT wished to delete data beyond what can be deleted from within the ChatGPT privacy settings, they would need to file a request with the OpenAI privacy department via email.
5) The right to restrict processing: Under GDPR, users have the right to limit the way that organizations use their information as an alternative to deletion, in the event that the organization must retain the information for the establishment, exercising, or defense of a legal claim. Based on all available data from ChatGPT, there is no formal process for this beyond an email request, as with the previous three rights.
6) The right to data portability: This right allows users to use personal data they have provided to a data controller through a contract or consent across different services, and this is permissible for users of ChatGPT. For users who wish to obtain data beyond what is available in their ChatGPT account, they must submit an email request to OpenAI.
7) The right to object: Under GDPR, individuals have the ability to stop organizations from processing information for which there is no demonstrated reason for processing that supersedes the interests of the individual. Based on all currently available information, there is no process for users of ChatGPT to exercise this right.
8) The right to opt out of automated decision making, including profiling: An individual who feels that an organization has not followed the extensive rules about the use of automated decision making may request a human review of the processing under the GDPR. As in the right to object, there is currently no evidence suggesting a formalized process for the exercising of this right at ChatGPT. It is important to note that organizations may utilize ChatGPT for the purpose of automated decision making or profiling; however, this would likely be done under the Enterprise version of ChatGPT, and not the consumer version analyzed in this privacy impact assessment.
Summary of Risks and Mitigation Recommendations
The primary privacy risks of using ChatGPT 4.0 can be categorized into the following:
- Risks from data storage and unauthorized access;
- Risks from failure to follow safeguards;
- Risks from human reviewers;
- Risks from the sharing of data; and
- Risks from media, voice, and image-based interactions.
1. Risks from Data Storage and Unauthorized Access
Given that ChatGPT is a cloud-based system, a significant amount of data is stored on remote servers. This can ultimately lead to privacy concerns resulting from unauthorized access, particularly in the form of a data breach. Since users of ChatGPT do not have to provide a large amount of data to create an account, it is likely that there are many users who are not aware of just how much information ChatGPT has about them. For example, users may provide sensitive information in the chat, which they are unaware is being stored on OpenAI servers. OpenAI does give ChatGPT users the ability to delete this data; however, it is stored by default and many users may be unaware of this ability.
ChatGPT fails to provide much information about the steps they are taking to ensure data security. Specifically, users should be aware of what data is collected in a raw text format versus an anonymized or redacted format, and the physical location of the servers their data will be stored on. Users should also be more explicitly informed of their right to delete their data, and their right to not have their data used in the training of the ChatGPT model. While having data used in the training process is currently an opt-out feature, it would be better to offer to users as an opt-in feature from a privacy standpoint.
2. Risks from Failure to Follow Safeguards
Based on the sample interactions provided, and many more through extensive testing, it is clear that ChatGPT often fails to follow the safeguards it was programmed with. This causes privacy concerns, as it is difficult (if not impossible) for the user to verify that ChatGPT is following the privacy safeguards it was programmed with. While ChatGPT may state that it is not storing your sensitive information, this is difficult to prove. Additionally, it is particularly concerning that ChatGPT allows underage users to utilize the platform, given the additional privacy concerns of this system for children.
Mitigation recommendations to address these risks would include:
- Requiring a government ID at sign-up to validate user age;
- Building stronger controls in the chat interface to ensure underage users cannot utilize the platform; and
- Giving users an easy way to report cases where safeguards were not followed.
3. Risks from Human Reviewers
The risks posed by human reviewers are concerning to many users. Again, a large part of this risk comes down to a lack of transparency from OpenAI on who these reviewers are, where they are located, what the purpose of the review is (particularly for reviewers focused on model training), and what level of anonymization the information undergoes prior to human review. It is also unclear if privacy requests (such as the obtaining of personal information) are processed manually with human reviewers or through automated systems. However, if these requests are processed through the use of human reviewers, this is an additional privacy concern.
Similar to the first risk category, ChatGPT providing additional information would be beneficial for users to be able to make informed decisions about which rights they wish to exercise, which data they want to provide to ChatGPT through chat, and whether they want to use the platform in the first place. Users of ChatGPT should also consider turning off the setting allowing ChatGPT to use their data for model training purposes.
4. Risks from the Sharing of Data
As documented in the section on data sharing, there are several sections of the OpenAI privacy policy which are ambiguous on the policies regarding the sharing of data, and particularly data which is not considered personally identifiable. This is especially concerning since OpenAI does not provide a definition for this important term.
Users of ChatGPT should operate under the assumption that any information they provide outside of specific identifiable information (such as full name) may be provided to third parties and affiliates. OpenAI should also make an effort to clarify the language used in their privacy policy to address these concerns, and inform users of any opt-out options.
5. Risks from Media, Voice, and Image-Based Interactions
The additional risks of media, voice, and image-based interactions include providing ChatGPT with sensitive metadata (such as geolocation data), and demographic information (such as age and race).
Whenever possible, ChatGPT users should not provide information to ChatGPT in media, voice, or image-based formats. Instead, users should utilize text-based interactions to avoid many of the privacy concerns presented by the uploading of media, voice, and image files.
References
- N. Kenney, "A Brief Analysis of the Architecture, Limitations, and Impacts of ChatGPT," Mar. 2023, doi: 10.5281/zenodo.7762245.
- "Number of parameters in notable Artificial Intelligence Systems," Our World in Data. [Online]. Available: https://ourworldindata.org/grapher/artificial-intelligence-parameter-count.
- N. Kenney, "A Brief Analysis of the Architecture, Limitations, and Impacts of ChatGPT," Mar. 2023, doi: 10.5281/zenodo.7762245.
- "Is ChatGPT safe for all ages?" OpenAI Help Center. Accessed: Oct. 26, 2023. [Online]. Available: https://help.openai.com/en/articles/8313401-is-chatgpt-safe-for-all-ages.
- "Privacy policy," OpenAI. Accessed: Oct. 26, 2023. [Online]. Available: https://openai.com/policies/privacy-policy.
- "What is cross-context behavioral advertising in CPRA?" CookieYes. Accessed: Oct. 26, 2023. [Online]. Available: https://www.cookieyes.com/knowledge-base/ccpa/cross-context-behavioral-advertising/.
- "Privacy policy," OpenAI. Accessed: Oct. 26, 2023. [Online]. Available: https://openai.com/policies/privacy-policy.
- J. Novet, "Microsoft's $13 billion bet on OpenAI carries huge potential along with plenty of uncertainty," CNBC. Accessed: Oct. 26, 2023. [Online]. Available: https://www.cnbc.com/2023/04/08/microsofts-complex-bet-on-openai-brings-potential-and-uncertainty.html.
- "Privacy policy," OpenAI. Accessed: Oct. 26, 2023. [Online]. Available: https://openai.com/policies/privacy-policy.
- "Privacy policy," OpenAI. Accessed: Oct. 26, 2023. [Online]. Available: https://openai.com/policies/privacy-policy.
- "Does ChatGPT Save Data?" Botpress Blog. Accessed: Nov. 19, 2023. [Online]. Available: https://botpress.com/blog/does-chatgpt-save-data.
- "ChatGPT Security and Privacy Issues Remain in GPT-4," eSecurity Planet. Accessed: Nov. 19, 2023. [Online]. Available: https://www.esecurityplanet.com/threats/gpt4-security/.
- "ChatGPT can now see, hear, and speak," OpenAI. Accessed: Nov. 19, 2023. [Online]. Available: https://openai.com/blog/chatgpt-can-now-see-hear-and-speak.
- "Google Indexing Public Bard Conversations In Search Results," Search Engine Journal. Accessed: Nov. 19, 2023. [Online]. Available: https://www.searchenginejournal.com/google-indexing-bard-conversations-in-search-results/497161/.
- "Data usage for consumer services FAQ," OpenAI Help Center. Accessed: Nov. 19, 2023. [Online]. Available: https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq.
- L. Irwin, "Summary of the GDPR's 10 key requirements," IT Governance Blog. Accessed: Nov. 19, 2023. [Online]. Available: https://www.itgovernance.eu/blog/en/summary-of-the-gdprs-10-key-requirements.
Discuss this research.
Interested in AI privacy policy, data governance, or the broader implications of ChatGPT's data practices?
Get in Touch