What is responsible data science? Is it different from responsible AI? To understand these domains, we must first define data science and AI and explore how they intersect.
The rise of data science and artificial intelligence (AI) across multiple sectors of society has prompted discussions surrounding responsibility and technological advancements [2]. Accordingly, we're seeing the advent of responsible data science (RDS) and responsible AI (RAI) movements. RDS and RAI are undoubtedly interconnected, and both aim to minimize potential harm from these innovations, but they are not synonymous. Before exploring what it means to practice responsibility in these domains, it is helpful to understand the differences between RDS and RAI. Let's begin by defining data science and AI.
Data science is an interdisciplinary field that combines programming, mathematics and statistics, advanced analytics, and specific subject matter expertise to uncover actionable insights buried in data [9]. Data science integrates the scientific method with mathematical and statistical theories to analyze and interpret complex data sets. It requires specialized programming skills—with proficiency in languages such as Python, R, and SQL—to develop algorithms and models for data analysis and visualization. Advanced analytics and AI are crucial for solving complex problems and creating predictive models. These elements help dig deep into data to uncover patterns and trends, which are essential for making informed decisions.
Storytelling is another integral aspect of data science, helping the data scientist translate complex data findings into comprehensible narratives for non-expert stakeholders. This ensures that the insights extracted are not only understood but are also actionable. Additionally, possessing specific subject matter expertise allows for a nuanced understanding of data, thereby enriching the analysis and interpretation process. Data science plays a pivotal role in data-driven decision-making and strategic planning, empowering organizations worldwide to optimize operations, inform strategies, and drive innovation.
In most of its current business applications—such as predicting customer behavior—AI is a tool (and sometimes a product) in the data science field. Most modern AI systems rely heavily on large amounts of data, which are processed through a data science workflow. However, AI is an independent field with its own history and particular elements. AI's origins can be traced back to Alan Turing's pioneering 1950 paper "Computing Machinery and Intelligence," in which he proposed the famous Turing test to evaluate a machine's ability to exhibit intelligent behavior [8]. Computer scientist John McCarthy is credited with coining the term "artificial intelligence" in 1955 and later defined it as "the science and engineering of making intelligent machines" [8, 15]. At its core, AI refers to the ability of machines to learn and implement appropriate strategies to solve problems and achieve goals in complex, uncertain environments.
The current development of AI focuses on creating systems that learn independently—similar to how humans learn—instead of merely following programmed instructions. This type of learning, called machine learning (ML), aims to improve computers' abilities to understand data and take action by learning from experience. There are different types of ML, including supervised learning (learning from labeled examples), unsupervised learning (finding structure in unlabeled data), and reinforcement learning (learning through trial and error guided by feedback).
Recently, an ML approach called deep learning has been very successful and has driven much of the field's progress. Deep learning uses neural networks, loosely modeled after the human brain, to learn from large amounts of data. This allows AI systems to learn on their own, in a human-like way, more efficiently than previous approaches [8, 10, 13]. Deep learning has thus led to significant advances in AI's ability to learn autonomously. A prominent example is generative AI, a class of deep learning models that powers applications like ChatGPT, Claude, and Midjourney [11, 14].
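The idea of "learning from experience" can be illustrated with a single artificial neuron, the building block that deep networks stack into many layers. The sketch below is a toy example, not production deep learning: one neuron trained by gradient descent to learn the logical OR function from labeled examples.

```python
import math

# A single artificial neuron (the unit deep networks stack in layers),
# trained by gradient descent to learn logical OR from examples.
# Illustrative sketch only; real deep learning uses many layers and
# frameworks such as PyTorch or TensorFlow.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Training data: inputs and target outputs for logical OR
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w1, w2, b = 0.0, 0.0, 0.0  # weights and bias start at zero
lr = 0.5                   # learning rate

for _ in range(5000):      # repeated passes: learning from experience
    for (x1, x2), target in data:
        pred = sigmoid(w1 * x1 + w2 * x2 + b)
        err = pred - target        # gradient of the cross-entropy loss
        w1 -= lr * err * x1        # nudge each weight to reduce the error
        w2 -= lr * err * x2
        b -= lr * err

# After training, the neuron reproduces the OR truth table
predictions = {inp: round(sigmoid(w1 * inp[0] + w2 * inp[1] + b))
               for inp, _ in data}
print(predictions)
```

The key point is that the rule for OR was never written down; the weights were adjusted automatically from examples, which is the essence of ML that deep learning scales up to millions of such units.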
The 21st century's data revolution can be attributed in part to AI systems, but the impact of data science in society goes beyond automated models. One example is the statistical analysis of clinical trial data, where effective medical treatments can be identified using traditional statistical methods without relying on AI.
Conversely, not all AI algorithms fall under data science. For instance, early implementations of AI in video games did not rely on the large datasets or sophisticated analysis methods characteristic of data science. AI was used in games like Space Invaders (1978) and Pac-Man (1980) to generate challenging gameplay based on predefined logic and patterns rather than on learning from data or player behavior [3]. These AI implementations were primarily rule-based and did not involve the data-driven insights that modern data science provides. So, although they are interconnected, data science and AI serve different primary purposes: the former is the process of discovering valuable insights and patterns in data using the scientific method, while the latter is oriented toward simulating human-like intelligence or behavior.
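To make the contrast concrete, here is a sketch of rule-based game AI in the spirit of those early titles. The ghost-movement rule below is hypothetical, not the actual Pac-Man algorithm: its behavior comes entirely from a hand-coded rule, with no data and no learning involved.

```python
# Rule-based "ghost" AI in the spirit of early arcade games.
# Hypothetical sketch, not the real Pac-Man logic: behavior follows a
# fixed hand-written rule rather than anything learned from data.

def ghost_step(ghost, player):
    """Move the ghost one grid cell toward the player using a fixed rule:
    close the larger axis gap first."""
    gx, gy = ghost
    px, py = player
    if abs(px - gx) >= abs(py - gy):
        gx += 1 if px > gx else (-1 if px < gx else 0)
    else:
        gy += 1 if py > gy else (-1 if py < gy else 0)
    return (gx, gy)

pos = (0, 0)
for _ in range(5):
    pos = ghost_step(pos, (3, 2))
print(pos)  # the ghost deterministically closes in on the player at (3, 2)
```

No dataset, training loop, or statistics appears anywhere; the "intelligence" is entirely the programmer's rule, which is exactly what separates this style of AI from data science.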
Data science and AI methods have brought visible positive impacts to the world, enabled by the exponential growth of data accessibility and computing power over the last few decades. They provide insights, diagnostics, predictions, and automated decisions that have improved the lives of millions. These improvements range from the mundane, such as personalized movie or shopping recommendations, to the life-changing, like improved identification and prediction of diseases and individualized medical treatments. However, the irresponsible use of data can also produce serious negative consequences: it can reinforce systemic discrimination and bias, deepen economic and social inequities, violate privacy, and produce opaque outcomes that further harm marginalized or vulnerable groups [12]. Hence, data science and AI must be grounded in responsible practices to ensure a future of diminishing harm and amplifying good.
A call for responsible development of AI has emerged from the ubiquity of AI applications and their ever-growing impact on people's lives. Responsible AI was born in this environment—a term that refers to the development and use of AI systems in a way that is ethical, trustworthy, and beneficial for society [1, 17]. RAI is based on the idea that AI should not cause harm, violate human rights, or undermine human values. Instead, AI should be aligned with human goals, respect human dignity, and promote human flourishing [16, 17].
Different stakeholders (i.e., researchers, developers, regulators, etc.) have proposed various principles for RAI. Some of the common values include fairness, reliability, safety, privacy, security, inclusiveness, transparency, and accountability [1, 7, 16, 17]. These principles help ensure that AI systems are designed, deployed, and used in a way that respects human autonomy, protects human interests, and fosters human collaboration. Some of the standard practices include ethical design, impact assessment, governance frameworks, compliance mechanisms, and stakeholder engagement. These practices provide guidance, oversight, and feedback for developing and using AI systems.
Even though most modern data science applications and projects rely heavily on AI systems, a future where data delivers beneficial outcomes with little to no harm depends on the actions of the whole data science field. Data science is a multi-step process that starts with identifying the problem before any data is analyzed. It then involves acquiring, cleaning, preparing, and modeling data for analysis. Communication of insights is also an important step, encompassing visualization, summarization, and presentation. Negative consequences can arise in any phase of the process. A more subtle concern is that even the most well-intentioned model can still produce detrimental outcomes if supplied with biased or faulty data [12]. Each step must be developed with safeguards, human oversight, and consistent evaluation. Every action requires careful consideration, from acquiring data to delivering the final results, which are themselves data.
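That multi-step process can be sketched as a chain of small functions, one per stage. This is a minimal illustration using hypothetical in-memory readings, not a production pipeline; the point is that each stage is a distinct place where safeguards and human review belong, and where faulty data (like the outlier below) can slip through if unchecked.

```python
import statistics

# A minimal sketch of the data science workflow: acquire, clean,
# analyze, communicate. The data and threshold are hypothetical.

def acquire():
    # Stand-in for pulling data from a database, API, or file.
    # Includes a missing value and an implausible outlier on purpose.
    return [12.1, 11.8, None, 12.4, 95.0, 11.9, 12.2]

def clean(raw):
    # Drop missing values, then drop gross outliers (here, anything
    # more than 3x the median; the rule itself warrants human review).
    values = [v for v in raw if v is not None]
    med = statistics.median(values)
    return [v for v in values if v <= 3 * med]

def analyze(values):
    # Modeling stage, reduced here to simple summary statistics.
    return {"mean": round(statistics.mean(values), 2),
            "stdev": round(statistics.stdev(values), 2)}

def communicate(summary):
    # Communication stage: translate numbers into a plain statement.
    return f"Average reading: {summary['mean']} (sd {summary['stdev']})"

report = communicate(analyze(clean(acquire())))
print(report)
```

Note how the biased input (the 95.0 reading) would silently distort the mean if the cleaning stage were skipped, which is the concern the paragraph above raises about well-intentioned models fed faulty data.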
RDS expands the scope of responsible practices beyond just AI systems to encompass the entire data science process. Ensuring beneficial outcomes depends on responsible actions throughout project scope, data acquisition, preparation, analysis, and communication. By bringing together academic experts, seasoned data science professionals, and the founding members of its board, the Data Science Alliance has defined fairness, transparency, privacy, and veracity as the core foundational pillars of responsible data science. Paramount across the four principles is a broad effort to prevent harm. It is also important to clarify that where AI and data science intersect, RAI principles and practices are assimilated and upheld under RDS. To ensure that RDS is not just an idea, but a practice, the Data Science Alliance developed the free Framework for Responsible Data Science Practices to help data practitioners incorporate the principles of RDS into their day-to-day projects [4].
RDS and RAI are distinct but complementary in their effort to harness the power of data for the common good. While RAI focuses on developing intelligent systems that align with human-centric principles, RDS examines responsible data practices across the entire data lifecycle. Both are critical to maximizing societal benefits as data science and AI advance. By collaborating to embed RAI within RDS, stakeholders—from academics to companies and everything in between—can help these powerful technologies flourish while upholding human dignity. With principled guardrails and human oversight guiding the way, data science and AI can transform society for the better.