3 Functional Programming
What you’ll learn by the end of this chapter:
- Why functional programming is crucial for reproducible, testable, and collaborative data science.
- How to write self-contained, “pure” functions in both R and Python.
- How to use functional concepts like map, filter, and reduce to replace error-prone loops.
- How writing functions makes your code easier to review, debug, and even generate with LLMs.
3.1 Introduction: From Scripts to Functions
So far, we’ve established two pillars of reproducible data science:
- Reproducible Environments (with Nix): Ensuring everyone has the exact same tools (R, Python, system libraries) to run the code.
- Reproducible History (with Git): Ensuring everyone has the exact same version of the code and can collaborate effectively.
Now we turn to the third and arguably most important pillar: writing reproducible code itself. A common way to start a data analysis is by writing a script: a sequence of commands that are executed from top to bottom.
# R script example
library(dplyr)
data(mtcars)
heavy_cars <- filter(mtcars, wt > 4)
mean_mpg_heavy <- mean(heavy_cars$mpg)
print(mean_mpg_heavy)
# Python script example
import pandas as pd
mtcars = pd.read_csv("mtcars.csv")  # Assume the file exists
heavy_cars = mtcars[mtcars['wt'] > 4]
mean_mpg_heavy = heavy_cars['mpg'].mean()
print(mean_mpg_heavy)
This works, but it has a hidden, dangerous property: state. The script relies on variables like heavy_cars existing in the environment, making the code hard to reason about, debug, and test. If scripting with state is a crack in the foundation of reproducibility, then using computational notebooks is a gaping hole.
Notebooks like Jupyter introduce an even more insidious form of state: the cell execution order. You can execute cells out of order, meaning the visual layout of your code has no relation to how it actually ran. This is a recipe for non-reproducible results and a primary cause of the “it worked yesterday, why is it broken today?” problem.
The solution to this chaos is to embrace a paradigm that minimizes state: Functional Programming (FP). Instead of a linear script, we structure our code as a collection of self-contained, predictable functions. To support this, we will work exclusively in plain text files (.R, .py), which enforce a predictable, top-to-bottom execution, and use literate programming (using Quarto).
The power of FP comes from the concept of purity, borrowed from mathematics. A mathematical function has a beautiful property: for a given input, it always returns the same output. sqrt(4) is always 2. Its result doesn’t depend on what you calculated before or on a random internet connection. Our Nix environments handle the “right library” problem; purity handles the “right logic” problem. Our goal is to write our analysis code with this same level of rock-solid predictability.
3.1.1 Why Does This Matter for Data Science?
Adopting a functional style brings massive benefits that directly connect to our previous chapters:
- Unit Testing is Now Possible: You can’t easily test a 200-line script. But you can easily test a small function that does one thing. Does calculate_mean_mpg(data) return the correct value for a sample dataset? This makes your code more reliable.
- Code Review is Easier (Git Workflow): As we saw in the Git chapter, reviewing a small, self-contained change is much easier than reviewing a giant, sprawling one. A Pull Request that just adds or modifies a single function is simple for your collaborators to understand and approve.
- Working with LLMs is More Effective: It’s difficult to ask an LLM to “fix my 500-line analysis script.” It’s incredibly effective to ask, “Write a Python function that takes a pandas DataFrame and a column name, and returns the mean of that column, handling missing values. Also, write three pytest unit tests for it.” Functions provide the clear boundaries and contracts that LLMs excel at working with (see the sketch just after this list).
- Readability and Maintainability: Well-named functions are self-documenting. starwars %>% group_by(species) %>% summarize(mean_height = mean(height)) is instantly understandable. The equivalent for loop is a puzzle you have to solve.
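To make the LLM example above concrete, here is a minimal sketch of what such a prompt might return, assuming nothing beyond pandas and pytest; the function name column_mean and the test names are invented for illustration.

import pandas as pd

def column_mean(df: pd.DataFrame, column: str) -> float:
    """Return the mean of `column`, ignoring missing values."""
    return df[column].mean(skipna=True)

# --- three pytest unit tests (e.g. in a file called test_column_mean.py) ---
def test_basic_mean():
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    assert column_mean(df, "x") == 2.0

def test_missing_values_are_ignored():
    df = pd.DataFrame({"x": [1.0, None, 3.0]})
    assert column_mean(df, "x") == 2.0

def test_all_missing_returns_nan():
    df = pd.DataFrame({"x": [float("nan"), float("nan")]})
    assert pd.isna(column_mean(df, "x"))

Because the function is small and pure, both a human reviewer and an LLM can reason about it in isolation.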
3.2 Purity and Side Effects
A pure function has two rules:
- It only depends on its inputs. It doesn’t use any “global” variables defined outside the function.
- It doesn’t change anything outside of its own scope. It doesn’t modify a global variable or write a file to disk. This is called having “no side effects.”
Consider this “impure” function in Python:
# IMPURE: Relies on a global variable
discount_rate = 0.10

def calculate_discounted_price(price):
    return price * (1 - discount_rate)  # What if discount_rate changes?

print(calculate_discounted_price(100))
# > 90.0

discount_rate = 0.20  # Someone changes the state
print(calculate_discounted_price(100))
# > 80.0 -- Same input, different output!
The pure version passes all its dependencies as arguments:
# PURE: All inputs are explicit arguments
def calculate_discounted_price_pure(price, rate):
    return price * (1 - rate)
print(calculate_discounted_price_pure(100, 0.10))
# > 90.0
print(calculate_discounted_price_pure(100, 0.20))
# > 80.0
Now the function is predictable and self-contained.
3.2.1 Handling “Impure” Operations like Randomness
Some operations, like generating random numbers, are inherently impure. Each time you run rnorm(10) or numpy.random.rand(10), you get a different result.
The functional approach is not to avoid this, but to control it by making the source of impurity (the random seed) an explicit input.
In R, the {withr} package helps create a temporary, controlled context:
library(withr)
# This function is now pure! For a given seed, the output is always the same.
pure_rnorm <- function(n, seed) {
  with_seed(seed, {
    rnorm(n)
  })
}
pure_rnorm(n = 5, seed = 123)
pure_rnorm(n = 5, seed = 123)
In Python, numpy provides a more modern, object-oriented way to handle this, which is naturally functional:
import numpy as np
# Create a random number generator instance with a seed
rng = np.random.default_rng(seed=123)
# Now, calls on this 'rng' object are deterministic within its context
print(rng.standard_normal(5))
# If we re-create the same generator, we get the same numbers
rng2 = np.random.default_rng(seed=123)
print(rng2.standard_normal(5))
The key is the same: the “state” (the seed) is explicitly managed, not hidden globally.
However, this introduces a concept from another programming paradigm: Object-Oriented Programming (OOP). The rng variable is not just a value; it’s an object that bundles together data (its internal seed state) and methods that operate on that data (.standard_normal()). This is called encapsulation. This is a double-edged sword for reproducibility. On one hand, it’s a huge improvement over hidden global state. On the other, the rng object itself is now a stateful entity. If we called rng.standard_normal(5) a second time, it would produce different numbers because its internal state would have been mutated by the first call.
In a purely functional world, we would avoid creating such stateful objects. However, in the pragmatic world of Python data science, this is often unavoidable. Core libraries like pandas, scikit-learn, and matplotlib are fundamentally object-oriented. You create DataFrame objects, model objects, and plot objects, all of which encapsulate state. Our guiding principle, therefore, must be one of careful management: use functions for the flow and logic of your analysis, and treat objects from libraries as values that are passed between these functions. Avoid building your own complex classes with hidden state for your data pipeline. A pipeline composed of functions (df2 = clean_data(df1); df3 = analyze_data(df2)) is almost always more transparent and reproducible than an object-oriented one (pipeline.load(); pipeline.clean(); pipeline.analyze()).
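As a minimal sketch of this guiding principle (the step names clean_data and analyze_data and the toy columns are invented here, not part of any library), a pipeline can stay a chain of plain functions even though the values flowing through it are stateful pandas objects:

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with a missing weight and return a *new* DataFrame."""
    return df.dropna(subset=["wt"])

def analyze_data(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize mpg by cylinder count and return a new DataFrame."""
    return df.groupby("cyl", as_index=False)["mpg"].mean()

# Each step takes a value and returns a new value; nothing is mutated in place.
raw = pd.DataFrame({"cyl": [4, 4, 6], "wt": [2.1, None, 3.5], "mpg": [30.0, 28.0, 21.0]})
result = analyze_data(clean_data(raw))
print(result)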
3.2.2 Can We Make This Truly Pure?
This naturally raises the next question: can we force this numpy example to be truly pure? A pure function cannot have side effects, which means it cannot mutate the rng object’s internal state. To achieve this, our function must take the generator’s current state as an explicit input and return a tuple containing both the desired random numbers and the new, updated state of the generator.
Let’s write a wrapper function that does exactly this:
import numpy as np

def pure_standard_normal(generator_state, n_samples):
    """
    A pure function to generate standard normal random numbers.

    Args:
        generator_state: The current state of a numpy BitGenerator.
        n_samples: The number of samples to generate.

    Returns:
        A tuple containing (random_numbers, new_generator_state).
    """
    # 1. Create a temporary generator and restore the input state on it
    bit_generator = np.random.PCG64()
    bit_generator.state = generator_state
    temp_rng = np.random.Generator(bit_generator)

    # 2. Generate the numbers (this mutates the *temporary* generator)
    numbers = temp_rng.standard_normal(n_samples)

    # 3. Extract the new state from the temporary generator
    new_state = temp_rng.bit_generator.state

    # 4. Return both the result and the new state
    return (numbers, new_state)

# --- How to use this pure function ---
# 1. Get an initial state from a seed
initial_state = np.random.PCG64(123).state

# 2. First call: provide the state, get back numbers and a *new* state
first_numbers, state_after_first_call = pure_standard_normal(initial_state, 5)
print("First call results:", first_numbers)

# 3. Second call: MUST use the new state from the previous call
second_numbers, state_after_second_call = pure_standard_normal(state_after_first_call, 5)
print("Second call results:", second_numbers)

# Proof of purity: If we re-use the initial state, we get the exact same "first" result
proof_numbers, _ = pure_standard_normal(initial_state, 5)
print("Proof call results:", proof_numbers)
As you can see, this is now 100% pure and predictable. The function pure_standard_normal will always produce the same output tuple for the same inputs.
3.2.2.1 Is This Feasible in Practice?
While this is a powerful demonstration of functional principles, it is often not practical for day-to-day data science in Python. Manually passing the state variable from one function to the next throughout an entire analysis script (state_1, state_2, state_3, …) would be extremely verbose and cumbersome.
The key takeaway is understanding the trade-off. The object-oriented approach (rng = np.random.default_rng(seed=123)) is a pragmatic compromise. It encapsulates the state in a predictable way, which is a vast improvement over hidden global state, even if it’s not technically “pure”. If you have to use Python: treat stateful library objects like rng as values that are created once with a fixed seed and passed into your pure analysis functions. This gives you 99% of the benefit of reproducibility with a fraction of the complexity.
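A short sketch of that recommendation, assuming only numpy (the function names simulate_heights and bootstrap_mean are invented for illustration): the generator is created once, with an explicit seed, and then passed into every function that needs randomness.

import numpy as np

def simulate_heights(rng: np.random.Generator, n: int) -> np.ndarray:
    """Draw n simulated heights; all randomness comes from the rng argument."""
    return rng.normal(loc=170, scale=10, size=n)

def bootstrap_mean(rng: np.random.Generator, x: np.ndarray, n_boot: int = 1000) -> float:
    """Bootstrap the mean of x, drawing resamples only from the rng passed in."""
    means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)]
    return float(np.mean(means))

# Created once with a fixed seed, then threaded through the analysis functions.
rng = np.random.default_rng(seed=123)
heights = simulate_heights(rng, 100)
print(bootstrap_mean(rng, heights))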
This difference in the “feel” of functional composition between R’s pipe and Python’s method chaining is no accident; it reflects the deep-seated design philosophies of each language. This context is crucial for understanding why certain patterns feel more “natural” in each environment. R’s lineage traces back to the S language, which was itself heavily influenced by Scheme, a dialect of Lisp and a bastion of functional programming. Consequently, treating data operations as a series of function transformations is baked into R’s DNA. The entire Tidyverse ecosystem, with its ubiquitous pipe, is a modern implementation of this functional heritage.
Python, in contrast, was designed with a different set of priorities, famously summarized in its Zen: “There should be one—and preferably only one—obvious way to do it.” Its creator, Guido van Rossum, historically argued that explicit for loops and list comprehensions were more readable and “Pythonic” than functional constructs like map and lambda. He was so committed to this principle of one clear path that he even proposed removing these functions from the language entirely at one point.
R is fundamentally a functional language that has acquired object-oriented features, while Python is a quintessential object-oriented language with powerful functional capabilities. Recognizing this history helps explain why a chain of functions feels native in R, while method chaining on objects is the default in pandas and polars. My goal in this course is for you to master the functional paradigm so you can apply it effectively in either language, leveraging the native strengths of each.
3.3 Writing Your Own Functions
Let’s learn the syntax. The goal is always to encapsulate a single, logical piece of work.
3.3.0.1 In R
R functions are first-class citizens. You can assign them to variables and pass them to other functions.
# A simple function
calculate_ci <- function(x, level = 0.95) {
  # Calculate the mean and standard error
  se <- sd(x, na.rm = TRUE) / sqrt(length(x))
  mean_val <- mean(x, na.rm = TRUE)

  # Calculate the confidence interval bounds
  alpha <- 1 - level
  lower <- mean_val - qnorm(1 - alpha/2) * se
  upper <- mean_val + qnorm(1 - alpha/2) * se

  # Return a named vector
  # the `return()` statement is not needed at the end
  # but can be useful for early returning a result
  c(mean = mean_val, lower = lower, upper = upper)
}

# Use it
data <- c(1.2, 1.5, 1.8, 1.3, 1.6, 1.7)
calculate_ci(data)
For data analysis, you’ll often want to write functions that work with data frames and column names. The {dplyr} package uses a special technique called “tidy evaluation” for this.
library(dplyr)
# A function that summarizes a column in a dataset
summarize_variable <- function(dataset, var_to_summarize) {
  dataset %>%
    summarise(
      n = n(),
      mean = mean({{ var_to_summarize }}, na.rm = TRUE),
      sd = sd({{ var_to_summarize }}, na.rm = TRUE)
    )
}

# The {{ }} (curly-curly) syntax tells dplyr to use the column name
# passed into the function.
starwars %>%
  group_by(species) %>%
  summarize_variable(height)
This is incredibly powerful for creating reusable analysis snippets. To learn more, read about programming with {dplyr} here.
3.3.0.2 In Python
Python’s syntax is similar, using the def keyword. Type hints are a best practice for clarity.
import pandas as pd
import numpy as np
# A function to summarize a column in a DataFrame
def summarize_variable_py(dataset: pd.DataFrame, var_to_summarize: str) -> pd.DataFrame:
    """Calculates summary statistics for a given column."""
    summary = dataset.groupby('species').agg(
        n=(var_to_summarize, 'size'),
        mean=(var_to_summarize, 'mean'),
        sd=(var_to_summarize, 'std')
    ).reset_index()
    return summary
# Load data (assuming starwars.csv exists)
# starwars_py = pd.read_csv("starwars.csv")
# summarize_variable_py(starwars_py, 'height')
3.4 The Functional Toolkit: Map, Filter, and Reduce
Once you start thinking in functions, you’ll notice common patterns emerge. Most for loops can be replaced by one of three core functional concepts: mapping, filtering, or reducing. These operations are handled by “higher-order functions”—functions that take other functions as arguments. Mastering them is key to writing elegant, declarative code.
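As a tiny illustration of what “higher-order” means before we meet the big three, here is a minimal sketch with an invented helper apply_twice that receives another function as an argument:

def apply_twice(func, value):
    """A higher-order function: `func` is itself a function."""
    return func(func(value))

def add_one(x):
    return x + 1

print(apply_twice(add_one, 10))
# > 12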
3.4.1 1. Mapping: Applying a Function to Each Element
The pattern: You have a list of things, and you want to perform the same action on each element, producing a new list of the same length.
This is the most common replacement for a for loop. Instead of manually iterating and storing results, you just state your intent: “map this function over this list.”
3.4.1.1 In R with purrr::map()
The {purrr} package is the gold standard for functional programming in R. The map() family is its workhorse.
- map(): Always returns a list.
- map_dbl(): Returns a vector of doubles (numeric).
- map_chr(): Returns a vector of characters (strings).
- map_lgl(): Returns a vector of logicals (booleans).
- map_dfr(): Returns a data frame by row-binding the results.
In base R, we have lapply(), vapply(), and apply(), but the {purrr} functions provide a more homogeneous interface.
Example: Calculate the mean of every column in a data frame.
library(purrr)
# The classic for-loop way (verbose and clunky)
# Allocate an empty vector with the right size
means_loop <- vector("double", ncol(mtcars))

for (i in seq_along(mtcars)) {
  means_loop[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}

print(means_loop)

# The functional way with map_dbl()
means_functional <- map_dbl(mtcars, mean, na.rm = TRUE)
print(means_functional)
The map() version is not just shorter; it’s safer. You can’t make an off-by-one error, and you don’t have to pre-allocate means_loop. The code clearly states its purpose.
3.4.1.2 In Python with List Comprehensions and map()
Python’s most idiomatic tool for mapping is the list comprehension, which we saw earlier. It’s concise and highly readable.
numbers = [1, 2, 3, 4, 5]
squares = [n**2 for n in numbers]
# > [1, 4, 9, 16, 25]
Python also has a built-in map() function, which returns a “map object” (an iterator). You usually wrap it in list() to see the results. It’s most useful when you already have a function defined.
def to_upper_case(s: str) -> str:
    return s.upper()

words = ["hello", "world"]
upper_words = list(map(to_upper_case, words))
# > ['HELLO', 'WORLD']
3.4.2 2. Filtering: Keeping Elements That Match a Condition
The pattern: You have a list of things, and you want to keep only the elements that satisfy a certain condition. The condition is defined by a function that returns TRUE or FALSE.
3.4.2.1 In R with purrr::keep() or purrr::discard()
keep() retains elements where the function returns TRUE. discard() does the opposite. The base function is Filter().
Example: From a list of data frames, keep only the ones with more than 100 rows.
# setup: create a list of data frames
df1 <- data.frame(x = 1:50)
df2 <- data.frame(x = 1:200)
df3 <- data.frame(x = 1:75)
list_of_dfs <- list(a = df1, b = df2, c = df3)

# The functional way to filter the list
large_dfs <- keep(list_of_dfs, ~ nrow(.x) > 100)
print(names(large_dfs))
3.4.2.2 In Python with List Comprehensions
List comprehensions have a built-in if clause that makes filtering incredibly natural.
numbers = [1, 10, 5, 20, 15, 30]

# Keep only numbers greater than 10
large_numbers = [n for n in numbers if n > 10]
# > [20, 15, 30]
Python also has a built-in filter() function, which, like map(), returns an iterator.
def is_even(n: int) -> bool:
    return n % 2 == 0

numbers = [1, 2, 3, 4, 5, 6]
even_numbers = list(filter(is_even, numbers))
# > [2, 4, 6]
3.4.3 3. Reducing: Combining All Elements into a Single Value
The pattern: You have a list of things, and you want to iteratively combine them into a single summary value. You start with an initial value and repeatedly apply a function that takes the “current total” and the “next element.”
This is the most complex of the three but is powerful for things like summing, finding intersections, or joining a list of data frames.
3.4.3.1 In R with purrr::reduce()
Example: Find the total sum of a vector of numbers.
# reduce() will take the first two elements (1, 2), apply `+` to get 3.
# Then it takes the result (3) and the next element (3), applies `+` to get 6.
# And so on.
total_sum <- reduce(c(1, 2, 3, 4, 5), `+`)
# This is equivalent to 1 + 2 + 3 + 4 + 5
print(total_sum)
The base R function is called Reduce().
A more practical data science example: find all the column names that are common to a list of data frames.
# Get the column names of each df in the list
list_of_colnames <- map(list_of_dfs, names)
print(list_of_colnames)

# Use reduce with the `intersect` function to find common elements
common_cols <- reduce(list_of_colnames, intersect)
print(common_cols)
3.4.3.2 In Python with functools.reduce
The reduce function was moved out of the built-ins and into the functools module in Python 3 because it’s often less readable than an explicit for loop for simple operations like summing (well, at least according to Python users…). However, it’s still the right tool for more complex iterative combinations.
from functools import reduce
import operator
numbers = [1, 2, 3, 4, 5]

# Use reduce with the addition operator to sum the list
total_sum_py = reduce(operator.add, numbers)
# > 15

# You can also use a lambda function
total_product = reduce(lambda x, y: x * y, numbers)
# > 120
3.5 The Power of Composition
The final, beautiful consequence of a functional style is composition. You can chain functions together to build complex workflows from simple, reusable parts. This is exactly what the pipe operators (|> in R, %>% from {magrittr}) and method chaining (the . in pandas) are designed for.
This R code is a sequence of function compositions:
starwars %>%
  filter(!is.na(mass)) %>%
  select(species, sex, mass) %>%
  group_by(sex, species) %>%
  summarise(mean_mass = mean(mass), .groups = "drop")
This is equivalent to summarise(group_by(select(filter(starwars, ...)))). The pipe makes it readable.
The same idea applies in Python with pandas:
# (starwars_py
# .dropna(subset=['mass'])
# .filter(items=['species', 'sex', 'mass'])
# .groupby(['sex', 'species'])
# ['mass'].mean()
# .reset_index()
# )
The issue with method chaining, though, is that it only works with the methods that are available for pandas.DataFrame objects. You could apply another function using pandas.DataFrame.apply(), but you can’t pipe in functions from different packages the way you can in R (more on this in the next subsection).
Each step is a function that takes a data frame and returns a new, transformed data frame. By combining map, filter, and reduce with this compositional style, you can express complex data manipulation pipelines without writing a single for loop. This makes your code more declarative, less prone to bugs, and easier to reason about—a perfect fit for a reproducible workflow.
3.5.1 The Challenge of Composition in Python
The difficulty of function composition in Python, as noted above, stems from fundamental differences in its primary programming paradigm compared to R. While both languages are powerful and versatile, their core designs influence how naturally they support the seamless chaining of functions. The crux of the matter lies in Python’s primarily encapsulated Object-Oriented Programming (OOP) versus R’s functional OOP and its use of polymorphic functions.
3.5.1.1 Python’s Encapsulated OOP: Methods Belong to Objects
As already mentioned earlier in the chapter, Python is predominantly an object-oriented language where data and the functions that operate on that data are bundled together into objects. This concept is known as encapsulation. A class defines the blueprint for an object, and the functions defined within a class are called methods. These methods are intrinsically tied to the object’s class and are typically invoked using dot notation (.), as seen in the pandas example.
This tight coupling of methods to specific object types is the main reason why fluid composition can be challenging. Method chaining, while elegant, is limited to the methods that have been explicitly defined for a particular class. To apply a function from a different library or a user-defined function that isn’t a method of the object, you often need to use workarounds like apply() in pandas, which can break the intuitive flow of a composition chain.
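As an illustrative sketch (the helper normalize is invented here), a custom transformation that is not a DataFrame or Series method cannot simply be inserted into a method chain; you fall back on something like apply():

import pandas as pd

def normalize(s: pd.Series) -> pd.Series:
    """Center and scale a numeric column; not a method of DataFrame or Series."""
    return (s - s.mean()) / s.std()

df = pd.DataFrame({"mpg": [21.0, 22.8, 18.7], "wt": [2.62, 2.32, 3.44]})

# The chain only "sees" DataFrame/Series methods, so the custom function
# enters through apply(), which runs it on each selected column.
scaled = df[["mpg", "wt"]].apply(normalize)
print(scaled)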
Furthermore, while Python supports functional programming concepts, they are not always as central to the language’s design as its OOP features. For instance, lambda functions in Python are limited to a single expression, which can make defining complex on-the-fly functions cumbersome.
3.5.1.2 R’s Functional OOP: Functions are Polymorphic and Independent
In contrast, R was designed with a strong emphasis on functional programming. Its approach to object-orientation is described as functional OOP. In this paradigm, methods are not encapsulated within class definitions. Instead, functions are often “generic,” meaning they can behave differently depending on the class of the object passed to them. This is a form of polymorphism.
This design choice has a profound impact on composition. Because functions are not strictly owned by objects, they can be more easily and flexibly combined. The pipe operators in R (|> and %>%) are a testament to this, allowing for the creation of highly readable and complex data manipulation pipelines by passing data through a series of independent functions. Each function takes data as an input and returns a transformed version, which is then passed to the next function in the chain.
3.5.1.3 The Core Distinction: “Has-a” vs. “Is-a” and its Impact on Composition
The principle of “favor composition over inheritance” is a well-known software design guideline. Inheritance models an “is-a” relationship (a Dog is an Animal), while composition models a “has-a” relationship (a Car has an Engine); see the short sketch after the list below.
- Python’s encapsulated OOP often encourages the use of inheritance, where a class inherits methods from a parent class. While powerful, this can lead to rigid hierarchies.
- R’s functional approach naturally favors a compositional style. You build complex operations by combining simpler, single-purpose functions. This aligns well with the “has-a” model, where a data analysis pipeline “has a” filtering step, a selection step, and a summarization step.
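A minimal Python sketch of the two relationships, using invented toy classes, may make the distinction concrete:

class Animal:
    def breathe(self):
        return "breathing"

class Dog(Animal):          # "is-a": Dog inherits everything an Animal can do
    def bark(self):
        return "woof"

class Engine:
    def start(self):
        return "vroom"

class Car:                  # "has-a": Car is built by composing an Engine
    def __init__(self):
        self.engine = Engine()

    def drive(self):
        return self.engine.start()

print(Dog().breathe(), Dog().bark(), Car().drive())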
In essence, Python’s strength lies in creating well-defined, encapsulated objects with specific behaviors. R’s strength, particularly in data analysis, is in its ability to fluidly combine and apply functions to data. This makes the “compositional style” a more natural fit for the R ecosystem. While Python can achieve similar results, it often requires more deliberate effort to break out of the strict object-method paradigm to achieve the same level of compositional freedom.
3.6 Conclusion: Functions as the Bedrock of Reproducibility
This chapter has laid the groundwork for the most critical pillar of reproducible data science: writing reproducible code. We have moved beyond the ephemeral, state-dependent nature of scripts and computational notebooks to embrace the discipline and predictability of Functional Programming.
By treating functions as our primary unit of work, we unlock a cascade of benefits. Pure functions, which guarantee the same output for the same input, form the core of this approach. They are transparent, easy to reason about, and, most importantly, testable. When we encounter inherently “impure” operations like random number generation, we’ve learned to control the impurity by making the source of impurity (in this example the random seed) an explicit and managed input, rather than a hidden global state.
We’ve replaced verbose and error-prone for loops with the powerful functional trio of map, filter, and reduce. These higher-order functions allow us to express complex data manipulations declaratively, stating what we want to do rather than detailing how to do it. This leads to code that is not only more concise but also less prone to bugs.
Finally, we explored composition, the elegant chaining of these simple functions to build sophisticated analysis pipelines. We saw how this concept manifests differently in R and Python, a direct reflection of their core design philosophies. R’s functional heritage makes composition via the pipe (%>% or |>) a natural and seamless experience. Python’s object-oriented nature favors method chaining on objects like pandas DataFrames, a powerful but more constrained form of composition.
Understanding this distinction is key to becoming an effective data scientist in any language. By mastering the functional paradigm, you are not just learning a new coding style; you are adopting a new way of thinking. You are building a foundation for code that is robust, easy to review, simple to debug, and truly reproducible—the ultimate goal of any serious analytical project.
3.7 Exercises
The following exercises will help you solidify your understanding of functional programming concepts in both R and Python. Use built-in datasets like iris or mtcars for R, and you can load them into pandas DataFrames for the Python exercises.
3.7.1 1. From Impure to Pure
Goal: Refactor a function that relies on global state into a pure function.
- R: The following function filters mtcars to find cars with a miles-per-gallon (MPG) above a certain threshold, but the threshold is a global variable. Rewrite it so that it becomes a pure function.

# Impure function
mpg_threshold <- 20

get_efficient_cars_impure <- function(dataset) {
  dplyr::filter(dataset, mpg > mpg_threshold)
}

# Your task: Create a pure function `get_efficient_cars_pure`
# that takes the dataset and the threshold as arguments.
# Then, call it with a threshold of 25.
- Python: Do the same for the Python equivalent. Rewrite the impure function into a pure one (you can find the mtcars dataset as a csv here).

import pandas as pd

mtcars = pd.read_csv("path/to/mtcars.csv")

# Impure function
mpg_threshold = 20

def get_efficient_cars_impure(df):
    return df[df['mpg'] > mpg_threshold]

# Your task: Create a pure function `get_efficient_cars_pure`
# that takes the DataFrame and the threshold as arguments.
# Then, call it with a threshold of 25.
3.7.2 2. Mapping
Goal: Use mapping to apply a function to multiple elements of a list or data frame.
- R: Using the iris dataset, calculate the number of distinct values for each of its four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).
  - Hint: Use purrr::map_int() and the n_distinct() function from dplyr.
- Python: You are given a list of strings. Use a list comprehension to create a new list containing the length of each string.

words = ["functional", "programming", "is", "powerful"]
# Your task: create a list `word_lengths` containing [10, 11, 2, 8]
3.7.3 3. Filtering
Goal: Use filtering to select elements from a list based on a condition.
- R: You have a list of vectors. Use purrr::keep() to select only the vectors whose mean is greater than 5.

list_of_vectors <- list(
  a = c(1, 2, 9),
  b = c(8, 8, 9),
  c = c(1, 1, 2)
)
# Your task: create a list `high_mean_vectors` that contains only vectors a and b.
- Python: You have a list of dictionaries, where each dictionary represents a product. Use a list comprehension with an if clause to filter for products that are on sale.

products = [
    {'name': 'Laptop', 'price': 1200, 'on_sale': False},
    {'name': 'Mouse', 'price': 25, 'on_sale': True},
    {'name': 'Keyboard', 'price': 75, 'on_sale': True}
]
# Your task: create a list `sale_items` containing only the mouse and keyboard dicts.
3.7.4 4. Reducing
Goal: Use a reduce operation to aggregate a list into a single value.
- R: You are given three small data frames. Use purrr::reduce() and dplyr::full_join() to combine them into a single data frame.

df1 <- data.frame(id = c("a", "b"), val1 = c(1, 2))
df2 <- data.frame(id = c("a", "c"), val2 = c(3, 4))
df3 <- data.frame(id = c("b", "c"), val3 = c(5, 6))
list_of_dfs <- list(df1, df2, df3)
# Your task: use reduce to join them all by the 'id' column.
- Python: Given a list of lists (a 2D matrix), use functools.reduce to “flatten” it into a single list.

from functools import reduce

matrix = [[1, 2, 3], [4, 5], [6]]
# Your task: use reduce to produce the list [1, 2, 3, 4, 5, 6]
# Hint: `operator.add` can concatenate lists.
3.7.5 5. Composition Challenge: A Reusable Analysis Function
Goal: Write a reusable function that encapsulates a common data analysis task.
- R: Write a single function named summarize_by_group that takes three arguments: a data frame (dataset), a categorical column to group by (grouping_var), and a numeric column to summarize (summary_var). The function should return a summarized data frame with the count, mean, and standard deviation of summary_var within each group. Use {dplyr} and the {{ }} syntax. Test it on the iris dataset by summarizing Sepal.Length grouped by Species.
- Python: Write the equivalent function in Python named summarize_by_group_py. It should take a pandas DataFrame, a grouping_var name (string), and a summary_var name (string). Use .groupby() and .agg() to produce the same summary table (count, mean, sd). Test it on the penguins dataset by summarizing body_mass_g grouped by species.