Tony O'Rourke

Row by row tibble processing with purrr

An unexpected challenge

After becoming familiar with R and the tidyverse and enjoying their power and expressiveness, one naturally wants to use R for more general-purpose programming tasks. Once the data is in tibble format, surely there is a handy helper function for processing each row, for example to perform a SQL update or a web service lookup. This is such a common pattern that one assumes it is a simple task for the mighty R tidyverse.

Let's imagine we have a tibble of employees that looks like this:

    
        library(tibble)

        df.employees <- tribble(
            ~id,    ~first_name, ~last_name,  ~dob,        ~gender, ~title,
            '101A', 'Bob',      'Francis',   '1983-06-12', 'M',  'Director',
            '102C', 'Susan',    'Bluebell',  '1990-04-21', 'F',  'Assistent Director',
            '201C', 'Emily',    'Rosen',     '1971-11-07', 'F',  'CTO',
            '301X', 'Ashley',   'Emerson',   '2001-01-25', 'F',  'CFO'
        )
    

First, let's try an intuitive but naive approach to iterating: pipe the employees to an anonymous function using purrr's map function. The function itself just returns its argument.

    
        df.employees |>
           purrr::map(function(x) {
               x
        })
    

Coming from a general programming background, the output might not be what one would expect:

    
        $id
        [1] "101A" "102C" "201C" "301X"
        
        $first_name
        [1] "Bob"    "Susan"  "Emily"  "Ashley"
        
        $last_name
        [1] "Francis"  "Bluebell" "Rosen"    "Emerson" 
        
        $dob
        [1] "1983-06-12" "1990-04-21" "1971-11-07" "2001-01-25"
        
        $gender
        [1] "M" "F" "F" "F"
        
        $title
        [1] "Director"           "Assistent Director" "CTO"                "CFO"  
            
    

Instead of iterating over the rows it iterated over the columns! Indeed, if we look at the data type of the tibble:

    
        typeof(df.employees)
        [1] "list"
    

We see that a tibble is essentially a list (albeit of columns).

    
    
        str(df.employees)
        tibble [4 × 6] (S3: tbl_df/tbl/data.frame)
        $ id        : chr [1:4] "101A" "102C" "201C" "301X"
        $ first_name: chr [1:4] "Bob" "Susan" "Emily" "Ashley"
        $ last_name : chr [1:4] "Francis" "Bluebell" "Rosen" "Emerson"
        $ dob       : chr [1:4] "1983-06-12" "1990-04-21" "1971-11-07" "2001-01-25"
        $ gender    : chr [1:4] "M" "F" "F" "F"
        $ title     : chr [1:4] "Director" "Assistent Director" "CTO" "CFO"
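
To make the "list of columns" point concrete, here is a minimal sketch. It uses a shortened two-column stand-in (`df`, not the full employees tibble) so it can run on its own, and only assumes the tibble and purrr packages are installed.

```r
library(tibble)
library(purrr)

# Shortened stand-in for the employees tibble (an assumption, to keep this brief)
df <- tribble(
  ~id,    ~gender,
  "101A", "M",
  "102C", "F"
)

is.list(df)   # a tibble is a list under the hood
length(df)    # one list element per column, not per row
names(df)     # the element names are the column names

# map_int() calls length() once per column; each column is a vector
# holding one entry per row
col_lengths <- map_int(df, length)
col_lengths
```

Because the list elements are the columns, any function handed to map receives a whole column vector at a time, which is exactly what the output above showed.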
    

So map iterated over the list, but not in the row-by-row manner we were hoping for. Surely the mighty tidyverse has an elegant way of handling this.

pmap, a map variant, to the rescue

If we look in the documentation for map, we may notice the variant pmap, or parallel map, which has a special behavior when passed a data frame: it iterates row by row.

    A data frame is an important special case of .l. It will cause .f to be called once for each row.

This sounds like it is exactly what we want.

    
        df.employees |>
        purrr::pmap(function(...) {
            tb.row <- tibble(...)
            tb.row
        })
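
Since pmap passes each column of the current row as a named argument, the function does not have to rebuild a tibble if it only needs a few fields. The sketch below is one possible variation, again on a shortened stand-in tibble (`df` and `labels` are names chosen for illustration); it names the arguments it needs and lets `...` absorb the rest.

```r
library(tibble)
library(purrr)

# Shortened stand-in for the employees tibble (an assumption, to keep this brief)
df <- tribble(
  ~id,    ~first_name, ~last_name,
  "101A", "Bob",       "Francis",
  "102C", "Susan",     "Bluebell"
)

# Column names become argument names; `...` swallows any columns
# the function does not mention explicitly
labels <- pmap_chr(df, function(id, first_name, last_name, ...) {
  paste0(id, ": ", first_name, " ", last_name)
})

labels
```

The `pmap_chr` variant returns a character vector rather than a list, which is convenient when each row produces a single string.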
        

Each row becomes a one-row tibble from which we can access any column we want.

    
        df.employees |>
            purrr::pmap(function(...) {
                tb.row <- tibble(...)
                # Derive the new column from this row's own gender value
                tb.row$IsFemale <- tb.row$gender == 'F'
                tb.row
            }) |>
            dplyr::bind_rows()
    

Piping the output into bind_rows() gives us back a single tibble, as we had originally, now with a new column holding the result of our row-by-row processing.

    
        # A tibble: 4 × 7
          id    first_name last_name dob        gender title              IsFemale
          <chr> <chr>      <chr>     <chr>      <chr>  <chr>              <lgl>
        1 101A  Bob        Francis   1983-06-12 M      Director           FALSE
        2 102C  Susan      Bluebell  1990-04-21 F      Assistent Director TRUE
        3 201C  Emily      Rosen     1971-11-07 F      CTO                TRUE
        4 301X  Ashley     Emerson   2001-01-25 F      CFO                TRUE
    

This is precisely the kind of row-by-row iteration that I was hoping for, and the syntax is pretty clean.
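
For the side-effect tasks mentioned at the start (a SQL update or a web service call per row), purrr also offers pwalk, which iterates the same way as pmap but is meant for functions called for their side effects. The sketch below is only illustrative: `update_title()` is a hypothetical stand-in that records the statement it would send instead of talking to a database, and the `employees` table name is an assumption.

```r
library(tibble)
library(purrr)

# Shortened stand-in for the employees tibble (an assumption, to keep this brief)
df <- tribble(
  ~id,    ~title,
  "101A", "Director",
  "201C", "CTO"
)

# Hypothetical helper: builds the statement and records it rather than
# executing it. Real code should use parameterised queries, not sprintf().
sent <- character()
update_title <- function(id, title) {
  stmt <- sprintf("UPDATE employees SET title = '%s' WHERE id = '%s'", title, id)
  sent <<- c(sent, stmt)
  invisible(stmt)
}

# pwalk() calls the function once per row and discards the return values
pwalk(df, function(id, title, ...) update_title(id, title))

sent
```

Using pwalk rather than pmap signals that nothing useful is returned, and it avoids printing a list of results at the console.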

Conclusion

With a little digging and experimentation one can use R and the tidyverse as a general purpose language and reap a lot of its benefits: expressiveness and conciseness in getting a lot of work done without much code. One of R's charms is that it doesn't always do things the way one would expect, but that isn't necessarily a bad thing.

While investigating this topic I've found there are a lot of possible variations on this theme.
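
As one example of such a variation (a sketch, not necessarily the approach the pmap documentation recommends), the rows can also be visited by index with plain map, which yields each row as a one-row tibble without rebuilding it from `...`. The names `df` and `rows` are chosen for illustration.

```r
library(tibble)
library(purrr)

# Shortened stand-in for the employees tibble (an assumption, to keep this brief)
df <- tribble(
  ~id,    ~gender,
  "101A", "M",
  "102C", "F"
)

# Iterate over row numbers; df[i, ] subsets a tibble to a one-row tibble
rows <- map(seq_len(nrow(df)), function(i) df[i, ])

rows[[2]]
```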