Wrangling Label and Assignee List Columns

Most of the information provided by the GitHub API is well-suited for a tabular data structure. Entities like issues and milestones can easily be represented as a row of a dataset, and characteristics about them (e.g. date created, open / closed status) each fit one column.

The main exceptions to this rule are assignees and labels, since these potentially have a “one-to-many” relationship with issues. (That is, one issue can have multiple assignees or multiple labels.) One way to represent this would have been to create separate key tables for issues-assignees and issues-labels. However, this seemed like a bulky solution that would result in many unnecessary joins and API calls.

Instead, parse_issues() uses list columns to represent assignees and labels. That is, each row of the labels_name and assignees_login columns holds its own list of values rather than a single character value. This may seem unintuitive at first, but list columns are a neat way to rectangularize nested data structures.
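To make the idea of a list column concrete, here is a minimal base R sketch. The data frame toy_issues and its contents are made up for illustration and do not come from any real repository or from parse_issues() itself.

```r
# Toy data: hypothetical issues, not taken from any real repository
toy_issues <- data.frame(
  number = c(1L, 2L, 3L),
  title = c("Fix typo", "Add tests", "Plan release"),
  stringsAsFactors = FALSE
)

# Each row of the list column holds its own character vector of labels
toy_issues$labels_name <- list(
  c("bug", "priority:high"),
  character(0),
  "priority:low"
)

# Number of labels attached to each issue
n_labels <- lengths(toy_issues$labels_name)

# Does each issue carry the "bug" label?
is_bug <- vapply(toy_issues$labels_name, function(x) "bug" %in% x, logical(1))
```

Because every cell is itself a vector, element-wise helpers such as lengths() and vapply() take the place of the usual vectorized comparisons.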

However, wrangling list columns is slightly different from wrangling ordinary values because of the extra level of containment. To help work with these list columns, projmgr offers three helper functions: listcol_extract(), listcol_filter(), and listcol_pivot(). These are particularly well suited to cases where labels encode issue metadata, often as a key-value pair. For example, one might use a format like "priority:high" to denote importance or "blue-team" to denote responsibility.

To see these in action, we will take a snapshot of issues from the RForwards repo.

library(projmgr)
library(dplyr) # provides %>% and select()

forwards <- create_repo_ref("forwards", "tasks")
issues <- get_issues(forwards) %>% parse_issues()

As we can see, the labels_name column contains lists of entries.

library(dplyr)
select(issues, labels_name, number, title) %>% head()
#>                labels_name number
#> 1              survey-team     41
#> 2 help wanted, survey-team     40
#> 3                              35
#> 4                              33
#> 5                              32
#> 6                              31
#>                                                                              title
#> 1                                                       useR! 2018 survey analysis
#> 2                                        Create new Community section on Data page
#> 3                       Guidelines on Ableist language in Talks and Presentations.
#> 4                                             Rainbow R : LGBT+ in the R Community
#> 5 Inviting R community and event organizers from Africa and Asia to the RUG slack.
#> 6                                                      Joint event with Trans*Code

One common use of labels in this repo is to mark the team responsible for completing a task, using a tag of the form "{name}-team".

unique(unlist(issues$labels_name))
#> [1] "survey-team"       "help wanted"       "conferences-team" 
#> [4] "admin"             "on-ramps-team"     "branding"         
#> [7] "teaching-team"     "social-media-team" "community-team"
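Before turning to the helpers, it may help to see what row filtering on a list column involves. The following is a rough base R sketch on made-up data; has_label() is a hypothetical helper written for this illustration and is not projmgr's listcol_filter() or its implementation.

```r
# Hypothetical toy data mimicking the shape of the labels_name column
toy_issues <- data.frame(number = c(41L, 40L, 35L))
toy_issues$labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0)
)

# Hypothetical helper: TRUE for rows where any label matches the pattern
has_label <- function(labels, pattern) {
  vapply(labels, function(x) any(grepl(pattern, x)), logical(1))
}

# Keep only issues with at least one label ending in "-team"
team_issues <- toy_issues[has_label(toy_issues$labels_name, "-team$"), ]
team_issues$number
```

Issues with no labels fall out naturally, since any() over an empty set of matches is FALSE.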

Extract List Column

The listcol_extract() function creates a new column in the data by checking each element of the list column for a certain structure. For example, we can create a team column in our dataset by extracting the labels ending in "-team".

select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$") %>%
  head()
#>                labels_name number   team
#> 1              survey-team     41 survey
#> 2 help wanted, survey-team     40 survey
#> 3                              35   <NA>
#> 4                              33   <NA>
#> 5                              32   <NA>
#> 6                              31   <NA>
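As a rough mental model of this extraction, the same result can be approximated in base R. The sketch below uses made-up labels and a hypothetical helper, extract_first(); it is not how projmgr implements listcol_extract(), and unlike that function it keeps only the first match per row.

```r
# Made-up labels shaped like the first three rows above
labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0)
)

# Hypothetical helper: first label matching the pattern, with the pattern
# stripped; NA when nothing matches
extract_first <- function(labels, pattern) {
  vapply(labels, function(x) {
    hit <- grep(pattern, x, value = TRUE)
    if (length(hit) == 0) NA_character_ else sub(pattern, "", hit[1])
  }, character(1))
}

team <- extract_first(labels_name, "-team$")
```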

By default, the function names the new column a “cleaned-up” form of the regex used for matching, but this can be overridden with the new_col_name argument.

select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$", new_col_name = "team_name") %>%
  head()
#>                labels_name number team_name
#> 1              survey-team     41    survey
#> 2 help wanted, survey-team     40    survey
#> 3                              35      <NA>
#> 4                              33      <NA>
#> 5                              32      <NA>
#> 6                              31      <NA>

By default, the function also drops the regex from the values. This is controlled by the keep_regex argument.

select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$", keep_regex = TRUE) %>%
  head()
#>                labels_name number        team
#> 1              survey-team     41 survey-team
#> 2 help wanted, survey-team     40 survey-team
#> 3                              35        <NA>
#> 4                              33        <NA>
#> 5                              32        <NA>
#> 6                              31        <NA>

Unlike the above example, sometimes multiple items will match a given regex. In this case, a list-column is added to the dataset containing any matches.

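For example, an entry of the assignees_login field may hold several logins, and more than one of them may contain the letter “d”. In the snapshot shown here, however, the third entry is empty.

issues$assignees_login[[3]]
#> character(0)

When an entry does contain multiple matches, listcol_extract() returns a list-column holding all of them. Because none of the first six issues in this snapshot has an assignee, the new column below is NA throughout.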

select(issues, assignees_login, number) %>%
  listcol_extract("assignees_login", regex = "d", keep_regex = TRUE) %>%
  head()
#>   assignees_login number  d
#> 1                     41 NA
#> 2                     40 NA
#> 3                     35 NA
#> 4                     33 NA
#> 5                     32 NA
#> 6                     31 NA

Pivot List Column

Finally, the listcol_pivot() helper function identifies all labels matching a regex, extracts the “values” from the key-value pairs, and pivots them into boolean columns. For example, the following code makes a widened data frame with a separate column for each team. TRUE denotes that the team is responsible for the issue.

issues_by_team <-
  select(issues, number, labels_name) %>%
  listcol_pivot("labels_name",
                regex = "-team$",
                transform_fx = function(x) sub("-team", "", x),
                delete_orig = TRUE)

head(issues_by_team)
#>   number survey conferences on-ramps teaching social-media community
#> 1     41   TRUE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 2     40   TRUE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 3     35  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 4     33  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 5     32  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 6     31  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE

This has many convenient use cases, such as quickly counting the number of issues that fall into each category.

issues_by_team %>% select(-number) %>% summarize_all(sum)
#>   survey conferences on-ramps teaching social-media community
#> 1      4           1        3        2            2         2
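The pivot-then-count pattern can also be sketched in base R. The example below runs on made-up labels rather than the real snapshot, and it only approximates the idea; it is not projmgr's implementation of listcol_pivot().

```r
# Made-up labels for four hypothetical issues
labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  "admin",
  "conferences-team"
)

# Find the distinct labels that follow the "{name}-team" convention
all_labels <- unique(unlist(labels_name))
team_labels <- grep("-team$", all_labels, value = TRUE)

# Build a logical matrix: one row per issue, one column per team label
flags <- vapply(team_labels, function(lbl) {
  vapply(labels_name, function(x) lbl %in% x, logical(1))
}, logical(length(labels_name)))

# Strip the "-team" suffix from the column names
colnames(flags) <- sub("-team$", "", team_labels)

# Count issues per team
team_counts <- colSums(flags)
```

Summing a logical column counts its TRUE values, which is why colSums() here (and summarize_all(sum) above) gives the number of issues per team.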

A tidyr alternative

Besides these helper functions, another convenient way to work with list columns is to use tidyr::unnest() to create key tables mapping issue numbers (number) to label names (labels_name) or assignees (assignees_login).

library(tidyr)

For example, below we keep only the issue number and labels_name columns and unnest the latter, giving one row per issue-label pair. Issues with no labels drop out of the result.

issues_labels <-
  issues %>%
  select(number, labels_name) %>%
  unnest(cols = c(labels_name))
head(issues_labels)
#> # A tibble: 6 x 2
#>   number labels_name     
#>    <int> <chr>           
#> 1     41 survey-team     
#> 2     40 help wanted     
#> 3     40 survey-team     
#> 4     28 conferences-team
#> 5     27 admin           
#> 6     26 on-ramps-team

The same can be done to map between issue numbers and assignees.

issues_assignees <-
  issues %>%
  select(number, assignees_login) %>%
  unnest(cols = c(assignees_login))
head(issues_assignees)
#> # A tibble: 6 x 2
#>   number assignees_login
#>    <int> <chr>          
#> 1     28 hturner        
#> 2     25 hturner        
#> 3     25 emdodwell      
#> 4     23 emdodwell      
#> 5     22 emdodwell      
#> 6     21 hturner

Filtering logic can then be applied to these key tables to identify the relevant issue numbers, which can be joined back to (or used to filter) the complete issues data frame.
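As a small illustration of that last step, the sketch below filters a key table and then subsets the full table by the matching issue numbers. Both data frames are made-up stand-ins for issues and issues_labels, and base R is used so the example has no dependencies.

```r
# Hypothetical stand-in for the full issues table
issues_toy <- data.frame(
  number = c(41L, 40L, 28L),
  title = c("Issue A", "Issue B", "Issue C"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the unnested label key table
labels_key <- data.frame(
  number = c(41L, 40L, 40L, 28L),
  labels_name = c("survey-team", "help wanted", "survey-team", "conferences-team"),
  stringsAsFactors = FALSE
)

# Issue numbers carrying the "help wanted" label
wanted <- unique(labels_key$number[labels_key$labels_name == "help wanted"])

# Filter the full table down to those issues
help_wanted_issues <- issues_toy[issues_toy$number %in% wanted, ]
```

In a dplyr pipeline, filtering the key table and passing it to semi_join() with by = "number" should achieve the same thing.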
