Most of the information provided by the GitHub API is well-suited to a tabular data structure. Entities like issues and milestones can each be represented as a row of a dataset, and their characteristics (e.g. date created, open / closed status) each fit in one column.
The main exceptions to this rule are assignees and labels, as these potentially have a “one-to-many” relationship with issues. (That is, one issue can have multiple assignees or multiple labels.) One way to represent this would have been to create separate key tables for issues-assignees and issues-labels. However, this seemed like a bulky solution that would result in many unnecessary joins and API calls.
Instead, parse_issues() uses list columns to represent assignees and labels. Essentially, the labels_name and assignees_login columns do not contain character values; instead, each row contains its own list of values. This may seem unintuitive at first, but list columns are a neat way to rectangularize nested data structures.
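To make the idea concrete, here is a minimal base R sketch of a data frame with a list column. The data frame `issues_toy` and its values are hypothetical stand-ins for the output of parse_issues(), not real API results.

```r
# Toy data (hypothetical values), not pulled from the GitHub API
issues_toy <- data.frame(number = c(41L, 40L, 35L))

# Each row of the list column holds its own character vector of labels;
# the third issue has no labels at all
issues_toy$labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0)
)

# The column is a list, and rows can hold different numbers of values
is_list_col <- is.list(issues_toy$labels_name)
n_labels <- lengths(issues_toy$labels_name)
```

The data frame stays rectangular (one row per issue) even though each issue carries a different number of labels.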
However, wrangling list columns is slightly different from wrangling ordinary values due to the extra level of containment. To help work effectively with these list columns, projmgr offers three helper functions: listcol_extract, listcol_filter, and listcol_pivot. These are particularly well-suited to cases where labels are used to encode issue metadata as some sort of key-value pair. For example, one might use a format like "priority:high" to denote importance or "blue-team" to denote responsibility.
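As a rough illustration of the key-value convention, the base R sketch below splits labels of the form "key:value" into a named vector. The label "type:bug" and the variable names are made up for this example, and the approach is only one possible way to parse such labels.

```r
# Hypothetical labels attached to a single issue
labels <- c("priority:high", "blue-team", "type:bug")

# Keep only labels that follow the key:value convention
kv <- grep(":", labels, fixed = TRUE, value = TRUE)

# Split each label on the separator and pull out the two halves
parts <- strsplit(kv, ":", fixed = TRUE)
keys <- vapply(parts, `[`, character(1), 1)
values <- vapply(parts, `[`, character(1), 2)

# Named vector mapping each key to its value
label_values <- setNames(values, keys)
```

Labels such as "blue-team" carry their meaning in a suffix rather than a separator, which is why the helper functions described here match on a regex instead.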
To see these in action, we will take a snapshot of issues from the RForwards repo.
forwards <- create_repo_ref("forwards", "tasks")
issues <- get_issues(forwards) %>% parse_issues()
As we can see, the labels_name column contains lists of entries.
library(dplyr)
select(issues, labels_name, number, title) %>% head()
#>                labels_name number
#> 1              survey-team     41
#> 2 help wanted, survey-team     40
#> 3                              35
#> 4                              33
#> 5                              32
#> 6                              31
#>                                                                              title
#> 1                                                       useR! 2018 survey analysis
#> 2                                        Create new Community section on Data page
#> 3                       Guidelines on Ableist language in Talks and Presentations.
#> 4                                             Rainbow R : LGBT+ in the R Community
#> 5 Inviting R community and event organizers from Africa and Asia to the RUG slack.
#> 6                                                      Joint event with Trans*Code
One common use of labels in this repo is to indicate the team responsible for completing a task, using a tag of the form "{name}-team".
unique(unlist(issues$labels_name))
#> [1] "survey-team"       "help wanted"       "conferences-team"
#> [4] "admin"             "on-ramps-team"     "branding"
#> [7] "teaching-team"     "social-media-team" "community-team"
The listcol_filter() function lets us filter our data to only the issues relevant to a certain list column entry. For example, the data currently contains 26 issues.
nrow(issues)
#> [1] 26
If we are only interested in issues that have been designated for a certain task force, we can filter to those ending with “-team”.
listcol_filter(issues, "labels_name", matches = "-team$", is_regex = TRUE) %>% nrow()
#> [1] 14
Even more specifically, if we are members of the teaching team and want to find issues we are responsible for, we can search for an exact match.
listcol_filter(issues, "labels_name", matches = "teaching-team") %>% nrow()
#> [1] 2
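For intuition about what filtering on a list column involves, here is a minimal base R sketch of the same idea on toy data. It is not projmgr's implementation; the helper `has_match()` and the data are hypothetical.

```r
# Toy data (hypothetical), mimicking the shape of parse_issues() output
issues_toy <- data.frame(number = 1:4)
issues_toy$labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0),
  "teaching-team"
)

# TRUE when at least one element of x matches, by regex or exactly
has_match <- function(x, pattern, is_regex = FALSE) {
  if (is_regex) any(grepl(pattern, x)) else any(x == pattern)
}

# One logical value per row, then ordinary row subsetting
team_rows <- vapply(issues_toy$labels_name, has_match, logical(1),
                    pattern = "-team$", is_regex = TRUE)
teaching_rows <- vapply(issues_toy$labels_name, has_match, logical(1),
                        pattern = "teaching-team")

team_issues <- issues_toy$number[team_rows]
teaching_issues <- issues_toy$number[teaching_rows]
```

The key step is collapsing each row's vector of labels to a single TRUE or FALSE, which is the "extra level of containment" that ordinary filtering does not have to handle.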
The listcol_extract() function creates a new column in the data by checking each element of the list column for a certain structure. For example, we can create a team column in our dataset by extracting the labels ending in "-team".
select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$") %>%
  head()
#>                labels_name number   team
#> 1              survey-team     41 survey
#> 2 help wanted, survey-team     40 survey
#> 3                              35   <NA>
#> 4                              33   <NA>
#> 5                              32   <NA>
#> 6                              31   <NA>
By default, the function names the new column a “cleaned-up” form of the regex used for matching, but this can be overridden with the new_col_name argument.
select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$", new_col_name = "team_name") %>%
  head()
#>                labels_name number team_name
#> 1              survey-team     41    survey
#> 2 help wanted, survey-team     40    survey
#> 3                              35      <NA>
#> 4                              33      <NA>
#> 5                              32      <NA>
#> 6                              31      <NA>
By default, the function also drops the part of each value matched by the regex. This is controlled by the keep_regex argument.
select(issues, labels_name, number) %>%
  listcol_extract("labels_name", regex = "-team$", keep_regex = TRUE) %>%
  head()
#>                labels_name number        team
#> 1              survey-team     41 survey-team
#> 2 help wanted, survey-team     40 survey-team
#> 3                              35        <NA>
#> 4                              33        <NA>
#> 5                              32        <NA>
#> 6                              31        <NA>
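The extraction logic can be approximated in a few lines of base R. The sketch below is an illustration only, under the assumption that at most one label per issue matches; `extract_first()` is a hypothetical helper, not a projmgr function.

```r
# Toy list column (hypothetical values)
labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0)
)

# Return the first label matching the regex, or NA when nothing matches.
# By default the matched part of the label is stripped from the result.
extract_first <- function(x, regex, keep_regex = FALSE) {
  hit <- grep(regex, x, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  if (keep_regex) hit[1] else sub(regex, "", hit[1])
}

team <- vapply(labels_name, extract_first, character(1), regex = "-team$")
team_full <- vapply(labels_name, extract_first, character(1),
                    regex = "-team$", keep_regex = TRUE)
```

Here `team` holds the stripped values and `team_full` the complete labels, mirroring the two outputs shown above.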
Unlike the above example, sometimes multiple items will match a given regex. In this case, a list-column is added to the dataset containing any matches.
For example, suppose an entry of the assignees_login field contains 4 logins, two of which contain the letter “d”. (In the snapshot used here, the third issue has no assignees, so the check below returns an empty vector.)
issues$assignees_login[[3]]
#> character(0)
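When several items match, listcol_extract() returns a list-column; an entry with two matching logins would have length 2. In this snapshot, none of the first six issues has a matching assignee, so every value shown below is NA.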
select(issues, assignees_login, number) %>%
  listcol_extract("assignees_login", regex = "d", keep_regex = TRUE) %>%
  head()
#>   assignees_login number  d
#> 1                     41 NA
#> 2                     40 NA
#> 3                     35 NA
#> 4                     33 NA
#> 5                     32 NA
#> 6                     31 NA
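Because the snapshot above has no matching assignees, the multiple-match case is easier to see on toy data. The base R sketch below uses invented logins and shows how keeping every match naturally produces a list with entries of different lengths; it is not the package's own code.

```r
# Hypothetical logins; the first issue has four assignees
assignees_login <- list(
  c("alice", "brenda", "dmitri", "carol"),
  character(0),
  "dana"
)

# Keep every login containing the letter "d" (zero, one, or many per issue)
d_matches <- lapply(assignees_login, function(x) grep("d", x, value = TRUE))

# Number of matches per issue
n_matches <- lengths(d_matches)
```

Since the number of matches varies by row, the result has to remain a list rather than a plain character vector.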
Finally, the listcol_pivot() helper function identifies all labels matching a regex, extracts the “values” from the key-value pairs, and pivots these into boolean columns. For example, the following code makes a widened dataframe with a separate column for each team. TRUE denotes that the team is responsible for the issue.
issues_by_team <- select(issues, number, labels_name) %>%
  listcol_pivot("labels_name",
                regex = "-team$",
                transform_fx = function(x) sub("-team", "", x),
                delete_orig = TRUE)
head(issues_by_team)
#>   number survey conferences on-ramps teaching social-media community
#> 1     41   TRUE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 2     40   TRUE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 3     35  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 4     33  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 5     32  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
#> 6     31  FALSE       FALSE    FALSE    FALSE        FALSE     FALSE
This has many convenient use-cases, including being able to quickly see the number of issues falling into each category.
issues_by_team %>%
  select(-number) %>%
  summarize_all(sum)
#>   survey conferences on-ramps teaching social-media community
#> 1      4           1        3        2            2         2
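The pivoting step can also be sketched in base R to show what the boolean columns represent. This is an illustrative approximation on hypothetical toy data, not how listcol_pivot() is implemented.

```r
# Toy data (hypothetical values)
issues_toy <- data.frame(number = 1:3)
labels_name <- list(
  "survey-team",
  c("help wanted", "teaching-team"),
  character(0)
)

# All distinct labels matching the pattern, and their cleaned-up names
team_labels <- grep("-team$", unique(unlist(labels_name)), value = TRUE)
teams <- sub("-team$", "", team_labels)

# One logical column per team: does the issue carry that team's label?
for (i in seq_along(teams)) {
  issues_toy[[teams[i]]] <- vapply(
    labels_name,
    function(x) team_labels[i] %in% x,
    logical(1)
  )
}

# Issues per team, analogous to the summary above
team_counts <- colSums(issues_toy[teams])
```

Summing a logical column counts its TRUE values, which is why the per-team totals fall out of a simple column sum.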
Besides these helper functions, another convenient way to work with list columns is to use tidyr::unnest() to create key tables mapping issue numbers (number) to label names (labels_name) or assignees (assignees_login).
library(tidyr)
For example, below we select only the issue number and label name columns and unnest them into one row per issue-label pair.
# Name the list column explicitly, as required by current versions of tidyr
issues_labels <- issues %>%
  select(number, labels_name) %>%
  unnest(cols = c(labels_name))
head(issues_labels)
#> # A tibble: 6 x 2
#>   number labels_name
#>    <int> <chr>
#> 1     41 survey-team
#> 2     40 help wanted
#> 3     40 survey-team
#> 4     28 conferences-team
#> 5     27 admin
#> 6     26 on-ramps-team
The same can be done to map between issue numbers and assignees.
# Name the list column explicitly, as required by current versions of tidyr
issues_assignees <- issues %>%
  select(number, assignees_login) %>%
  unnest(cols = c(assignees_login))
head(issues_assignees)
#> # A tibble: 6 x 2
#>   number assignees_login
#>    <int> <chr>
#> 1     28 hturner
#> 2     25 hturner
#> 3     25 emdodwell
#> 4     23 emdodwell
#> 5     22 emdodwell
#> 6     21 hturner
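For readers curious about what unnesting does to a list column, the base R sketch below builds an equivalent key table by hand on hypothetical toy data. It is meant as an illustration of the reshaping, not a replacement for tidyr::unnest().

```r
# Toy data (hypothetical values)
number <- c(41L, 40L, 35L)
labels_name <- list(
  "survey-team",
  c("help wanted", "survey-team"),
  character(0)
)

# Repeat each issue number once per label, then flatten the labels
issues_labels_toy <- data.frame(
  number = rep(number, lengths(labels_name)),
  labels_name = unlist(labels_name),
  stringsAsFactors = FALSE
)
```

Issues with no labels contribute zero rows, which matches the key tables above, where unlabeled issues such as number 35 do not appear.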
Logic can then be applied to these key tables to identify relevant issue numbers, which can be joined or filtered back onto the complete issues dataframe.
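As a sketch of that last step, the base R example below identifies issue numbers from a key table and then either filters or joins back to the issue-level data. The data frames and issue titles are hypothetical placeholders.

```r
# Toy issue-level data and key table (hypothetical values)
issues_toy <- data.frame(
  number = c(41L, 40L, 35L),
  title = c("Issue A", "Issue B", "Issue C"),
  stringsAsFactors = FALSE
)
key_toy <- data.frame(
  number = c(41L, 40L, 40L),
  labels_name = c("survey-team", "help wanted", "survey-team"),
  stringsAsFactors = FALSE
)

# Option 1: find the relevant issue numbers, then filter the full data
wanted <- unique(key_toy$number[key_toy$labels_name == "help wanted"])
filtered <- issues_toy[issues_toy$number %in% wanted, ]

# Option 2: join the filtered key table back on the issue number
joined <- merge(
  issues_toy,
  key_toy[key_toy$labels_name == "help wanted", ],
  by = "number"
)
```

Filtering keeps one row per issue, whereas joining can return one row per issue-label pair, so the better choice depends on whether the label itself is needed downstream.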