This vignette illustrates basic `projmgr` workflows.
```r
library(projmgr)
#> Warning: package 'projmgr' was built under R version 4.0.5
```
The first step to interacting with a GitHub repository is creating a repository reference. Repository references are “first class citizens” in `projmgr` and contain all authorization credentials. These references are passed as the first argument in all `get_` and `post_` functions.

Suppose, for example, we are interested in pulling data about issues in the `dplyr` repository. We would start off by creating a repository reference with the `create_repo_ref()` function.
```r
dplyr <- create_repo_ref('tidyverse', 'dplyr')
#> Requests will authenticate with GITHUB_PAT
```
Note that this function has many additional parameters that you can specify as needed. For example:

- if you are working with GitHub Enterprise, set `is_enterprise = TRUE` and pass your company’s internal GitHub URL in through `hostname` (see the sketch below)
- if you saved your PAT under a non-default name (including when `is_enterprise = TRUE`), the name you used can be passed through `identifier`, which lets you provide your credentials in different ways if you haven’t set up a PAT or it is called something else (see the GitHub PAT vignette or the `create_repo_ref()` documentation for more details)
The `check_` family of functions provides some basic validations that everything is working.
`check_internet()` tests for problems connecting to the internet.

```r
check_internet()
#> [1] TRUE
```
`check_credentials()` confirms the login associated with the PAT and describes the level of access that PAT has to the specified repo.
```r
check_credentials(dplyr)
#> -- With provided credentials --
#> + Login: emilyriederer
#> + Type: User
#> -- In the dplyr repo --
#> + Admin: FALSE
#> + Push: FALSE
#> + Pull: FALSE
```
`check_rate_limit()` checks how many more API requests you can send and when that count will reset. Note that this accepts a repo reference as a parameter only to get authentication information for an account; limits on requests are established at the account level, not the repository level.
```r
check_rate_limit(dplyr)
#> 4998 / 5000 (Resets at 22:26:20)
```
`get_` functions retrieve information from the GitHub API. The first argument is the repository reference, and additional named query parameters can be passed subsequently. For example, here we request issues in the first `milestone` with either an open or closed `state`.
(Note that, in keeping with the GitHub API, only open issues are returned when `state` is not specified. If you’re trying to build a productivity report, i.e. about everything that’s been completed, it’s very important to specify `state` as “closed” or “all”.)
```r
dplyr_issues_list <- get_issues(dplyr, milestone = 1, state = 'all')
```
If you don’t know what parameters are available, the `help_<function_name>` family provides more information on valid arguments to include in `get_` and `post_` functions.
```r
help_get_issues()
#> [1] "milestone" "state"     "assignee"  "creator"   "mentioned" "labels"
#> [7] "sort"      "direction" "since"
```
Or take a guess: the `get_` functions will check that all of your named query parameters are accepted by the API and throw an error for any that are unrecognized.
```r
get_issues(dplyr, not_a_real_parameter = 'abc')
#> Error: The following user-inputted variables are not relevant to this API request:
#> + not_a_real_parameter
#> Allowed variables are:
#> + milestone,state,assignee,creator,mentioned,labels,sort,direction,since
#> Please remove unallowed fields and try again.
#> Use the browse_docs() function or visit https://developer.github.com/v3/ for full API documentation.
```
As the `get_issues()` error message says, detailed documentation can also be viewed using the `browse_docs()` function, which launches your browser to the appropriate part of the GitHub API documentation. For example, one might run:
```r
browse_docs(action = 'get', object = 'issue')
#> Open URL https://developer.github.com/v3/issues/#list-issues-for-a-repository
```
Results are returned as a list, closely mirroring the JSON output from the actual API.
```r
str(dplyr_issues_list[[1]], max.level = 1)
#> List of 23
#>  $ url               : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229"
#>  $ repository_url    : chr "https://api.github.com/repos/tidyverse/dplyr"
#>  $ labels_url        : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/labels{/name}"
#>  $ comments_url      : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/comments"
#>  $ events_url        : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/events"
#>  $ html_url          : chr "https://github.com/tidyverse/dplyr/issues/1229"
#>  $ id                : int 89591755
#>  $ node_id           : chr "MDU6SXNzdWU4OTU5MTc1NQ=="
#>  $ number            : int 1229
#>  $ title             : chr "Pass custom functions to select()"
#>  $ user              :List of 18
#>  $ labels            :List of 1
#>  $ state             : chr "closed"
#>  $ locked            : logi TRUE
#>  $ assignee          : NULL
#>  $ assignees         : list()
#>  $ milestone         :List of 16
#>  $ comments          : int 1
#>  $ created_at        : chr "2015-06-19T15:26:31Z"
#>  $ updated_at        : chr "2018-06-08T13:56:05Z"
#>  $ closed_at         : chr "2017-02-02T21:10:56Z"
#>  $ author_association: chr "NONE"
#>  $ body              : chr "Right now, you're limited in column selection when using select() and magrittr.\n\nConsider the situation where"| __truncated__
```
You will likely prefer to work with these results as dataframes instead. The `parse_` family of functions converts the “raw” list output from `get_` into a dataframe.
```r
dplyr_issues <- parse_issues(dplyr_issues_list)
head(dplyr_issues)
#>                                              url       id number
#> 1 https://github.com/tidyverse/dplyr/issues/1229 89591755   1229
#> 2 https://github.com/tidyverse/dplyr/issues/1183 82642129   1183
#> 3 https://github.com/tidyverse/dplyr/issues/1039 64353442   1039
#> 4  https://github.com/tidyverse/dplyr/issues/741 47540964    741
#> 5  https://github.com/tidyverse/dplyr/issues/549 40514628    549
#> 6  https://github.com/tidyverse/dplyr/issues/511 38883873    511
#>                                                                     title
#> 1                                       Pass custom functions to select()
#> 2                                             hybrid handler for quantile
#> 3                       Should dplyr commands gobble empty last argument?
#> 4                                                       Antonym to filter
#> 5                                          equivalent of pig's ILLUSTRATE
#> 6 SQL error, when the right hand side of %in% is of length one in filter
#>      user_login user_id  state locked milestone_title milestone_id
#> 1  msjgriffiths 1093821 closed   TRUE         bluesky       491295
#> 2 matthieugomez 6223837 closed   TRUE         bluesky       491295
#> 3        rpruim  722231 closed   TRUE         bluesky       491295
#> 4 matthieugomez 6223837 closed   TRUE         bluesky       491295
#> 5       jhofman   79563 closed   TRUE         bluesky       491295
#> 6       vzemlys  320871 closed   TRUE         bluesky       491295
#>   milestone_number milestone_state milestone_created_at milestone_closed_at
#> 1                1            open           2013-11-20                <NA>
#> 2                1            open           2013-11-20                <NA>
#> 3                1            open           2013-11-20                <NA>
#> 4                1            open           2013-11-20                <NA>
#> 5                1            open           2013-11-20                <NA>
#> 6                1            open           2013-11-20                <NA>
#>   milestone_due_on n_comments created_at updated_at  closed_at
#> 1             <NA>          1 2015-06-19 2018-06-08 2017-02-02
#> 2             <NA>          4 2015-05-30 2018-06-08 2017-02-02
#> 3             <NA>          7 2015-03-25 2018-06-08 2017-03-24
#> 4             <NA>          3 2014-11-02 2018-06-08 2017-02-02
#> 5             <NA>         10 2014-08-18 2018-06-08 2017-02-02
#> 6             <NA>         16 2014-07-28 2018-06-08 2017-02-14
#>   author_association
#> 1               NONE
#> 2               NONE
#> 3               NONE
#> 4               NONE
#> 5               NONE
#> 6               NONE
#>                                                                       body
#> 1 Right now, you're limited in column selection when using select() an...
#> 2 It would be great to have an hybrid hander for quantile. For now it ...
#> 3 It would be nice to allow a trailing comma in commands like the exam...
#> 4 It would be useful to have a function, say `discard`, that discards ...
#> 5 pig has a great command called ILLUSTRATE that demonstrates how a sa...
#> 6 When filtering the table from the database the error is produced if ...
#>   repo_owner repo_name labels_name assignees_login
#> 1       <NA>      <NA>     feature                
#> 2       <NA>      <NA>                            
#> 3       <NA>      <NA>     feature hadley, lionel-
#> 4       <NA>      <NA>                            
#> 5       <NA>      <NA>     feature          hadley
#> 6       <NA>      <NA>  bug :bomb:          hadley
```
Columns that can contain multiple values (e.g. multiple label names or assignees) appear as lists.
```r
head(dplyr_issues[, c("labels_name", "assignees_login")])
#>   labels_name assignees_login
#> 1     feature                
#> 2                            
#> 3     feature hadley, lionel-
#> 4                            
#> 5     feature          hadley
#> 6  bug :bomb:          hadley

dplyr_issues$assignees_login
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "hadley"  "lionel-"
#> 
#> [[4]]
#> character(0)
#> 
#> [[5]]
#> [1] "hadley"
#> 
#> [[6]]
#> [1] "hadley"
#> 
#> [[7]]
#> [1] "hadley"
#> 
#> [[8]]
#> [1] "hadley"
#> 
#> [[9]]
#> character(0)
#> 
#> [[10]]
#> character(0)
#> 
#> [[11]]
#> character(0)
#> 
#> [[12]]
#> [1] "hadley"
#> 
#> [[13]]
#> character(0)
#> 
#> [[14]]
#> character(0)
#> 
#> [[15]]
#> character(0)
#> 
#> [[16]]
#> character(0)
```
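If you prefer to query these list-columns directly rather than unnesting, standard base R idioms apply. As a sketch, the small helper below is our own illustration (not part of `projmgr`):

```r
# Illustrative helper (not part of projmgr): flag rows whose assignee
# list contains a given user
has_assignee <- function(assignee_list, user) {
  vapply(assignee_list, function(a) user %in% a, logical(1))
}

# Issues assigned to hadley
dplyr_issues[has_assignee(dplyr_issues$assignees_login, "hadley"),
             c("number", "title")]
```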
Dataframes can be “expanded” to the issue-assignee or issue-label granularity using the `tidyr::unnest()` function. For example, the following code expands the dataframe to one row per issue-assignee; after running it, you can see that the first two rows refer to the same issue but different assignees.
```r
dplyr_issues %>%
  tidyr::unnest(assignees_login) %>%
  dplyr::select(number, title, assignees_login) %>%
  head()
```
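Once unnested, any standard `dplyr` verbs apply. As one possible follow-up (a sketch, not from the original example), you could count issues per assignee:

```r
# Sketch: count issues per assignee after unnesting the list-column
dplyr_issues %>%
  tidyr::unnest(assignees_login) %>%
  dplyr::count(assignees_login, sort = TRUE)
```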
In summary, the general process is calling a `get_` function followed by a `parse_` function. As a second example, we will get milestones.
```r
dplyr_milestones <-
  get_milestones(dplyr, state = 'all') %>%
  parse_milestones()
```
The `report_` function family offers an alternative to visualizations. These functions generate HTML for aesthetic output in R Markdown reports. Output is automatically tagged so `knitr` knows to interpret it as HTML, so it is not necessary to manually add the `results = 'asis'` chunk option. (Don’t worry if you don’t know what this means. You don’t need to do anything!)
```r
report_progress(dplyr_issues)
```
The `post_` function family helps add new objects to a GitHub repo. For example, the following command adds a new issue to a repository. After posting new content, `post_` functions return the identification number for the new object.
```r
experigit <- create_repo_ref('emilyriederer', 'experigit')

post_issue(experigit,
           title = "Add unit tests for post_issues when title duplicated",
           body = "Check that code appropriately warns users when attempting to post a duplicate issue",
           labels = c("enhancement", "test"),
           assignees = "emilyriederer")
#> [1] 150
```
The GitHub API allows multiple issues to have the same title. However, you may want to disable this behavior (for example, if a `post_` function is in a script that may be re-run). In this case, the `distinct` parameter lets you choose whether or not to allow posting new issues with the same title as existing open issues. When `distinct = TRUE` (as it is by default), the function throws an error and does not post the issue.
```r
post_issue(experigit, title = "Add unit tests for post_issues when title duplicated")
#> Error: New issue title is not distinct with current open issues. Please change title or set distinct = FALSE.
```
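As the error message suggests, passing `distinct = FALSE` would permit the duplicate title. A sketch (not run here):

```r
# As the error suggests, distinct = FALSE permits a duplicate title (not run)
post_issue(experigit,
           title = "Add unit tests for post_issues when title duplicated",
           distinct = FALSE)
```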