This vignette illustrates basic `projmgr` workflows.
```r
library(projmgr)
#> Warning: package 'projmgr' was built under R version 4.0.5
```
The first step to interacting with a GitHub repository is creating a repository reference. Repository references are “first class citizens” in `projmgr` and contain all authorization credentials. These references are passed as the first argument in all `get_` and `post_` functions.

Suppose, for example, we are interested in pulling data about issues in the `dplyr` repository. We would start off by creating a repository reference with the `create_repo_ref()` function.
```r
dplyr <- create_repo_ref('tidyverse', 'dplyr')
#> Requests will authenticate with GITHUB_PAT
```
Note that this function has many additional parameters that you can specify as needed. For example:

- if you are working with GitHub Enterprise, set `is_enterprise = TRUE` and pass your company’s internal GitHub URL in through `hostname` (see the sketch below)
- if you saved your PAT under a non-default name (including when `is_enterprise = TRUE`), the name you used can be passed through `identifier`, which lets you provide your credentials in different ways if you haven’t set up a PAT or it is called something else (see the GitHub PAT vignette or the `create_repo_ref()` documentation for more details)
The `check_` family of functions provides some basic validations that everything is working.
`check_internet()` tests for problems connecting to the internet.

```r
check_internet()
#> [1] TRUE
```
`check_credentials()` confirms the login associated with the PAT and describes the level of access that PAT has to the specified repo.
```r
check_credentials(dplyr)
#> -- With provided credentials --
#> + Login: emilyriederer
#> + Type: User
#> -- In the dplyr repo --
#> + Admin: FALSE
#> + Push: FALSE
#> + Pull: FALSE
```
`check_rate_limit()` checks how many more API requests you can send and when that count will reset. Note that this accepts a repo reference as a parameter only to get authentication information for an account; limits on requests are established at the account level, not the repository level.
```r
check_rate_limit(dplyr)
#> 4998 / 5000 (Resets at 22:26:20)
```
`get_` functions retrieve information from the GitHub API. The first argument is the repository reference, and additional named query parameters can be passed subsequently. For example, here we request issues in the first `milestone` with either an open or closed `state`.
(Note that, in keeping with the GitHub API, only open issues are returned when `state` is not specified. If you’re trying to build a productivity report, i.e. about everything that’s been completed, it’s very important to specify `state` as “closed” or “all”.)
```r
dplyr_issues_list <- get_issues(dplyr, milestone = 1, state = 'all')
```
If you don’t know what parameters are available, the `help_<function_name>` family provides more information on valid arguments to include in `get_` and `post_` functions.
```r
help_get_issues()
#> [1] "milestone" "state"     "assignee"  "creator"   "mentioned" "labels"
#> [7] "sort"      "direction" "since"
```
Or take a guess: the `get_` functions will check that all of your named query parameters are accepted by the API and throw an error for any that are unrecognized.
```r
get_issues(dplyr, not_a_real_parameter = 'abc')
#> Error: The following user-inputted variables are not relevant to this API request:
#> + not_a_real_parameter
#> Allowed variables are:
#> + milestone,state,assignee,creator,mentioned,labels,sort,direction,since
#> Please remove unallowed fields and try again.
#> Use the browse_docs() function or visit https://developer.github.com/v3/ for full API documentation.
```
As the `get_issues()` error message says, detailed documentation can also be viewed using the `browse_docs()` function, which launches your browser to the appropriate part of the GitHub API documentation. For example, one might run:
```r
browse_docs(action = 'get', object = 'issue')
#> Open URL https://developer.github.com/v3/issues/#list-issues-for-a-repository
```
Results are returned as a list, closely mirroring the JSON output from the actual API.
```r
str(dplyr_issues_list[[1]], max.level = 1)
#> List of 23
#>  $ url               : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229"
#>  $ repository_url    : chr "https://api.github.com/repos/tidyverse/dplyr"
#>  $ labels_url        : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/labels{/name}"
#>  $ comments_url      : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/comments"
#>  $ events_url        : chr "https://api.github.com/repos/tidyverse/dplyr/issues/1229/events"
#>  $ html_url          : chr "https://github.com/tidyverse/dplyr/issues/1229"
#>  $ id                : int 89591755
#>  $ node_id           : chr "MDU6SXNzdWU4OTU5MTc1NQ=="
#>  $ number            : int 1229
#>  $ title             : chr "Pass custom functions to select()"
#>  $ user              :List of 18
#>  $ labels            :List of 1
#>  $ state             : chr "closed"
#>  $ locked            : logi TRUE
#>  $ assignee          : NULL
#>  $ assignees         : list()
#>  $ milestone         :List of 16
#>  $ comments          : int 1
#>  $ created_at        : chr "2015-06-19T15:26:31Z"
#>  $ updated_at        : chr "2018-06-08T13:56:05Z"
#>  $ closed_at         : chr "2017-02-02T21:10:56Z"
#>  $ author_association: chr "NONE"
#>  $ body              : chr "Right now, you're limited in column selection when using select() and magrittr.\n\nConsider the situation where"| __truncated__
```
You will likely prefer to work with these results as dataframes instead. The `parse_` family of functions converts the “raw” list output from `get_` into a dataframe.
```r
dplyr_issues <- parse_issues(dplyr_issues_list)
head(dplyr_issues)
#>                                              url       id number
#> 1 https://github.com/tidyverse/dplyr/issues/1229 89591755   1229
#> 2 https://github.com/tidyverse/dplyr/issues/1183 82642129   1183
#> 3 https://github.com/tidyverse/dplyr/issues/1039 64353442   1039
#> 4  https://github.com/tidyverse/dplyr/issues/741 47540964    741
#> 5  https://github.com/tidyverse/dplyr/issues/549 40514628    549
#> 6  https://github.com/tidyverse/dplyr/issues/511 38883873    511
#>                                                                     title
#> 1                                       Pass custom functions to select()
#> 2                                             hybrid handler for quantile
#> 3                       Should dplyr commands gobble empty last argument?
#> 4                                                       Antonym to filter
#> 5                                          equivalent of pig's ILLUSTRATE
#> 6 SQL error, when the right hand side of %in% is of length one in filter
#>      user_login user_id  state locked milestone_title milestone_id
#> 1  msjgriffiths 1093821 closed   TRUE         bluesky       491295
#> 2 matthieugomez 6223837 closed   TRUE         bluesky       491295
#> 3        rpruim  722231 closed   TRUE         bluesky       491295
#> 4 matthieugomez 6223837 closed   TRUE         bluesky       491295
#> 5       jhofman   79563 closed   TRUE         bluesky       491295
#> 6       vzemlys  320871 closed   TRUE         bluesky       491295
#>   milestone_number milestone_state milestone_created_at milestone_closed_at
#> 1                1            open           2013-11-20                <NA>
#> 2                1            open           2013-11-20                <NA>
#> 3                1            open           2013-11-20                <NA>
#> 4                1            open           2013-11-20                <NA>
#> 5                1            open           2013-11-20                <NA>
#> 6                1            open           2013-11-20                <NA>
#>   milestone_due_on n_comments created_at updated_at  closed_at
#> 1             <NA>          1 2015-06-19 2018-06-08 2017-02-02
#> 2             <NA>          4 2015-05-30 2018-06-08 2017-02-02
#> 3             <NA>          7 2015-03-25 2018-06-08 2017-03-24
#> 4             <NA>          3 2014-11-02 2018-06-08 2017-02-02
#> 5             <NA>         10 2014-08-18 2018-06-08 2017-02-02
#> 6             <NA>         16 2014-07-28 2018-06-08 2017-02-14
#>   author_association
#> 1               NONE
#> 2               NONE
#> 3               NONE
#> 4               NONE
#> 5               NONE
#> 6               NONE
#>                                                                       body
#> 1 Right now, you're limited in column selection when using select() an...
#> 2 It would be great to have an hybrid hander for quantile. For now it ...
#> 3 It would be nice to allow a trailing comma in commands like the exam...
#> 4 It would be useful to have a function, say `discard`, that discards ...
#> 5 pig has a great command called ILLUSTRATE that demonstrates how a sa...
#> 6 When filtering the table from the database the error is produced if ...
#>   repo_owner repo_name labels_name assignees_login
#> 1       <NA>      <NA>     feature                
#> 2       <NA>      <NA>                            
#> 3       <NA>      <NA>     feature hadley, lionel-
#> 4       <NA>      <NA>                            
#> 5       <NA>      <NA>     feature          hadley
#> 6       <NA>      <NA>  bug :bomb:          hadley
```
Columns that can contain multiple values (e.g. multiple label names or assignees) appear as lists.
```r
head(dplyr_issues[, c("labels_name", "assignees_login")])
#>   labels_name assignees_login
#> 1     feature                
#> 2                            
#> 3     feature hadley, lionel-
#> 4                            
#> 5     feature          hadley
#> 6  bug :bomb:          hadley

dplyr_issues$assignees_login
#> [[1]]
#> character(0)
#> 
#> [[2]]
#> character(0)
#> 
#> [[3]]
#> [1] "hadley"  "lionel-"
#> 
#> [[4]]
#> character(0)
#> 
#> [[5]]
#> [1] "hadley"
#> 
#> [[6]]
#> [1] "hadley"
#> 
#> [[7]]
#> [1] "hadley"
#> 
#> [[8]]
#> [1] "hadley"
#> 
#> [[9]]
#> character(0)
#> 
#> [[10]]
#> character(0)
#> 
#> [[11]]
#> character(0)
#> 
#> [[12]]
#> [1] "hadley"
#> 
#> [[13]]
#> character(0)
#> 
#> [[14]]
#> character(0)
#> 
#> [[15]]
#> character(0)
#> 
#> [[16]]
#> character(0)
```
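If you prefer to query these list-columns directly rather than unnesting, standard base R idioms apply. As a sketch, the small helper below is our own illustration (not part of `projmgr`):

```r
# Illustrative helper (not part of projmgr): flag rows whose assignee
# list contains a given user
has_assignee <- function(assignee_list, user) {
  vapply(assignee_list, function(a) user %in% a, logical(1))
}

# Issues assigned to hadley
dplyr_issues[has_assignee(dplyr_issues$assignees_login, "hadley"),
             c("number", "title")]
```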
Dataframes can be “expanded” to the issue-assignee or issue-label granularity using the `tidyr::unnest()` function. For example, the following code expands the dataframe to one row per issue-assignee; after running it, you can see that the first two rows refer to the same issue but different assignees.
```r
dplyr_issues %>%
  tidyr::unnest(assignees_login) %>%
  dplyr::select(number, title, assignees_login) %>%
  head()
```
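Once unnested, any standard `dplyr` verbs apply. As one possible follow-up (a sketch, not from the original example), you could count issues per assignee:

```r
# Sketch: count issues per assignee after unnesting the list-column
dplyr_issues %>%
  tidyr::unnest(assignees_login) %>%
  dplyr::count(assignees_login, sort = TRUE)
```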
In summary, the general process is calling a `get_` function followed by a `parse_` function. As a second example, we will get milestones.
```r
dplyr_milestones <-
  get_milestones(dplyr, state = 'all') %>%
  parse_milestones()
```
The `report_` function family offers an alternative to visualizations. These functions generate HTML for aesthetic output in R Markdown reports. Output is automatically tagged so `knitr` knows to interpret it as HTML, so it is not necessary to manually add the `results = 'asis'` chunk option. (Don’t worry if you don’t know what this means. You don’t need to do anything!)
```r
report_progress(dplyr_issues)
```
The `post_` function family helps add new objects to a GitHub repo. For example, the following command adds a new issue to a repository. After posting new content, `post_` functions return the identification number for the new object.
```r
experigit <- create_repo_ref('emilyriederer', 'experigit')

post_issue(experigit,
           title = "Add unit tests for post_issues when title duplicated",
           body = "Check that code appropriately warns users when attempting to post a duplicate issue",
           labels = c("enhancement", "test"),
           assignees = "emilyriederer")
#> [1] 150
```
The GitHub API allows multiple issues to have the same title. However, you may want to disable this behavior (for example, if a `post_` function is in a script that may be re-run). In this case, the `distinct` parameter lets you choose whether or not to allow posting new issues with the same title as existing open issues. When `distinct = TRUE` (as it is by default), the function throws an error and does not post the issue.
```r
post_issue(experigit, title = "Add unit tests for post_issues when title duplicated")
#> Error: New issue title is not distinct with current open issues. Please change title or set distinct = FALSE.
```
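As the error message suggests, passing `distinct = FALSE` would permit the duplicate title. A sketch (not run here):

```r
# As the error suggests, distinct = FALSE permits a duplicate title (not run)
post_issue(experigit,
           title = "Add unit tests for post_issues when title duplicated",
           distinct = FALSE)
```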