--- title: "brief-introduction-to-select-syntax" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{brief-introduction-to-select-syntax} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Version Note: Up-to-date with v0.3.0 ```{r, include = FALSE} knitr::opts_chunk$set(message=FALSE,warning = FALSE, comment = NA) old.hooks <- fansi::set_knit_hooks(knitr::knit_hooks) ``` ```{r setup} library(psycModel) library(dplyr) ``` This article briefly introduces the usage of `dplyr::select`, and how it is applied to this package. In the first section, I will briefly describe `dplyr::select` syntax for R-beginners. If you are already familiar with `dplyr::select` syntax, then you can skip to the next section where I describe how to apply the syntax in this pacakge. # Introduction to the select syntax `dplyr::select` (abbreviated as `select` hereafter) is an extremely power function for R. It allows you to subset columns with the a set of syntax that is also known as the `select` syntax / semantics in the R community. A side note here. With the new introduction of `dplyr::across` function, the `select` syntax can be applied to `dplyr::mutate` and `dplyr::filter` where make these two already powerful function even more powerful. I will first introduce the usage of `:`, `c()` and `-`. Then, I will discuss how to use `everything`, `starts_with`, `end_with`, `contains`, and `where`. This is not an exhaustive list of the `select` syntax, but there are the most relevant one. If you want to learn more, I encourage you to check the vignette of `dplyr` or just google it. There are tons of article that discuss this in detail. ## Basic select syntax I am going to use the `iris` dataset for the demonstration. Let's take a quick peek of the dataset. ```{r} iris %>% head() # head() show the first 5 rows of the data frame ``` If I want to select the first 3 columns, you can use `:` to do that ```{r} iris %>% select(1:3) %>% head(1) iris %>% select(Sepal.Length:Petal.Length) %>% head(1) ``` Next, if you want to combine selection then you can use `c()`. For example, I want the 1st, 3rd and 4th columns. Then, you can do it like this ```{r} iris %>% select(c(1, 3:4)) %>% head(1) iris %>% select(Sepal.Length, Petal.Length:Petal.Width) %>% head(1) ``` Finally, if you want to delete a column from selection, then you can use `-`. For example, you want to select all columns except the 3rd column, then you can do it like this ```{r} iris %>% select(1:5, -3) %>% head(1) iris %>% select(Sepal.Length:Species, -Petal.Length) %>% head(1) ``` ## Tidyselect helpers Ok. Now you understand the basic usage. Let's get to something a little bit more advanced. First, let's talk about my favorite which is `everything`. As the name entails, it select all the variables in the data frame. It is usually used in combination with `c()` if you are using in `select` function. However, it is very powerful in other use cases like the one in this package. For example, you want to fit a linear regression with all the variables, then you can use `everything` (a more detailed discussion is presented in the next section). ```{r} # select all columns iris %>% select(everything()) %>% head(1) # select everything except Sepal.Width iris %>% select(c(everything(),-Sepal.Width)) %>% head(1) ``` Next, we can talk about `starts_with`. `starts_with` select all columns that is starts with a certain specified string. For example, we want to select all columns start with Sepal, then we can do something like this ```{r} iris %>% select(starts_with('Sepal')) %>% head(1) ``` Similar to `starts_with`, `ends_with` select all columns that is ends with a certain specified string. For example, we want to select all columns ends with Width. ```{r} iris %>% select(ends_with('Width')) %>% head(1) ``` Next, we are going talk about `contains`. As the name entails, it select all columns that contains a specified string. ```{r} iris %>% select(contains('Sepal')) %>% head(1) # same as starts_with iris %>% select(contains('Width')) %>% head(1) # same as ends_with iris %>% select(contains('.')) %>% head(1) # contains "." will be selected ``` Finally, we are going to conclude this section with `where`. `where` is not used alone. It is usually pair with a function return `TRUE` or `FALSE`. I think the most common use case for this package is paired with `is.numeric`. `where(is.numeric)` will select all numeric variables. A little tip, you need to pass `is.numeric` instead of `is.numeric()`. I will not go into the detail of why because this is out of the scope of this article. It required a little bit more advanced understanding of how function work in R. If you have that, you wouldn't reading this article anyway. ```{r} iris %>% select(where(is.numeric)) %>% head(1) ``` # Applying select syntax in this package First, I will demonstrate the usage of linear regression. I will first create a data frame. You don't need to know anything about how this data frame is created. Just know that it has 1 DV / outcome / response variable (i.e, y) and 5 IV / predictor variable (i.e, x1 to x5) ```{r} set.seed(1) test_data = data.frame(y = rnorm(n = 100,mean = 2,sd = 3), x1 = rnorm(n = 100,mean = 1.5, sd = 4), x2 = rnorm(n = 100,mean = 1.7, sd = 4), x3 = rnorm(n = 100,mean = 1.5, sd = 4), x4 = rnorm(n = 100,mean = 2, sd = 4), x5 = rnorm(n = 100,mean = 1.5, sd = 4)) ``` Ok, let's fit that linear regression now. ```{r} # Without this package: model1 = lm(data = test_data, formula = y ~ x1 + x2 + x3 + x4 + x5) # With this package: model2 = lm_model(data = test_data, response_variable = y, predictor_variable = c(everything(),-y)) ``` This is already a step up from the basic `lm()` function. We can still make is even simpler by just passing `everyhing()`. The function is designed to remove the response variable from predictor variables (if selected) automatically. The following `model3` is the same as `model2` ```{r} model3 = lm_model(data = test_data, response_variable = y, predictor_variable = everything()) ``` The same logic is applied to all other functions in this package. Arguments that support `dplyr::select` syntax will ends with "support dplyr::select syntax" in the description of the argument.That's it for this brief introduction. If you want to learn more about this package, I encourage you to check out this [article](https://jasonmoy28.github.io/psycModel/articles/quick-introduction.html) or use `vignette('quick-introduction')` if you are in R Studio.