MLBA Lab 1

MLBA Tools & Lab Setup

Ilia Azizi

03.03.2024

Objectives

  • Learn about version control, Git & GitHub.
  • Use R and RStudio.
  • Learn about virtual environments in R to ensure reproducibility.
  • Learn about using Python 🐍 in (and with) R. This is useful for some ML lab sessions, since cutting-edge ML is often implemented in Python first.

Git & GitHub for Collaboration

  • What is Git? Git is a version-control system that helps track changes, collaborate on projects, and revert to previous versions of a file.
  • What is GitHub? GitHub provides a cloud-based hosting service that lets you manage Git repositories.
  • Git vs. GitHub: Git is the version-control technology that manages source code history, while GitHub is one of several hosting services for Git repositories.

Creating a Repository on GitHub

  1. Navigate to GitHub and create a new repository.
  2. Choose a name and description for your repository.
  3. Select whether the repository is public or private.
  4. Click “Create repository.”

Install the GitHub Desktop app to make working with GitHub easier. See also our FAQ on obtaining professional accounts.

GitHub Workflow

  • Cloning a Repository with GitHub Desktop
    1. Open GitHub Desktop and clone the repository to your local machine.
    2. This creates a local copy of the repository for work and synchronization.
  • Committing Changes
    1. Make changes to your files in the project directory.
    2. Use GitHub Desktop to commit these changes, adding a meaningful commit message.
  • Push and Pull Changes
    1. Push your committed changes to GitHub to share with collaborators.
    2. Pull changes made by others to keep your local repository up to date.
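
For those curious about what GitHub Desktop does behind the buttons, the clone, commit, push and pull steps above correspond to Git commands of the same names. The following is a minimal sketch that drives them from Python, assuming the `git` command-line tool is installed. The folder names (`alice`, `bob`), the file `analysis.R`, the commit messages and the local bare repository standing in for GitHub are all made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Identity and signing settings are passed per command, so the sketch does
# not depend on (or modify) any global Git configuration.
GIT = ["git", "-c", "user.name=Demo User", "-c", "user.email=demo@example.com",
       "-c", "commit.gpgsign=false"]


def git(*args, cwd):
    """Run one git command inside `cwd` and return its standard output."""
    result = subprocess.run(GIT + list(args), cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def demo_workflow():
    """Clone, commit, push and pull against a throwaway local 'remote'."""
    if shutil.which("git") is None:
        return None  # Git is not installed; nothing to demonstrate.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        # A bare repository stands in for the repository hosted on GitHub.
        git("init", "--bare", "remote.git", cwd=tmp)

        # Clone: create a local working copy.
        git("clone", "remote.git", "alice", cwd=tmp)
        alice = tmp / "alice"

        # Commit: record a change together with a meaningful message.
        (alice / "analysis.R").write_text("print('hello')\n")
        git("add", "analysis.R", cwd=alice)
        git("commit", "-m", "Add analysis script", cwd=alice)

        # Push: share the commit with collaborators.
        git("push", "origin", "HEAD", cwd=alice)

        # A collaborator clones; the first user then pushes another change.
        git("clone", "remote.git", "bob", cwd=tmp)
        bob = tmp / "bob"
        (alice / "analysis.R").write_text("print('hello again')\n")
        git("commit", "-am", "Update greeting", cwd=alice)
        git("push", "origin", "HEAD", cwd=alice)

        # Pull: bring the collaborator's copy up to date.
        git("pull", cwd=bob)
        # Commit messages as seen by the collaborator, newest first.
        return git("log", "--format=%s", cwd=bob).splitlines()


history = demo_workflow()
print(history)
```

If Git is available, the printed list should contain both commit messages, which shows that the second copy received the changes pushed from the first.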

Using RStudio Projects

  • RStudio projects simplify the management of R source code.
  • Use the {here} package for easy file path management within projects.
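
The idea behind {here} is that file paths are built relative to the project root rather than to the current working directory, so `here::here("data", "iris.csv")` points to the same file no matter which subfolder a script runs from. The Python sketch below only illustrates that idea and is not how the package is implemented. The marker list, the function names and the folder layout are assumptions made for this example.

```python
import tempfile
from pathlib import Path

# Files or folders that signal a project root (assumed list for this sketch).
ROOT_MARKERS = (".here", ".git")


def find_project_root(start):
    """Walk upwards from `start` until a directory holding a marker is found."""
    start = Path(start).resolve()
    for folder in (start, *start.parents):
        if any((folder / marker).exists() for marker in ROOT_MARKERS):
            return folder
        # An RStudio project file (*.Rproj) also marks the root.
        if any(folder.glob("*.Rproj")):
            return folder
    raise FileNotFoundError(f"no project root found above {start}")


def project_path(start, *parts):
    """Build a path relative to the project root, wherever `start` is."""
    return find_project_root(start).joinpath(*parts)


# Usage: a throwaway project with a nested scripts folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "my_project"
    (root / "scripts" / "lab1").mkdir(parents=True)
    (root / "my_project.Rproj").touch()
    found = project_path(root / "scripts" / "lab1", "data", "iris.csv")
    relative = found.relative_to(root).as_posix()
    print(relative)  # data/iris.csv
```

Even though the lookup starts two folders below the root, the resulting path is anchored at the folder containing the `.Rproj` file.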

Virtual environments in R (renv)

The What & The Why

  • renv is a package management tool that helps you manage the packages used in an R project.
  • Ensures that your project is reproducible.
  • Provides a consistent environment by isolating the packages used in your project.
  • Simplifies installation and setup.
  • Helps you avoid compatibility issues.
  • Makes it easy to share your work with others.

The How

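  1. Create a new renv project with renv::init().
  2. Use renv::restore() to install the package versions recorded in the renv.lock file.
  3. Use renv::snapshot() to record the packages currently used by the project (and their versions) in renv.lock, for example after installing or updating a package.
  4. Use renv::status() to check whether renv.lock is out of sync with the project and needs updating.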
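
To make the role of renv.lock more concrete: it is a JSON file that lists each package together with its pinned version. The Python sketch below is a conceptual illustration only and not how renv works internally. The lockfile content is hypothetical and heavily trimmed, and the helper names and the "installed" versions are invented for the example.

```python
import json

# Hypothetical, heavily trimmed renv.lock content (real files hold more fields).
LOCKFILE_TEXT = """
{
  "R": {"Version": "4.3.2"},
  "Packages": {
    "here": {"Package": "here", "Version": "1.0.1", "Source": "Repository"},
    "reticulate": {"Package": "reticulate", "Version": "1.35.0", "Source": "Repository"}
  }
}
"""


def pinned_versions(lockfile_text):
    """Return {package name: version} as recorded in a lockfile."""
    lock = json.loads(lockfile_text)
    return {name: entry["Version"] for name, entry in lock["Packages"].items()}


def out_of_sync(pinned, installed):
    """List packages whose installed version differs from the pinned one."""
    return sorted(name for name, version in pinned.items()
                  if installed.get(name) != version)


pinned = pinned_versions(LOCKFILE_TEXT)
# Hypothetical library state: reticulate was upgraded after the last snapshot.
installed = {"here": "1.0.1", "reticulate": "1.36.0"}
changed = out_of_sync(pinned, installed)
print(pinned)
print(changed)  # ['reticulate']
```

In renv terms, a mismatch like this is what renv::status() reports; renv::snapshot() would write the new version into the lockfile, while renv::restore() would bring the library back to the pinned version.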

Python 🐍 in R (reticulate)

Small Motivation

  • Python is arguably in higher demand than R for machine learning.
  • Widely-used language in the industry.
  • Powerful libraries for data manipulation, analysis, and modeling.
  • Relatively easy to pick up even for beginners to programming.
  • Combining the strengths of both R and Python can enhance your skills.

Configuration

  1. Install the reticulate package in R.
  2. Use reticulate::use_python() or reticulate::use_condaenv() to specify the location of your Python environment.
reticulate::use_condaenv("MLBA")
reticulate::py_config()
  3. Use reticulate::import() to import Python modules in R.
pd <- reticulate::import("pandas")
  4. Use reticulate::py_run_string() to execute Python code in R.
reticulate::py_run_string("x = 3; y = 4; print('The sum of x and y is:', x + y)")
#> The sum of x and y is: 7

Running Python code in R

  • To run Python code in an R Markdown (or Quarto) document, start the code chunk with {python}.
```{python}
my_dict = {'a' : 3, 'b' : 5, 'c' : 6} 
```
  • To access R objects in Python, use r.OBJECT_NAME.
```{r}
my_list <- list(a = 1, b = 2, c = 3)
```
```{python}
print(r.my_list['b'])
```
  • To access Python objects in R, use py$OBJECT_NAME.
```{r}
print(py$my_dict$b)
```

Object casting in Python & R

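  • reticulate converts between R and Python types automatically by default (for example, a named R list becomes a Python dict). Use reticulate::r_to_py() and reticulate::py_to_r() to convert objects explicitly.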

Example: Plotting with R & Python

Load some data
## load mtcars dataset
data(mtcars)

 

Plotting with base R
# Using base R plot
plot(mtcars$mpg, mtcars$disp)

Plotting with Python within R
# Using `matplotlib` through reticulate
plt <- reticulate::import("matplotlib.pyplot")
plt$scatter(mtcars$mpg, mtcars$disp)
plt$xlabel("mpg", fontsize = 12)
plt$ylabel("disp", fontsize = 12)
# display the figure
plt$show()

Example: Regression in R & Python

Loading the data in R
# load the data in R
data(iris)
# exclude the `Species` column (we focus on regression here)
iris <- dplyr::select(iris, -"Species")
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1          5.1         3.5          1.4         0.2
#> 2          4.9         3.0          1.4         0.2
#> 3          4.7         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> 5          5.0         3.6          1.4         0.2
#> 6          5.4         3.9          1.7         0.4

Modelling in pure R
# example of running a model on the iris data
r_lm <- lm("Sepal.Length ~. ", data = iris)
summary(r_lm)
#> 
#> Call:
#> lm(formula = "Sepal.Length ~. ", data = iris)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.82816 -0.21989  0.01875  0.19709  0.84570 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
#> Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
#> Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
#> Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3145 on 146 degrees of freedom
#> Multiple R-squared:  0.8586, Adjusted R-squared:  0.8557 
#> F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16

Modelling in pure Python
# load the required libraries
import statsmodels.api as sm
import pandas as pd
# fit a linear regression model to the iris data coming from R
X = r.iris[['Sepal.Width','Petal.Length','Petal.Width']]
y = r.iris['Sepal.Length']
# add an intercept column (statsmodels does not add one by default)
X = sm.add_constant(X)
py_lm_fit = sm.OLS(y, X).fit()
# print the regression results
print(py_lm_fit.summary())
#>                             OLS Regression Results                            
#> ==============================================================================
#> Dep. Variable:           Sepal.Length   R-squared:                       0.859
#> Model:                            OLS   Adj. R-squared:                  0.856
#> Method:                 Least Squares   F-statistic:                     295.5
#> Date:                Fri, 31 May 2024   Prob (F-statistic):           8.59e-62
#> Time:                        08:25:17   Log-Likelihood:                -37.321
#> No. Observations:                 150   AIC:                             82.64
#> Df Residuals:                     146   BIC:                             94.69
#> Df Model:                           3                                         
#> Covariance Type:            nonrobust                                         
#> ================================================================================
#>                    coef    std err          t      P>|t|      [0.025      0.975]
#> --------------------------------------------------------------------------------
#> const            1.8560      0.251      7.401      0.000       1.360       2.352
#> Sepal.Width      0.6508      0.067      9.765      0.000       0.519       0.783
#> Petal.Length     0.7091      0.057     12.502      0.000       0.597       0.821
#> Petal.Width     -0.5565      0.128     -4.363      0.000      -0.809      -0.304
#> ==============================================================================
#> Omnibus:                        0.345   Durbin-Watson:                   2.060
#> Prob(Omnibus):                  0.842   Jarque-Bera (JB):                0.504
#> Skew:                           0.007   Prob(JB):                        0.777
#> Kurtosis:                       2.716   Cond. No.                         54.7
#> ==============================================================================
#> 
#> Notes:
#> [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Modelling in R with Python libraries
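# load the library
sm <- reticulate::import("statsmodels.api")
# set up the model; note that no constant is added here (no sm$add_constant()),
# so it is fitted without an intercept and the estimates below differ from the
# two fits above
py_lm <- sm$OLS(dplyr::pull(iris, "Sepal.Length"),
    dplyr::select(iris, -"Sepal.Length"))
# fit the model
py_lm_fit <- py_lm$fit()
# print the summary
print(py_lm_fit$summary())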
#> <class 'statsmodels.iolib.summary.Summary'>
#> """
#>                                  OLS Regression Results                                
#> =======================================================================================
#> Dep. Variable:                      y   R-squared (uncentered):                   0.996
#> Model:                            OLS   Adj. R-squared (uncentered):              0.996
#> Method:                 Least Squares   F-statistic:                          1.284e+04
#> Date:                Fri, 31 May 2024   Prob (F-statistic):                   1.33e-177
#> Time:                        08:25:18   Log-Likelihood:                         -61.215
#> No. Observations:                 150   AIC:                                      128.4
#> Df Residuals:                     147   BIC:                                      137.5
#> Df Model:                           3                                                  
#> Covariance Type:            nonrobust                                                  
#> ================================================================================
#>                    coef    std err          t      P>|t|      [0.025      0.975]
#> --------------------------------------------------------------------------------
#> Sepal.Width      1.1211      0.024     47.658      0.000       1.075       1.168
#> Petal.Length     0.9235      0.057     16.205      0.000       0.811       1.036
#> Petal.Width     -0.8957      0.139     -6.439      0.000      -1.171      -0.621
#> ==============================================================================
#> Omnibus:                        0.421   Durbin-Watson:                   2.007
#> Prob(Omnibus):                  0.810   Jarque-Bera (JB):                0.570
#> Skew:                           0.026   Prob(JB):                        0.752
#> Kurtosis:                       2.703   Cond. No.                         26.0
#> ==============================================================================
#> 
#> Notes:
#> [1] R² is computed without centering (uncentered) since the model does not contain a constant.
#> [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#> """
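
The last summary reports different coefficients and an uncentered R² because, as its first note says, that model contains no constant, whereas lm() and the Python chunk with sm.add_constant() both include an intercept. The effect of leaving out the intercept can be seen on a tiny example. The sketch below uses plain Python and made-up toy data (not iris) with a single predictor; the function names are chosen for this illustration.

```python
def fit_through_origin(x, y):
    """Least-squares slope for y = b*x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_with_intercept(x, y):
    """Least-squares (intercept, slope) for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data generated from y = 1 + 2x (made up for this sketch, not iris).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_with_intercept(x, y)
slope_no_const = fit_through_origin(x, y)

print(intercept, slope)          # 1.0 2.0
print(round(slope_no_const, 4))  # 2.3333
```

Forcing the line through the origin changes the slope, because the slope has to absorb the missing intercept. The same mechanism explains why the coefficients of the third fit differ from those of the first two.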

Conclusion

  • renv helps with managing packages in R, ensuring reproducibility, and making your work easier to share.
  • reticulate allows you to use Python in R and combine the strengths of both languages.
  • Learning these tools will help you become more effective in machine learning.
  • Let’s get started with the lab exercises!

Thank you for your attention!

Questions?