Prerequisites

  • R is installed
  • Working knowledge of basic algebra
    • you know what PEMDAS stands for
    • you know what substitution is
      • i.e. solve for y when y = 3x - 3, and x = 1
  • Working knowledge of basic statistics
    • mean, median, mode
    • you kind of remember what a t-test is and what a p-value is
  • Working knowledge of basic computer science principles
    • logical values / statements (i.e. true, false, true and false, etc.)
    • iteration

If you feel iffy about any of that, a quick google search will help, or just send it and continue.

Outline

This outline will help me guide what I will write and cover. This is a list of the things that I think are important to know as a data scientist.

  • the basics
    • r as a calculator
    • data types
    • functions
    • selecting object indexes
    • tidyverse
    • tibble
    • selecting
    • filter
    • grouping
    • summarising
    • plotting
    • reshaping
    • presenting (Rmd)
  • data visualization
    • univariate
      • histogram
      • density plot (pmf / cmf)
      • violin plot
      • boxplot
    • multivariate
      • scatter plot (2 numeric)
      • bar plot (1 nominal 1 numeric)
      • boxplot (1 nominal 1 numeric)
      • violin plot (1 nominal 1 numeric)
  • map making (GIS)
    • projections
    • coordinate reference systems
    • vector data
      • shapefiles
      • sf package
      • oh looks, its a dataframe
      • making the map
    • density maps
      • point density
      • kernel density
  • statistical modeling (machine learning)
    • supervised learning
      • Linear Regression (continuous prediction)
        • evaluation: R^2, MSE, RMSE, MAE
      • Logistic Regression (classification)
        • evaluation: confusion matrix, auc, roc
  • data preprocessing
    • cross validation (k-fold)
      • train, test, validate
    • scaling & centering (standardization)
    • normalization
    • imputation
    • dimensionality reduction
    • pca / lda
    • backward feature selection
  • supervised continued
    • k-nearest neighbors
    • Support Vector Machines (classification)
    • Naive bayes (classification)
    • random forest (regression / classification)
    • gradient boosted trees (regression / classification)
  • Unsupervised learning
    • hierarchical clustering (classification)
    • k-means
    • neural networks
  • natual language processing / text mining