کتاب «Practical Statistics for Data Scientists» یک منبع جامع و کاربردی برای معرفی مفاهیم و تکنیکهای آماری ضروری در حوزههای تحلیل دادهها، یادگیری ماشین و هوش مصنوعی است. در این کتاب، مفاهیم آماری به شیوهای قابل فهم و با استفاده از سناریوهای واقعی علم داده و کدهای نمونه به زبانهای R و پایتون ارائه شده است.
جزئیات فصلهای هفتگانهی کتاب به صورت زیر است.
1. Exploratory Data Analysis
- Elements of Structured Data
- Rectangular Data
+ Data Frames and Indexes
+ Nonrectangular Data Structures
- Estimates of Location
+ Mean
+ Median and Robust Estimates
+ Example: Location Estimates of Population and Murder Rates
- Estimates of Variability
+ Standard Deviation and Related Estimates
+ Estimates Based on Percentiles
+ Example: Variability Estimates of State Population
- Exploring the Data Distribution
+ Percentiles and Boxplots
+ Frequency Tables and Histograms
+ Density Plots and Estimates
- Exploring Binary and Categorical Data
+ Mode
+ Expected Value
+ Probability
- Correlation
+ Scatterplots
- Exploring Two or More Variables
+ Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)
+ Two Categorical Variables
+ Categorical and Numeric Data
+ Visualizing Multiple Variables
2. Data and Sampling Distributions
- Random Sampling and Sample Bias
+ Bias
+ Random Selection
+ Size Versus Quality: When Does Size Matter?
+ Sample Mean Versus Population Mean
- Selection Bias
+ Regression to the Mean
- Sampling Distribution of a Statistic
+ Central Limit Theorem
+ Standard Error
- The Bootstrap
+ Resampling Versus Bootstrapping
- Confidence Intervals
- Normal Distribution
+ Standard Normal and QQ-Plots
- Long-Tailed Distributions
- Student’s t-Distribution
- Binomial Distribution
- Chi-Square Distribution
- F-Distribution
- Poisson and Related Distributions
+ Poisson Distributions
+ Exponential Distribution
+ Estimating the Failure Rate
+ Weibull Distribution
3. Statistical Experiments and Significance Testing
- A/B Testing
+ Why Have a Control Group?
+ Why Just A/B? Why Not C, D,...?
- Hypothesis Tests
+ The Null Hypothesis
+ Alternative Hypothesis
+ One-Way Versus Two-Way Hypothesis Tests
- Resampling
+ Permutation Test
+ Example: Web Stickiness
+ Exhaustive and Bootstrap Permutation Tests
+ Permutation Tests: The Bottom Line for Data Science
- Statistical Significance and p-Values
+ p-Value
+ Alpha
+ Type 1 and Type 2 Errors
+ Data Science and p-Values
- t-Tests
- Multiple Testing
- Degrees of Freedom
- ANOVA
+ F-Statistic
+ Two-Way ANOVA
- Chi-Square Test
+ Chi-Square Test: A Resampling Approach
+ Chi-Square Test: Statistical Theory
+ Fisher’s Exact Test
+ Relevance for Data Science
- Multi-Arm Bandit Algorithm
- Power and Sample Size
+ Sample Size
4. Regression and Prediction
- Simple Linear Regression
+ The Regression Equation
+ Fitted Values and Residuals
+ Least Squares
+ Prediction Versus Explanation (Profiling)
- Multiple Linear Regression
+ Example: King County Housing Data
+ Assessing the Model
+ Cross-Validation
+ Model Selection and Stepwise Regression
+ Weighted Regression
- Prediction Using Regression
+ The Dangers of Extrapolation
+ Confidence and Prediction Intervals
- Factor Variables in Regression
+ Dummy Variables Representation
+ Factor Variables with Many Levels
+ Ordered Factor Variables
- Interpreting the Regression Equation
+ Correlated Predictors
+ Multicollinearity
+ Confounding Variables
+ Interactions and Main Effects
- Regression Diagnostics
+ Outliers
+ Influential Values
+ Heteroskedasticity, Non-Normality, and Correlated Errors
+ Partial Residual Plots and Nonlinearity
- Polynomial and Spline Regression
+ Polynomial
+ Splines
+ Generalized Additive Models
5. Classification
- Naive Bayes
+ Why Exact Bayesian Classification Is Impractical
+ The Naive Solution
+ Numeric Predictor Variables
- Discriminant Analysis
+ Covariance Matrix
+ Fisher’s Linear Discriminant
+ A Simple Example
- Logistic Regression
+ Logistic Response Function and Logit
+ Logistic Regression and the GLM
+ Generalized Linear Models
+ Predicted Values from Logistic Regression
+ Interpreting the Coefficients and Odds Ratios
+ Linear and Logistic Regression: Similarities and Differences
+ Assessing the Model
- Evaluating Classification Models
+ Confusion Matrix
+ The Rare Class Problem
+ Precision, Recall, and Specificity
+ ROC Curve
+ AUC
+ Lift
- Strategies for Imbalanced Data
+ Undersampling
+ Oversampling and Up/Down Weighting
+ Data Generation
+ Cost-Based Classification
+ Exploring the Predictions
6. Statistical Machine Learning
- K-Nearest Neighbors
+ A Small Example: Predicting Loan Default
+ Distance Metrics
+ One Hot Encoder
+ Standardization (Normalization, z-Scores)
+ Choosing K
+ KNN as a Feature Engine
- Tree Models
+ A Simple Example
+ The Recursive Partitioning Algorithm
+ Measuring Homogeneity or Impurity
+ Stopping the Tree from Growing
+ Predicting a Continuous Value
+ How Trees Are Used
- Bagging and the Random Forest
+ Bagging
+ Random Forest
+ Variable Importance
+ Hyperparameters
- Boosting
+ The Boosting Algorithm
+ XGBoost
+ Regularization: Avoiding Overfitting
+ Hyperparameters and Cross-Validation
7. Unsupervised Learning
- Principal Components Analysis
+ A Simple Example
+ Computing the Principal Components
+ Interpreting Principal Components
+ Correspondence Analysis
- K-Means Clustering
+ A Simple Example
+ K-Means Algorithm
+ Interpreting the Clusters
+ Selecting the Number of Clusters
- Hierarchical Clustering
+ A Simple Example
+ The Dendrogram
+ The Agglomerative Algorithm
+ Measures of Dissimilarity
- Model-Based Clustering
+ Multivariate Normal Distribution
+ Mixtures of Normals
+ Selecting the Number of Clusters
- Scaling and Categorical Variables
+ Scaling the Variables
+ Dominant Variables
+ Categorical Data and Gower’s Distance
+ Problems with Clustering Mixed Data
همانگونه که از این عناوین مشخص است، تمرکز کتاب روی بحثهای آماری حوزهی تحلیل داده و یادگیری ماشین بوده و یک منبع ارزشمند برای افرادی است که به دنبال یادگیری و استفاده از آمار در علم داده و هوش مصنوعی هستند.
نسخهی PDF ویراست دوم این کتاب از این پیوند قابل دریافت است.