Practical Data Science with Rafal Lukawiecki - Cortana Analytics: Azure Machine Learning, SQL Data Mining and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Meet and learn from an internationally recognized master! You now have the opportunity to attend an exclusive masterclass led by SQL and BI guru Rafal Lukawiecki. The course is an intensive five-day training in which you learn the latest in Azure Machine Learning, SQL data mining, and Revolution Analytics R software.

There is plenty of readily available information about algorithms, deep-learning frameworks, and statistical software packages, but how do you put it all together to solve a real-world problem with data science? This course will teach you about the tools you need but, above all, it will carefully explain the working methods and processes that successful data scientists use. Not only will you know the algorithms, but you will also know how, and when, to start and finish your projects, and which ones are likely to succeed only with significant extra effort.

You will learn machine learning, data mining, some statistics, data preparation, and how to interpret the results. You will see how to formulate business questions in terms of data science hypotheses and experiments, and how to prepare inputs to answer those questions. We will cover common issues and mistakes, such as overtraining, and how to resolve them, as well as how to cope with rare events, such as fraud. At the end of this course you will be able to plan and run data science projects.

As a practicing data miner, Rafal will share his decade of hands-on experience while teaching you about Azure Machine Learning (Azure ML), which is the foundation of Cortana Intelligence Suite, and its highly visual, on-premises companion, the SQL Server Analysis Services Data Mining engine. These are supplemented with the free Microsoft R Open and the license-based Microsoft SQL Server 2016 R Services and R Server software. We will use some Excel; however, most of our time will be spent in ML Studio, with some in RStudio, SSDT, SSMS, and the Azure Portal.

About Rafal Lukawiecki

As Strategic Consultant at Project Botticelli Ltd (projectbotticelli.com), Rafal focuses on making advanced analytics easy, insightful, and useful, helping clients achieve better organizational performance. Passing those skills to consultants, developers, and board members is important to him. He specializes in business intelligence, looking for valuable patterns and correlations using data mining, and he is also known for his work in cryptography, enterprise architecture, and solution delivery. Rafal has been a popular, well-travelled speaker at major IT conferences since 1998. He even had the honour of sharing keynote platforms with Bill Gates, Neil Armstrong, and Steve Ballmer. A natural educator, he explains complex concepts in simple terms in an engaging, enjoyable, energetic style. Outside IT, Rafal spends a quarter of every year finding abstractions in natural landscapes, expressing them through traditional, black-and-white, large-format film photography in his hand-made, silver-gelatin prints (see rafal.net).

Target audience

Analysts, power users, predictive and BI developers, database and other professionals who wish to embrace machine learning, budding data scientists, consultants.

Prerequisites

No formal prerequisites. Basic knowledge of SQL Server Data Tools and Excel, and any analytical experience, will help. Best of all: prepare questions that you would like to answer using predictive analytics and machine learning.

Format

60% lectures, 20% demos, plus 20% time to help you follow the demos and tasks on your own equipment, if you bring a laptop. You will be challenged to find answers to 4 problems during the course, and you will have a chance to build your own models in SSAS, Azure ML, and R. Doing that will help you learn, but it is not a requirement: you are quite welcome to sit back, observe the demos, and ask questions. If you bring your own data, you can analyse it. You will get a list of free or evaluation-edition software to preinstall before attending. You will need your own Azure account: a free one is OK, but a paid one is better, and it can be inexpensive, or even free during a trial. You can copy course experiments and data into your ML workspace for learning and future reference.

Agenda

Module 1: Data Science Fundamentals
We begin the course with a thorough introduction of all of the key concepts, terminology, components, and tools. Topics covered include:
• Introduction to data science and its components
• Machine learning vs data mining vs artificial intelligence
• Tools landscape
• Statistics
• Big data
• Data wrangling
• Teamwork

Module 2.1: Tools (SQL & R)
Configuring Azure ML in the cloud is effortless. You need to pay a little more attention to the on-premises R and SQL Server environments, to make sure that you can easily access your modelling data. Topics covered include:
• Getting started with and using SSAS DM, and SQL R
• Structures, models, data flows
• Configuration concerns and pricing
• Using Rattle with R and RStudio
• Using SQL Server 2016 R Server and Services
• Getting a feel for the data: interpreting notched boxplots in R

Module 2.2: Tools (R & Azure ML)
• Overview of Cortana Intelligence Suite
• Getting started with and using Azure ML and Cortana R
• Azure requirements and dependencies
• Provisioning workspaces
• Uploading and connecting to SQL Azure data
• Creating and running Azure ML experiments (programs)
• Embedding R in Azure ML

Module 3: Data
Data science requires you to prepare your data into a rather unique, flat, and completely denormalised format. While inputs are always necessary, and you may need to engineer hundreds of them, we do not need predictive outputs in all cases. Topics covered include:
• Inputs and outputs, features and labels
• Data formats, discretization vs continuous
• Cases, observations, signatures
• Feature engineering
• Azure ML data preparation and manipulation modules
• Preparing unstructured text for text analysis
• Feature hashing
• Moving data around and its storage
• Briefly: other Cortana Intelligence Suite tools for data management and storage, including data lakes, BLOBs, and other Hadoop
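
As a rough illustration of the feature hashing topic listed above, the sketch below maps the words of a text into a fixed-length count vector. The function name `hash_features`, the bucket count, and the use of MD5 as the hash are assumptions made for this example only; they do not describe how the Azure ML module works internally.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Map a list of tokens to a fixed-length count vector (the hashing trick)."""
    vector = [0] * n_buckets
    for token in tokens:
        # Use a stable hash: Python's built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        vector[bucket] += 1
    return vector


# Hypothetical input text, used only for illustration.
text = "the quick brown fox jumps over the lazy dog"
features = hash_features(text.lower().split(), n_buckets=8)
print(features)
```

The vector length stays the same however many distinct words appear, at the cost of occasional collisions when two words share a bucket.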

Module 4: Process
The analytical process consists of problem formulation, data preparation, modelling, validation, and deployment, all in an iterative fashion. You will learn about the industry-standard CRISP-DM approach, but the key subject of this module is how to apply the scientific method of reasoning to solve real-world business problems. Notably, you will learn how to start projects by expressing business needs as hypotheses, and how to test them. Topics covered include:
• Stating business questions in data science terms
• CRISP-DM
• Scientific method of reasoning
• Hypothesis testing and experiments
• Student’s t-test
• Pearson chi-squared test
• Iterative hypothesis refinement
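
To make the idea of testing a business hypothesis more concrete, here is a minimal sketch of the Pearson chi-squared statistic for a contingency table, written in plain Python. The counts are invented for illustration (a hypothetical campaign and purchase outcome), and the helper name `chi_squared` is an assumption, not part of any course material.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = campaign shown / not shown,
# columns = bought / did not buy.
table = [[30, 70], [20, 80]]
stat, dof = chi_squared(table)
print(stat, dof)
```

With one degree of freedom, the statistic would need to exceed about 3.84 to reject independence at the 5% level, so the hypothesis "the campaign changes purchase behaviour" would not be supported by these made-up counts.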

Module 5: Algorithms
There are hundreds of machine learning algorithms, yet they belong to just a dozen groups, of which 4-5 are in very common use. We will introduce those algorithm classes, and we will discuss some of the most often used examples in each class, while explaining which technology tools (Azure ML, SQL, or R) provide their most convenient implementation. Topics covered include:
• What does data mining do?
• Algorithm classes in Azure ML, R, and SSAS
• Supervised vs Unsupervised learning
• Classifiers
• Clustering
• Regression
• Similarity Matching
• Recommenders

Module 6: Clustering, Segmentation, and Anomaly Detection and Prediction
Segmentation is the main application of unsupervised learning using clustering algorithms. While the action of the algorithm is usually quick and easy to configure, interpreting the results can take a lot of time and intuition. We will spend plenty of time practicing segmentation, interpreting the results and subsequently parameterising the algorithm to provide us with additional insight, and to help you apply it back to your own data. You will even learn how to apply this technique for anomaly (outlier) detection and text analytics! Topics covered include:
• Introduction to segmentation
• Clustering algorithms (k-means, EM, and others)
• Interpreting clusters
• Cluster characteristics
• Discrimination
• Tornado charts
• Using clustering for text analysis
• Anomaly detection with clustering, PCA and SVMs
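
The k-means algorithm named above can be sketched in a few lines. This is a simplified version under stated assumptions: two-dimensional points, caller-supplied starting centroids, and invented data. It is not the implementation used by Azure ML, SSAS, or R.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The algorithm itself is quick; the harder work is deciding what each resulting cluster means, and points that sit far from every centroid are natural candidates for anomaly review.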

Module 7: Classification
Without doubt, classifiers are the most important, and the most often used category of machine learning algorithms, and the foundation of algorithmic data science. We will focus on several variants of the most important classifier algorithm—decision tree—while progressively interpreting the results, and improving its performance. After introducing neural networks and logistic regression we will also compare the performance of all of these classifiers on our test dataset. Topics covered include:
• Introduction to classifiers
• Two-class (binary) vs multi-class
• Decision trees, forests, and boosting
• Decision jungles *
• Neural networks and logistic regression
• Overfitting (overtraining) concerns
• Using classifiers for text analysis
• Associative decision trees *

Module 8: Basic Statistics
Basic concepts of statistics, notably means, medians, modes, and variance or standard deviation, are essential to validating data and model quality. Probability and the concept of p-values help you decide which of your inputs (features) are more important than others. R makes all of these powerful ideas accessible and visual, while Azure ML enables you to deploy them easily into production. Topics covered include:
• Basic concepts of statistics: population vs sample, measure types, means and dispersion, distributions
• Confidence intervals, p-values
• Correlation
• Descriptive statistics with R
• Basic concepts of probability
• Finding important features using p-values, linear regression and ANOVA *
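
The course uses R for descriptive statistics; purely as a compact illustration of the measures named in this module, the sketch below computes them with Python's standard `statistics` module. The sample values are invented.

```python
import statistics

# Hypothetical sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "mode": statistics.mode(sample),
    # variance() and stdev() use the sample (n - 1) denominator.
    "sample_variance": statistics.variance(sample),
    "sample_stdev": statistics.stdev(sample),
    # pstdev() treats the data as the whole population (n denominator).
    "population_stdev": statistics.pstdev(sample),
}

for name, value in summary.items():
    print(name, value)
```

The difference between the sample and population versions of the standard deviation reflects the population vs sample distinction listed in the topics above.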

Module 9: Model Validation
The most important aspect of any data science project is the iterative validation and improvement of the models. Without validation, your models cannot be used. There are several tests of model validity, and we will focus on accuracy and reliability, showing you different ways to measure them. Topics covered include:
• Testing accuracy
• Lift charts
• Testing reliability
• Testing usefulness

Module 10: Classifier Precision
Validation of classifiers is likely to be your main occupation as a data scientist, because classifiers are used so often, and because their precision is not always easy to balance with business requirements, such as restricted resources or required business performance. We will introduce the fundamentals of finding the balance between the acceptable number of false positives and false negatives by using classification (confusion) matrices, and plotting the options using ROC (Receiver Operating Characteristic) charts. Without a moment of doubt, this is the most important module of this entire course. Topics covered include:
• Testing classifiers
• False positives vs. false negatives
• Classification (confusion) matrix
• Precision
• Recall
• Balancing precision with recall vs business goals and constraints
• Charting precision-recall (sensitivity-specificity)
• ROC curves
• Other measures of accuracy
• Cross-validation
• Optimising binary classifier thresholds for a known business goal of prediction quality
• Refining models to improve accuracy and reliability
• Hyperparameter tuning
• Class imbalance problem (fraud analytics and rare event prediction) *
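
The trade-off between false positives and false negatives described in this module can be shown with a small sketch. It counts the cells of a classification (confusion) matrix at two score thresholds and derives precision and recall. The labels, scores, and function names are invented for illustration.

```python
def confusion_counts(actual, scores, threshold):
    """Count TP, FP, TN, FN for a binary classifier at a given score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(actual, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true labels (1 = positive) and model scores.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

results = {}
for threshold in (0.5, 0.25):
    tp, fp, tn, fn = confusion_counts(actual, scores, threshold)
    results[threshold] = precision_recall(tp, fp, fn)
    print(threshold, (tp, fp, tn, fn), results[threshold])
```

Lowering the threshold from 0.5 to 0.25 finds every positive case in this toy data (higher recall) but flags more negatives by mistake (lower precision). Which side of that trade-off is acceptable depends on the business goal and constraints.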

Module 11: Regressions
Considered by some as the numerical equivalent of classifiers, regression is a large subject of its own. We will introduce its simple but very popular form, linear regression, and the more precise, but also prone-to-overfitting, decision tree variant. Topics covered include:
• Introduction to simple regressions
• Linear regression (classic)
• Regression decision trees and other ensemble regression algorithms
• Relationship to ANOVA *
• Measuring linear regression quality (R-squared, predictor p-values, RMSE, MAE, RAE, RSE, and additional testing using R)
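
Several of the quality measures listed above are simple to compute by hand. The sketch below fits a one-variable least-squares line and reports R-squared, RMSE, and MAE. The data points and function names are assumptions for illustration, and predictor p-values are left out.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def regression_metrics(ys, predictions):
    """Return MAE, RMSE, and R-squared for predictions against actual values."""
    n = len(ys)
    errors = [y - p for y, p in zip(ys, predictions)]
    mean_y = sum(ys) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in ys)  # total sum of squares
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": math.sqrt(ss_res / n),
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical observations that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
metrics = regression_metrics(ys, predictions)
print(slope, intercept, metrics)
```

These numbers are measured on the same points used for fitting, so they flatter the model; in practice they would be checked on held-out data.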

Module 12: Similarity Matching & Recommenders
From basic concepts of similarity matching, through model-based associative analysis, collaborative filtering, to hybrid systems, like the Matchbox algorithm, there are several techniques for building recommenders. You will get a good overview of this subject, as well as an understanding of how to use these techniques for advanced data exploration, such as Market Basket Analysis. Topics covered include:
• Introduction to recommender concepts
• Model-based, similarity-based, and hybrid recommenders
• Association rules
• Understanding itemsets and rules
• Rule importance vs. rule probability
• Data structures for association rules
• Market Basket Analysis
• Collaborative filtering
• Matchbox recommenders
• Validating recommenders
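
For Market Basket Analysis, the basic quantities behind itemsets and rules can be sketched as follows. The baskets are invented, and the measures shown are the generic support, confidence, and lift. Confidence is what is usually meant by the probability of a rule, while lift is used here only as a generic measure of interest and is an assumption, not the importance score that SSAS reports.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rule_stats(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    lift = confidence / support(transactions, consequent)
    return both, confidence, lift


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule_support, confidence, lift = rule_stats(transactions, {"bread"}, {"milk"})
print(rule_support, confidence, lift)
```

A lift below 1, as in this toy data, suggests that buying bread does not make milk more likely than it already is overall, even though the rule's confidence looks high.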

Module 13: Other Algorithms (Brief Overview)
As the course comes to an end, we will give a brief overview of some of the remaining interesting algorithms, without going into much detail, so that you understand the existing general approaches. Topics covered include:
• Sequence clustering and Markov chains
• SVM (Support Vector Machines)
• Time series *
• Image recognition *
• Text analysis

Module 14: Production & Model Maintenance
If you plan on using your models for prediction, rather than just for the exploration of data, you need to deploy your models to production and maintain them on an on-going basis. You will learn about the easiest way to do so using Azure ML web services and its REST synchronous and asynchronous APIs, as well as how to deploy and invoke SSAS models by using DMX queries. Topics covered include:
• Deploying models to production
• SSAS models and DMX queries
• Azure ML web services: preparation and publishing
• REST APIs: request/response vs batch
• On-going maintenance and model updates

Please note: we reserve the right to amend the order and the day allocation of the topics and modules to best suit the dynamic character of the class and to answer questions as they arise. Some subjects (marked with an asterisk *) are optional, and will only be covered if time allows.

About the course

Price: SEK 39,450.00 (excluding VAT)
Length: 5 days
Course code: MC027
Course start: 11 December