MPP: Data Science

Find insights and solve business problems.

8 Data Science Skills. 1.5 Million Jobs.

Opportunities for data scientists—one of today’s hottest jobs—are rapidly growing in response to the exponential amounts of data being captured and analyzed. Organizations hire data scientists to find insights and to solve meaningful business problems.

This track has 11 courses; some courses have multiple options.

Summary

Who takes this course

Starting and experienced IT and business professionals

Difficulty

Advanced

Assessment

The courses have a final assessment with re-take restrictions. Successfully completing the final assessment will enable you to redeem your certificate of completion.

Certification

You complete the Microsoft Professional Program by finishing the Capstone project. There is no exam for this track.

Completion time

322-488 hours

Curriculum

Introduction to Data Science

Get started on your Data Science journey.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Introductory
Language
English

About this course

Learn what it takes to become a data scientist.
This is the first stop in the Data Science curriculum from Microsoft. It will help you get started with the program, plan your learning schedule, and connect with fellow students and teaching assistants. Along the way, you’ll get an introduction to working with and exploring data using a variety of visualisation, analytical, and statistical techniques.

What you’ll learn

  • How the Microsoft Data Science curriculum works
  • How to navigate the curriculum and plan your course schedule
  • Basic data exploration and visualization techniques in Microsoft Excel
  • Foundational statistics that can be used to analyze data

Prerequisites

  • A good understanding of modern marketing in B2B
  • A good understanding of sales and business acumen
Analyzing and Visualizing Data (2 training options available)
Analyzing and Visualizing Data with Excel (Option 1)

Develop your skills with Excel, one of the common tools that data scientists depend on to gather, transform, analyze, and visualize data.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Intermediate
Language
English

About this course

Excel is one of the most widely used solutions for analyzing and visualizing data. It now includes tools that enable the analysis of more data, with improved visualizations and more sophisticated business logic. In this data science course, you will get an introduction to the latest versions of these new tools in Excel 2016 from an expert on the Excel Product Team at Microsoft.

Learn how to import data from different sources, create mashups between data sources, and prepare data for analysis. After preparing the data, find out how business calculations can be expressed using the DAX calculation engine. See how the data can be visualized and shared to the Power BI cloud service, after which it can be used in dashboards, queried using plain English sentences, and even consumed on mobile devices.

Do you feel that the content of this course is a bit too advanced for you and that you need to fill some gaps in your Excel knowledge? Do you need a better understanding of how pivot tables, pivot charts, and slicers work together, and help in creating dashboards? If so, check out DAT205x: Introduction to Data Analysis using Excel.

What you’ll learn

  • Gather and transform data from multiple sources
  • Discover and combine data in mashups
  • Learn about data model creation
  • Explore, analyze, and visualize data

Prerequisites

Students who take this training should have:

  • An understanding of Excel analytic tools such as tables, pivot tables, and pivot charts. Some experience working with data from databases and text files is also helpful.

System Requirements:

  • Windows operating system: Windows 7 or later.
  • Microsoft Excel on Windows operating system:
    • Microsoft Excel 2016 Professional Plus or standalone edition
    • Microsoft Excel 2013 Professional Plus or standalone edition
    • Microsoft Excel 2010
  • Other versions of Microsoft Excel are not supported
Analyzing and Visualizing Data with Power BI (Option 2)

Learn Power BI, a powerful cloud-based service that helps data scientists visualize and share insights from their data.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Introductory
Language
English

About this course

Power BI is quickly gaining popularity among professionals in data science as a cloud-based service that helps them easily visualize and share insights from their organizations’ data.

In this data science course, you will learn from the Power BI product team at Microsoft with a series of short, lecture-based videos, complete with demos, quizzes, and hands-on labs. You’ll walk through Power BI, end to end, starting from how to connect to and import your data, author reports using Power BI Desktop, and publish those reports to the Power BI service. Plus, learn to create dashboards and share with business users—on the web and on mobile devices.

What you’ll learn

  • Connect, import, shape, and transform data for business intelligence (BI)
  • Visualize data, author reports, and schedule automated refresh of your reports
  • Create and share dashboards based on reports in Power BI desktop and Excel
  • Use natural language queries
  • Create real-time dashboards

Prerequisites

Students who take this training should have:

  • Some experience in working with data from Excel, databases, or text files.

Course Syllabus

Week 1

  • Understanding key concepts in business intelligence, data analysis, and data visualization
  • Importing your data and automatically creating dashboards from services such as Marketo, Salesforce, and Google Analytics
  • Connecting to and importing your data, then shaping and transforming that data
  • Enriching your data with business calculations

Week 2

  • Visualizing your data and authoring reports
  • Scheduling automated refresh of your reports
  • Creating dashboards based on reports and natural language queries
  • Sharing dashboards across your organization
  • Consuming dashboards in mobile apps

Week 3

  • Leveraging your Excel reports within Power BI
  • Creating custom visualizations that you can use in dashboards and reports
  • Collaborating within groups to author reports and dashboards
  • Sharing dashboards effectively based on your organization’s needs

Week 4

  • Exploring live connections to data with Power BI
  • Connecting directly to SQL Azure, HD Spark, and SQL Server Analysis Services
  • Introduction to Power BI Development API
  • Leveraging custom visuals in Power BI
Analytics Storytelling for Impact

Learn the art and science of data storytelling and achieve greater analytics impact.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Intermediate
Language
English

About this course

All analytics work begins and ends with a story. Storytelling with data is the analytics professional’s missing link in delivering the essence of data signals and insights to executives, management, and other stakeholders.

In this analytics storytelling course, you’ll learn effective strategies and tools to master data communication in the most impactful way possible—through well-crafted analytics stories.

You’ll explore what a story is and, perhaps more importantly, what a story is not. Find out how stories create value and why they matter. Learn to craft stories, command the room, finish strong, and assess your impact. Get practical help applying these ideas to your data analytics work. Plus, you’ll learn guidelines and best practices for creating high-impact reports and presentations.

What you’ll learn

  • How to apply storytelling principles to your analytics work
  • How to improve your analytics presentations through storytelling
  • Guidelines and best practices for creating high-impact reports and presentations

Prerequisites

  • One of the following edX courses or equivalent knowledge and skills:
    • Analyzing and Visualizing Data with Excel
    • Analyzing and Visualizing Data with Power BI
  • Working knowledge of PowerPoint.
Ethics and Law in Data and Analytics

Analytics and AI are powerful tools that have real-world outcomes. Learn how to apply practical, ethical, and legal constructs and scenarios so that you can be an effective analytics professional.

Summary

Length
6 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

Corporations, governments, and individuals have powerful tools in Analytics and AI to create real-world outcomes, for good or for ill.

Data professionals need both frameworks and methods to achieve optimal results in their jobs while being good stewards of their critical role in society.

In this course, you’ll learn to apply ethical and legal frameworks to initiatives in the data profession. You’ll explore practical approaches to data and analytics problems posed by work in Big Data, Data Science, and AI. You’ll also investigate applied data methods for ethical and legal work in Analytics and AI.

What you’ll learn

  • Foundational abilities in applying ethical and legal frameworks for the data profession
  • Practical approaches to data and analytics problems, including Big Data, Data Science, and AI
  • Applied data methods for ethical and legal work in Analytics and AI
Querying Data with Transact-SQL

From querying and modifying data in SQL Server or Azure SQL to programming with Transact-SQL, learn essential skills that employers need.

Summary

Length
6 weeks (4 to 5 hours per week)
Level
Intermediate
Language
English

About this course

Transact-SQL is an essential skill for data professionals and developers working with SQL databases. With this combination of expert instruction, demonstrations, and practical labs, step from your first SELECT statement through to implementing transactional programmatic logic.

Work through multiple modules, each of which explores a key area of the Transact-SQL language, with a focus on querying and modifying data in Microsoft SQL Server or Azure SQL Database. The labs in this course use a sample database that can be deployed easily in Azure SQL Database, so you get hands-on experience with Transact-SQL without installing or configuring a database server.

What you’ll learn

  • Create Transact-SQL SELECT queries
  • Work with data types and NULL
  • Query multiple tables with JOIN
  • Explore set operators
  • Use functions and aggregate data
  • Work with subqueries and APPLY
  • Use table expressions
  • Group sets and pivot data
  • Modify data
  • Program with Transact-SQL
  • Implement error handling and transactions

Data Science - Introduction to R and Python (2 training options available)

Introduction to R for Data Science (Option 1)

Learn the R statistical programming language, the lingua franca of data science, in this hands-on course.

Summary

Length
4 weeks (2 to 3 hours per week)
Level
Introductory
Language
English

About this course

R is rapidly becoming the leading language in data science and statistics. Today, R is the tool of choice for data science professionals in every industry and field. Whether you are a full-time number cruncher or just the occasional data analyst, R will suit your needs.

This introduction to R programming course will help you master the basics of R. In seven sections, you will cover its basic syntax, making you ready to undertake your own first data analysis using R. Starting from variables and basic operations, you will eventually learn how to handle data structures such as vectors, matrices, data frames and lists. In the final section, you will dive deeper into the graphical capabilities of R, and create your own stunning data visualisations. No prior knowledge in programming or data science is required.

What makes this course unique is that you will continuously practice your newly acquired skills through interactive in-browser coding challenges using the DataCamp platform. Instead of passively watching videos, you will solve real data problems while receiving instant and personalised feedback that guides you to the correct solution.

What you’ll learn

  • Introductory R language fundamentals and basic syntax
  • What R is and how it’s used to perform data analysis
  • Become familiar with the major R data structures
  • Create your own visualizations using R

Prerequisites

  • None, but previous experience in basic mathematics is helpful.

Course Syllabus

Section 1: Introduction to Basics

Take your first steps with R. Discover the basic data types in R and assign your first variable.

Section 2: Vectors

Analyze gambling behaviour using vectors. Create, name and select elements from vectors.

Section 3: Matrices

Learn how to work with matrices in R. Do basic computations with them and demonstrate your knowledge by analyzing the Star Wars box office figures.

Section 4: Factors

R stores categorical data in factors. Learn how to create, subset and compare categorical data.

Section 5: Data Frames

When working with R, you’ll probably deal with data frames all the time. Therefore, you need to know how to create one, select the most interesting parts of it, and order them.

Section 6: Lists

Lists allow you to store components of different types. Section 6 will show you how to deal with lists.

Section 7: Basic Graphics

Discover R’s packages to do graphics and create your own data visualizations.

Introduction to Python for Data Science (Option 2)

The ability to analyse data with Python is critical in data science. Learn the basics, and move on to create stunning visualisations.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Introductory
Language
English

About this course

Python is a very powerful programming language used for many different applications. Over time, the huge community around this open source language has created quite a few tools to efficiently work with Python. In recent years, a number of tools have been built specifically for data science. As a result, analysing data with Python has never been easier.

In this practical course, you will start from the very beginning, with basic arithmetic and variables, and learn how to handle data structures, such as Python lists, Numpy arrays, and Pandas DataFrames. Along the way, you’ll learn about Python functions and control flow. Plus, you’ll look at the world of data visualisations with Python and create your own stunning visualisations based on real data.

What you’ll learn

  • Explore Python language fundamentals, including basic syntax, variables, and types
  • Create and manipulate regular Python lists
  • Use functions and import packages
  • Build Numpy arrays, and perform interesting calculations
  • Create and customize plots on real data
  • Supercharge your scripts with control flow, and get to know the Pandas DataFrame

Prerequisites

  • Some experience in working with data from Excel, databases, or text files.

Course Syllabus

Section 1: Python Basics

Take your first steps in the world of Python. Discover the different data types and create your first variable.

Section 2: Python Lists

Get to know the first way to store many different data points under a single name. Create, subset, and manipulate lists in all sorts of ways.

Section 3: Functions and Packages

Learn how to get the most out of other people’s efforts by importing Python packages and calling functions.

Section 4: Numpy

Write superfast code with Numerical Python, a package to efficiently store and do calculations with huge amounts of data.

Section 5: Matplotlib

Create different types of visualisations depending on the message you want to convey. Learn how to build complex and customised plots based on real data.

Section 6: Control flow and Pandas

Write conditional constructs to tweak the execution of your scripts and get to know the Pandas DataFrame: the key data structure for Data Science in Python.

Data Science - Essential Math and Statistics (3 training options available)

Essential Math for Machine Learning: R Edition (Option 1)

Learn the essential mathematical foundations for machine learning and artificial intelligence.

Summary

Length
6 weeks (6 to 8 hours per week)
Level
Intermediate
Language
English

About this course

Want to study machine learning or artificial intelligence, but worried that your math skills may not be up to it? Do words like “algebra” and “calculus” fill you with dread? Has it been so long since you studied math at school that you’ve forgotten much of what you learned in the first place?

You’re not alone. Machine learning and AI are built on mathematical principles like Calculus, Linear Algebra, Probability, Statistics, and Optimization; and many would-be AI practitioners find this daunting. This course is not designed to make you a mathematician. Rather, it aims to help you learn some essential foundational concepts and the notation used to express them. The course provides a hands-on approach to working with data and applying the techniques you’ve learned.

This course is not a full math curriculum. It’s not designed to replace school or college math education. Instead, it focuses on the key mathematical concepts that you’ll encounter in studies of machine learning. It is designed to fill the gaps for students who missed these key concepts as part of their formal education, or who need to refresh their memories after a long break from studying math.

What you’ll learn

  • Familiarity with Equations, Functions, and Graphs
  • Differentiation and Optimization
  • Vectors and Matrices
  • Statistics and Probability

Prerequisites

To complete this course successfully, you should have:

  • A basic knowledge of math
  • Some programming experience – R is preferred.
  • A willingness to learn through self-paced study.
Essential Math for Machine Learning: Python Edition (Option 2)

Learn the essential mathematical foundations for machine learning and artificial intelligence.

Summary

Length
6 weeks (6 to 8 hours per week)
Level
Intermediate
Language
English

About this course

Want to study machine learning or artificial intelligence, but worried that your math skills may not be up to it? Do words like “algebra” and “calculus” fill you with dread? Has it been so long since you studied math at school that you’ve forgotten much of what you learned in the first place?

You’re not alone. Machine learning and AI are built on mathematical principles like Calculus, Linear Algebra, Probability, Statistics, and Optimisation; and many would-be AI practitioners find this daunting. This course is not designed to make you a mathematician. Rather, it aims to help you learn some essential foundational concepts and the notation used to express them. The course provides a hands-on approach to working with data and applying the techniques you’ve learned.

This course is not a full math curriculum; it’s not designed to replace school or college math education. Instead, it focuses on the key mathematical concepts that you’ll encounter in studies of machine learning. It is designed to fill the gaps for students who missed these key concepts as part of their formal education, or who need to refresh their memories after a long break from studying math.

What you’ll learn

After completing this course, you will be familiar with the following mathematical concepts and techniques:

  • Equations, Functions, and Graphs
  • Differentiation and Optimisation
  • Vectors and Matrices
  • Statistics and Probability

Prerequisites

  • A basic knowledge of math
  • Some programming experience – Python is preferred.
  • A willingness to learn through self-paced study.

Course Syllabus

  • Introduction
  • Equations, Functions, and Graphs
  • Differentiation and Optimisation
  • Vectors and Matrices
  • Statistics and Probability
Essential Statistics for Data Analysis using Excel (Option 3)

Gain a solid understanding of statistics and basic probability, using Excel, and build on your data analysis and data science foundation.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Intermediate
Language
English

About this course

If you’re considering a career as a data analyst, you need to know about histograms, Pareto charts, Boxplots, Bayes’ theorem, and much more. In this applied statistics course, the second in our Microsoft Excel Data Analyst XSeries, use the powerful tools built into Excel, and explore the core principles of statistics and basic probability—from both the conceptual and applied perspectives. Learn about descriptive statistics, basic probability, random variables, sampling and confidence intervals, and hypothesis testing. And see how to apply these concepts and principles using the environment, functions, and visualizations of Excel.

As a data science pro, the ability to analyze data helps you to make better decisions, and a solid foundation in statistics and basic probability helps you to better understand your data. Using real-world concepts applicable to many industries, including medical, business, sports, insurance, and much more, learn from leading experts why Excel is one of the top tools for data analysis and how its built-in features make Excel a great way to learn essential skills.

Before taking this course, you should be familiar with organizing and summarizing data using Excel analytic tools, such as tables, pivot tables, and pivot charts. You should also be comfortable (or willing to try) creating complex formulas and visualizations. Want to start with the basics? Check out DAT205x: Introduction to Data Analysis using Excel. As you learn these concepts and get more experience with this powerful tool that can be extremely helpful in your journey as a data analyst or data scientist, you may want to also take the third course in our series, DAT206x Analyzing and Visualizing Data with Excel. This course includes excerpts from Microsoft Excel 2016: Data Analysis and Business Modeling from Microsoft Press and authored by course instructor Wayne Winston.

What you’ll learn

  • Descriptive statistics
  • Basic probability
  • Random variables
  • Sampling and confidence intervals
  • Hypothesis testing

Prerequisites

  • Secondary school (high school) algebra
  • Ability to work with tables, formulas, and charts in Excel
  • Ability to organize and summarize data using Excel analytic tools such as tables, pivot tables, and pivot charts
  • Excel 2016 is required for the full course experience. Excel 2013 will work but will not support all the visualizations and functions
Data Science - Data Science Research Methods (2 training options available)

Data Science Research Methods: Python Edition (Option 1)

Get hands-on experience with the science and research aspects of data science work, from setting up a proper data study to making valid claims and inferences from data experiments.

Summary

Length
6 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

Data scientists are often trained in the analysis of data. However, the goal of data science is to produce a good understanding of some problem or idea and build useful models on this understanding. Because of the principle of “garbage in, garbage out,” it is vital that a data scientist know how to evaluate the quality of information that comes into a data analysis. This is especially the case when data are collected specifically for some analysis (e.g., a survey).

In this course, you will learn the fundamentals of the research process—from developing a good question to designing good data collection strategies to putting results in context. Although a data scientist may often play a key part in data analysis, the entire research process must work cohesively for valid insights to be gleaned.

Developed as a powerful and flexible language used in everything from Data Science to cutting-edge and scalable Artificial Intelligence solutions, Python has become an essential tool for doing Data Science and Machine Learning. With this edition of Data Science Research Methods, all of the labs are done with Python, while the videos are language-agnostic. If you prefer your Data Science to be done with R, please see Data Science Research Methods: R Edition.

What you’ll learn

After completing this course, you will be familiar with the following concepts and techniques:

  • Data analysis and inference
  • Data science research design
  • Experimental data analysis and modeling

Prerequisites

  • A basic knowledge of math
  • Some programming experience – Python is preferred.
  • A willingness to learn through self-paced study.
Data Science Research Methods: R Edition (Option 2)

Get hands-on experience with the science and research aspects of data science work, from setting up a proper data study to making valid claims and inferences from data experiments.

Summary

Length
6 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

Data scientists are often trained in the analysis of data. However, the goal of data science is to produce good understanding of some problem or idea and build useful models on this understanding. Because of the principle of “garbage in, garbage out,” it is vital that the data scientist know how to evaluate the quality of information that comes into a data analysis. This is especially the case when data are collected specifically for some analysis (e.g., a survey).

In this course, you will learn the fundamentals of the research process—from developing a good question to designing good data collection strategies to putting results in context. Although the data scientist may often play a key part in data analysis, the entire research process must work cohesively for valid insights to be gleaned.

Developed as a language with statistical analysis and modeling in mind, R has become an essential tool for doing real-world Data Science. With this edition of Data Science Research Methods, all of the labs are done with R, while the videos are tool-agnostic. If you prefer your Data Science to be done with Python, please see Data Science Research Methods: Python Edition.

What you’ll learn

  • Descriptive statistics
  • Basic probability
  • Random variables
  • Sampling and confidence intervals
  • Hypothesis testing

Prerequisites

  • Secondary school (high school) algebra
  • Ability to work with tables, formulas, and charts in Excel
  • Ability to organize and summarize data using Excel analytic tools such as tables, pivot tables, and pivot charts
  • Excel 2016 is required for the full course experience. Excel 2013 will work but will not support all the visualizations and functions
Data Science - Principles of Machine Learning (2 training options available)

Principles of Machine Learning: R Edition (Option 1)

Get hands-on experience building and deriving insights from machine learning models using R and Azure Notebooks.

Summary

Length
6 weeks (6 to 8 hours per week)
Level
Intermediate
Language
English

About this course

Machine learning uses computers to run predictive models that learn from existing data in order to forecast future behaviours, outcomes, and trends.

In this data science course, you will be given clear explanations of machine learning theory combined with practical scenarios and hands-on experience building, validating, and deploying machine learning models. You will learn how to build and derive insights from these models using R and Azure Notebooks.

What you’ll learn

  • Data exploration, preparation and cleaning
  • Supervised machine learning techniques
  • Unsupervised machine learning techniques
  • Model performance improvement

Prerequisites

  • A basic knowledge of math
  • Some programming experience – R is preferred.
  • A willingness to learn through self-paced study.
Principles of Machine Learning: Python Edition (Option 2)

Get hands-on experience building and deriving insights from machine learning models using Python and Azure Notebooks.

Summary

Length
6 weeks (6 to 8 hours per week)
Level
Intermediate
Language
English

About this course

Machine learning uses computers to run predictive models that learn from existing data in order to forecast future behaviours, outcomes, and trends.

In this data science course, you will be given clear explanations of machine learning theory combined with practical scenarios and hands-on experience building, validating, and deploying machine learning models. You will learn how to build and derive insights from these models using Python and Azure Notebooks.

What you’ll learn

  • Data exploration, preparation and cleaning
  • Supervised machine learning techniques
  • Unsupervised machine learning techniques
  • Model performance improvement

Prerequisites

  • A basic knowledge of math
  • Some programming experience – Python is preferred.
  • A willingness to learn through self-paced study.
Data Science - Big Data and Predictive Analytics (3 training options available)

Developing Big Data Solutions with Azure Machine Learning (Option 1)

Learn how to build predictive solutions for big data using Microsoft Azure Machine Learning.

Summary

Length
4 weeks (3 to 4 hours per week)
Level
Intermediate
Language
English

About this course

The past can often be the key to predicting the future. Big data from historical sources is a valuable resource for identifying trends and building machine learning models that apply statistical patterns and predict future outcomes.

This course introduces Azure Machine Learning, and explores techniques and considerations for using it to build models from big data sources, and to integrate predictive insights into big data processing workflows.

What you’ll learn

  • How to create predictive web services with Azure Machine Learning
  • How to work with big data sources in Azure Machine Learning
  • How to integrate Azure Machine Learning into big data batch processing pipelines
  • How to integrate Azure Machine Learning into real-time big data processing solutions

Prerequisites

This course assumes some knowledge of:

  • Building data processing pipelines with Azure Data Factory
  • Building real-time data processing solutions with Azure Stream Analytics

Course Syllabus

  • Module 1: Introduction to Azure Machine Learning
  • Module 2: Building Predictive Models with Azure Machine Learning
  • Module 3: Operationalizing Machine Learning Models
  • Module 4: Using Azure Machine Learning in Big Data Solutions
  • Analysing Big Data with Microsoft R (Option 2)

    Learn how to use Microsoft R Server to analyze large datasets using R, one of the most powerful programming languages.

    Summary

    Length
    4 weeks(2 to 4 hours per week)
    Level
    Intermediate
    Language
    English

    About this course

    The open-source programming language R has for a long time been popular (particularly in academia) for data processing and statistical analysis. Among R’s strengths are that it’s a succinct programming language and has an extensive repository of third party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to very quickly go from raw data to summaries, charts, and even full-blown reports. However, one deficiency with R is that traditionally it uses a lot of memory, both because it needs to load a copy of the data in its entirety as a data.frame object, and also because processing the data often involves making further copies (sometimes referred to as copy-on-modify). This is one of the reasons R has been more reluctantly received by industry compared to academia.

    The main component of Microsoft R Server (MRS) is the RevoScaleR package, an R library that provides functions for processing large datasets without having to load them into memory all at once. RevoScaleR offers a rich set of distributed statistical and machine learning algorithms, and more are added over time. Finally, RevoScaleR also provides a mechanism for taking code developed on a laptop and deploying it, with minimal effort, to a remote environment such as SQL Server or Spark, where the underlying infrastructure is very different.
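
    To illustrate the general idea of processing data without loading it all at once, here is a minimal sketch in plain Python. It is not RevoScaleR and does not use its API; it only shows how a summary statistic can be built up one chunk at a time and merged. The function names, the `fare` column, and the tiny inline dataset are all assumptions made for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean_variance(values, chunk_size=1000):
    """Return (count, mean, sample variance), holding only one chunk in memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for chunk in iter_chunks(values, chunk_size):
        # Summarize the current chunk on its own.
        n_chunk = len(chunk)
        mean_chunk = sum(chunk) / n_chunk
        m2_chunk = sum((x - mean_chunk) ** 2 for x in chunk)

        # Merge the chunk summary into the running summary
        # (pairwise update for the sum of squared deviations).
        delta = mean_chunk - mean
        total = count + n_chunk
        m2 += m2_chunk + delta ** 2 * count * n_chunk / total
        mean += delta * n_chunk / total
        count = total

    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def read_column(file_obj, column):
    """Lazily yield one numeric column from a CSV file object."""
    for row in csv.DictReader(file_obj):
        yield float(row[column])


# Hypothetical data: in practice this would be a file too large for memory.
sample = io.StringIO("fare\n2.0\n4.0\n4.0\n4.0\n5.0\n5.0\n7.0\n9.0\n")
count, mean, variance = chunked_mean_variance(read_column(sample, "fare"), chunk_size=3)
print(count, mean, variance)
```

    Because each chunk is reduced to a small summary before the next one is read, memory use depends on the chunk size and not on the size of the dataset.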

    In this course, we will show you how to use MRS to run an analysis on a large dataset and provide some examples of how to deploy it on a Spark cluster or a SQL Server database. Upon completion, you will know how to use R for big-data problems.
    Since RevoScaleR is an R package, we assume that course participants are familiar with R. A solid understanding of R data structures (vectors, matrices, lists, data frames, environments) is required. Familiarity with third-party packages such as dplyr is also helpful.

    What you’ll learn

    You will learn how to use MRS to read, process, and analyze large datasets, including how to:

    • Read data from flat files into R’s data frame object, investigate the structure of the dataset and make corrections, and store prepared datasets for later use
    • Prepare and transform the data
    • Calculate essential summary statistics, do cross tabulation, write your own summary functions, and visualize data with the ggplot2 package
    • Build predictive models, evaluate and compare models, and generate predictions on new data

    Prerequisites

    • Familiarity with R

    Implementing Predictive Analytics with Spark in Azure HDInsight (Option 3)

    Learn how to use Spark in Microsoft Azure HDInsight to create predictive analytics and machine learning solutions.

    Summary

    Length
    6 weeks (3 to 4 hours per week)
    Level
    Intermediate
    Language
    English

    About this course

    Are you ready for big data science? In this course, learn how to implement predictive analytics solutions for big data using Apache Spark in Microsoft Azure HDInsight. See how to work with Scala or Python to cleanse and transform data and build machine learning models with Spark ML (the machine learning library in Spark).

    Note: To complete the hands-on elements in this course, you will require an Azure subscription and a Windows client computer. You can sign up for a free Azure trial subscription (a valid credit card is required for verification, but you will not be charged for Azure services). Note that the free trial is not available in all regions.

    What you’ll learn

    • Use Spark to explore data and prepare it for modeling
    • Build supervised machine learning models
    • Evaluate and optimize models
    • Build recommenders and unsupervised machine learning models

    Prerequisites

    • Familiarity with Azure HDInsight.
    • Familiarity with databases and SQL.
    • Some programming experience.
    • A willingness to learn actively in a self-paced manner.

    Course Syllabus

    • Introduction to Data Science with Spark

      Get started with Spark clusters in Azure HDInsight, and use Spark to run Python or Scala code to work with data.

    • Getting Started with Machine Learning

      Learn how to build classification and regression models using the Spark ML library.

    • Evaluating Machine Learning Models

      Learn how to evaluate supervised learning models, and how to optimize model parameters.

    • Recommenders and Unsupervised Models

      Learn how to build recommenders and clustering models using Spark ML.
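
    As a rough illustration of what evaluating a supervised model and tuning a parameter can involve, here is a small sketch in plain Python. It does not use Spark ML; the labels, scores, candidate thresholds, and function names are made up for the example, and the decision threshold stands in for whichever model parameter is being tuned.

```python
def confusion_counts(labels, predictions):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall; 0.0 when either is undefined."""
    tp, fp, fn, _ = confusion_counts(labels, predictions)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def apply_threshold(scores, threshold):
    """Turn predicted scores into 0/1 class predictions."""
    return [1 if s >= threshold else 0 for s in scores]


def best_threshold(labels, scores, candidates):
    """Pick the candidate threshold with the highest F1 on the given data."""
    return max(candidates, key=lambda t: f1_score(labels, apply_threshold(scores, t)))


# Hypothetical validation data: true labels and model scores.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
chosen = best_threshold(labels, scores, candidates=[0.3, 0.5, 0.65])
print(chosen, f1_score(labels, apply_threshold(scores, chosen)))
```

    In practice the search should score candidates on held-out validation data, so that the chosen value is not fitted to the same rows the model was trained on.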

    Microsoft Professional Capstone: Data Science

    The capstone project is offered directly by Microsoft and runs once per quarter, in January, April, July and October.

    Enroll in the full MPP track here during the month before the capstone starts, using the Microsoft account you used to register on Azure Academy so that your progress is synced.

    To have your progress synced with Azure Academy and to be eligible for the capstone project, you must hold a Certificate of Completion from Azure Academy for each of the 10 required courses.

    Need help?

    If you have questions about our courses, check our FAQs or get in touch with us here.