MPP: Big Data

Capture, process and analyse data for a data-driven world.

It’s your career, think big.

Designing systems that capture, process, and analyze data is critical for organizations seeking a competitive advantage. This curriculum takes students from their first SELECT statement to orchestrating big data workflows in the cloud.

Summary

Who takes this course

Aspiring Big Data Solutions builders

Difficulty

Advanced

Assessment

The courses have a final assessment with re-take restrictions. Successfully completing the final assessment will enable you to redeem your certificate of completion.

Certification

The Microsoft Professional Program track is completed by finishing the Capstone project; there is no exam for this track.

Completion time

172-258 hours

Curriculum

Introduction to Big Data

Get started on your journey to building Big Data solutions.

Summary

Length
3 weeks (3 to 4 hours per week)
Level
Introductory
Language
English

About this course

Learn what it takes to build Big Data analytics solutions.

This is the first stop in the Big Data curriculum from Microsoft. It will help you get started with the curriculum, plan your learning schedule, and connect with fellow students and teaching assistants. Along the way, you’ll get an introduction to working with data and some fundamental concepts and technologies for Big Data scenarios.

What you’ll learn

  • How the Microsoft Big Data curriculum works
  • An introduction to data formats, technologies, and techniques
  • Fundamentals of Databases
  • Basic principles for working with Big Data

Course Syllabus

  • Module 1: Introduction
  • Module 2: Data Basics
  • Module 3: Fundamentals of Databases
  • Module 4: Introduction to Big Data

Analyze and Visualize Data (2 training options available)

Analyzing and Visualizing Data with Power BI (Option 1)

Learn Power BI, a powerful cloud-based service that helps data scientists visualize and share insights from their data.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Introductory
Language
English

About this course

Power BI is quickly gaining popularity among professionals in data science as a cloud-based service that helps them easily visualize and share insights from their organizations’ data.

In this data science course, you will learn from the Power BI product team at Microsoft with a series of short, lecture-based videos, complete with demos, quizzes, and hands-on labs. You’ll walk through Power BI, end to end, starting from how to connect to and import your data, author reports using Power BI Desktop, and publish those reports to the Power BI service. Plus, learn to create dashboards and share with business users—on the web and on mobile devices.

What you’ll learn

  • Connect, import, shape, and transform data for business intelligence (BI)
  • Visualize data, author reports, and schedule automated refresh of your reports
  • Create and share dashboards based on reports in Power BI Desktop and Excel
  • Use natural language queries
  • Create real-time dashboards

Prerequisites

Students who take this training should understand:

  • Some experience in working with data from Excel, databases, or text files.

Course Syllabus

Week 1

  • Understanding key concepts in business intelligence, data analysis, and data visualization
  • Importing your data and automatically creating dashboards from services such as Marketo, Salesforce, and Google Analytics
  • Connecting to and importing your data, then shaping and transforming that data
  • Enriching your data with business calculations

Week 2

  • Visualizing your data and authoring reports
  • Scheduling automated refresh of your reports
  • Creating dashboards based on reports and natural language queries
  • Sharing dashboards across your organization
  • Consuming dashboards in mobile apps

Week 3

  • Leveraging your Excel reports within Power BI
  • Creating custom visualizations that you can use in dashboards and reports
  • Collaborating within groups to author reports and dashboards
  • Sharing dashboards effectively based on your organization’s needs

Week 4

  • Exploring live connections to data with Power BI
  • Connecting directly to Azure SQL Database, Spark on Azure HDInsight, and SQL Server Analysis Services
  • Introduction to Power BI Development API
  • Leveraging custom visuals in Power BI

Analyzing and Visualizing Data with Excel (Option 2)

Develop your skills with Excel, one of the common tools that data scientists depend on to gather, transform, analyze, and visualize data.

Summary

Length
6 weeks (2 to 4 hours per week)
Level
Intermediate
Language
English

About this course

Excel is one of the most widely used solutions for analyzing and visualizing data. It now includes tools that enable the analysis of more data, with improved visualizations and more sophisticated business logic. In this data science course, you will get an introduction to the latest versions of these new tools in Excel 2016 from an expert on the Excel Product Team at Microsoft.

Learn how to import data from different sources, create mashups between data sources, and prepare data for analysis. After preparing the data, find out how business calculations can be expressed using the DAX calculation engine. See how the data can be visualized and shared to the Power BI cloud service, after which it can be used in dashboards, queried using plain English sentences, and even consumed on mobile devices.

Do you feel that the content of this course is a bit too advanced, and that you need to fill some gaps in your Excel knowledge? Do you need a better understanding of how pivot tables, pivot charts, and slicers work together, and help creating dashboards? If so, check out DAT205x: Introduction to Data Analysis using Excel.

What you’ll learn

  • Gather and transform data from multiple sources
  • Discover and combine data in mashups
  • Learn about data model creation
  • Explore, analyze, and visualize data

Prerequisites

Students who take this training should understand:

  • Understanding of Excel analytic tools such as tables, pivot tables, and pivot charts. Some experience working with data from databases and text files will also be helpful.

System Requirements:

  • Windows operating system: Windows 7 or later.
  • Microsoft Excel on Windows operating system:
      • Microsoft Excel 2016 Professional Plus or standalone edition
      • Microsoft Excel 2013 Professional Plus or standalone edition
      • Microsoft Excel 2010
  • Other versions of Microsoft Excel are not supported.

Work with NoSQL Data (2 training options available)

Developing Planet-Scale Applications in Azure Cosmos DB (Option 1)

Gain an in-depth understanding of Azure Cosmos DB—a multi-model, global-scale database from Microsoft that transparently replicates your data wherever your users are.

Summary

Length
4 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

If you’re familiar with NoSQL in Azure and the platform’s powerful non-relational data storage options, take the next step! Join us for an in-depth look at developing NoSQL apps in super-scalable Azure Cosmos DB—the distributed, multi-model database from Microsoft that transparently replicates your data wherever your users are. Learn about its broad, global-scale features and capabilities. Then, go deeper into some of the APIs available in Azure Cosmos DB for storing different kinds of NoSQL data.

We’ll start with a look at general concepts, including partitioning schemes, global replications, hierarchy, security, and more, as you learn to develop document, key/value, or graph databases with Cosmos DB using a series of popular APIs and programming models.

Plus, we’ll work with API specifics for DocumentDB, Gremlin, MongoDB, and Tables, and conclude with a look at real-world integrations, visualizations, and analyses, such as the Spark Connector, Azure Search, and Stream Analytics.

What you’ll learn

  • Understand Azure Cosmos DB’s core features and capabilities
  • Store structured data using the Tables API
  • Develop document databases with the DocumentDB API
  • Build graph databases with the Gremlin API
  • Build databases using the MongoDB API
  • Integrate Cosmos DB with other data services

Introduction to NoSQL Data Solutions (Option 2)

Learn the fundamentals of NoSQL and explore several non-relational data storage options in Microsoft Azure.

Summary

Length
3 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

As a data pro, you know that some scenarios—particularly those involving real-time analytics, site personalization, IoT, and mobile apps—are better addressed with NoSQL storage and compute solutions than they are with relational databases. Microsoft Azure has several NoSQL (or “Not Only SQL”) non-relational data storage options to choose from. NoSQL databases are generally built to be distributed and partitioned across many servers. And they’re built to scale out for high availability and to be flexible enough to handle semi-structured and unstructured data. If you have a data model that is constantly evolving and you want to move fast, that’s what these databases are about.
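
The partitioning idea behind these databases can be sketched in a few lines of Python. This is a toy in-memory store, not any Azure API; the device/timestamp keys are invented for illustration:

```python
# Conceptual sketch (a toy in-memory store, not an Azure API): a NoSQL store
# hashes a partition key to decide which server holds an entity, so reads
# and writes for one key touch only one partition.
import hashlib

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]  # one dict per "server"

def partition_for(partition_key: str) -> int:
    # A stable hash, so the same key always maps to the same partition.
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def put(partition_key: str, row_key: str, entity: dict) -> None:
    partitions[partition_for(partition_key)][(partition_key, row_key)] = entity

def get(partition_key: str, row_key: str) -> dict:
    # Only the owning partition is consulted.
    return partitions[partition_for(partition_key)][(partition_key, row_key)]

put("device-42", "2017-01-01T00:00:00", {"temp": 21.5})
print(get("device-42", "2017-01-01T00:00:00"))  # {'temp': 21.5}
```

Adding more partitions (servers) spreads the keys further, which is what lets these systems scale out horizontally.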

In this practical course, complete with labs, assessments, and a final exam, join the experts to learn how NoSQL has evolved over time. Explore non-relational data storage options in Azure, and see how to use them in your applications. Find out how to create, store, manage, and access data in these different storage options. Get an in-depth look at Azure Table Storage, DocumentDB, MongoDB, and more. Learn about the “three Vs”—variety (schemas or scenarios that evolve quickly), volume (scale in terms of data storage), and velocity (throughput needs to support a large user base). Take this opportunity to get hands-on with NoSQL options in Azure.

What you’ll learn

  • NoSQL fundamentals
  • NoSQL options in Microsoft Azure
  • Core techniques for using DocumentDB, Azure Table Storage, and MongoDB
  • Other techniques for accessing and improving performance of your NoSQL storage

Prerequisites

Students who take this training should understand:

  • Relational Database Fundamentals
  • T-SQL Querying
  • Basic understanding of HTTP APIs and Requests

Querying Data with Transact-SQL

From querying and modifying data in SQL Server or Azure SQL to programming with Transact-SQL, learn essential skills that employers need.

Summary

Length
6 weeks (4 to 5 hours per week)
Level
Intermediate
Language
English

About this course

Transact-SQL is an essential skill for data professionals and developers working with SQL databases. With this combination of expert instruction, demonstrations, and practical labs, step from your first SELECT statement through to implementing transactional programmatic logic.

Work through multiple modules, each of which explores a key area of the Transact-SQL language, with a focus on querying and modifying data in Microsoft SQL Server or Azure SQL Database. The labs in this course use a sample database that can be deployed easily in Azure SQL Database, so you get hands-on experience with Transact-SQL without installing or configuring a database server.

What you’ll learn

  • Create Transact-SQL SELECT queries
  • Work with data types and NULL
  • Query multiple tables with JOIN
  • Explore set operators
  • Use functions and aggregate data
  • Work with subqueries and APPLY
  • Use table expressions
  • Group sets and pivot data
  • Modify data
  • Program with Transact-SQL
  • Implement error handling and transactions
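
As a self-contained taste of the querying skills above, here is a minimal sketch using Python’s built-in sqlite3 module. The course itself targets T-SQL on SQL Server or Azure SQL Database, and the Customer/SalesOrder tables below are invented for illustration:

```python
# A minimal SELECT / JOIN / GROUP BY example run against an in-memory
# SQLite database (the course uses T-SQL; this sketch shows the same core
# query shape with Python's standard-library sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE SalesOrder (OrderID INTEGER PRIMARY KEY,
                             CustomerID INTEGER, Amount REAL);
    INSERT INTO Customer VALUES (1, 'Contoso'), (2, 'Fabrikam');
    INSERT INTO SalesOrder VALUES (10, 1, 250.0), (11, 1, 120.0), (12, 2, 80.0);
""")

# JOIN two tables, aggregate per customer, and order the result --
# the core pattern behind most reporting queries.
rows = conn.execute("""
    SELECT c.Name, COUNT(o.OrderID) AS Orders, SUM(o.Amount) AS Total
    FROM Customer AS c
    JOIN SalesOrder AS o ON o.CustomerID = c.CustomerID
    GROUP BY c.Name
    ORDER BY Total DESC;
""").fetchall()

print(rows)  # [('Contoso', 2, 370.0), ('Fabrikam', 1, 80.0)]
```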

Delivering a Data Warehouse in the Cloud

Learn how to deploy, design, and load data using Microsoft’s Azure SQL Data Warehouse.

Summary

Length
6 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

When you need to scale your data warehouse’s storage and processing capabilities in minutes, not months, you need a cloud-based massively parallel processing solution.

In this computer science course, you will learn how to deploy, design, and load data using Microsoft’s Azure SQL Data Warehouse, or SQL DW. You’ll learn about data distribution, compressed in-memory indexes, PolyBase for Big Data, and elastic scale.

Note: To complete the hands-on elements in this course, you will require an Azure subscription. You can sign up for a free Azure trial subscription (a valid credit card is required for verification, but you will not be charged for Azure services). Note that the free trial is not available in all regions. It is possible to complete the course and earn a certificate without completing the hands-on practices.

What you’ll learn

  • Choosing a massively parallel processing architecture for a cloud-based data warehouse.
  • Designing tables and indexes to efficiently distribute data in tables across many nodes.
  • Loading data from a variety of sources, querying using PolyBase, securing and recovering data, and integrating into Big Data environments.
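
The distribution idea behind that table design can be sketched as follows. This is illustrative Python only, not SQL DW internals; the sales table and its customer column are made up:

```python
# Illustrative sketch (plain Python, not SQL DW internals): an MPP warehouse
# assigns each row of a table to one of many compute nodes, either
# round-robin or by hashing a chosen distribution column.
NUM_NODES = 4

def hash_distribute(rows, column):
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[column]) % NUM_NODES].append(row)
    return nodes

def round_robin_distribute(rows):
    nodes = [[] for _ in range(NUM_NODES)]
    for i, row in enumerate(rows):
        nodes[i % NUM_NODES].append(row)
    return nodes

# A hypothetical sales table with 8 distinct customers.
sales = [{"customer": "c%d" % (i % 8), "amount": i} for i in range(100)]

by_hash = hash_distribute(sales, "customer")
# With hash distribution, every row for a given customer lands on the same
# node, so a GROUP BY customer needs no data movement between nodes.
owners = {}
for node_id, node in enumerate(by_hash):
    for row in node:
        owners.setdefault(row["customer"], set()).add(node_id)
print(all(len(node_ids) == 1 for node_ids in owners.values()))  # True
```

Round-robin spreads rows evenly but co-locates nothing; hash distribution on a well-chosen column is what makes joins and aggregations on that column cheap.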

Prerequisites

Students who take this training should understand:

  • Familiarity with database concepts and basic SQL query syntax
  • Familiarity with the reporting and analytics needs of users
  • A willingness to learn actively and persevere when troubleshooting technical problems is essential

Course Syllabus

Module 1: Introducing SQL Data Warehouse

This module introduces Azure SQL Data Warehouse, Microsoft’s data warehouse in the cloud. You’ll learn about massively parallel processing and how to provision and configure SQL DW.

Module 2: Designing and Querying Data Warehouses

This module covers table design, partitioning, indexes and statistics. It introduces elastic query and tools for monitoring queries.

Module 3: Integrating and Ingesting Data

This module covers loading data into SQL DW with Azure Data Factory, Polybase, and Azure Stream Analytics. It also covers integrating with Azure Machine Learning, and visualizing data with Power BI.

Module 4: Managing Data Warehouses

This module covers monitoring and managing SQL DW workloads and performance, security, scaling, and managing backups.

Final Exam

The final exam accounts for 30% of your grade and will be combined with the weekly quizzes to determine your overall score. You must achieve an overall score of 70% or higher to pass this course and earn a certificate.

Process Big Data at Rest (2 training options available)

Processing Big Data with Azure Data Lake Analytics (Option 1)

Summary

Length
4 weeks (3 to 4 hours per week)
Level
Advanced
Language
English

About this course

Want to store and process data at scale? This data analysis course teaches you how to apply the power of the Azure cloud to big data using Azure Data Lake technologies.
Learn how to manage data in Azure Data Lake Store and run U-SQL jobs in Azure Data Lake Analytics to generate insights from structured and unstructured data sources.

Note: To complete this course, you will need a Microsoft Azure subscription. You can sign up for a free trial subscription at http://azure.microsoft.com, or you can use your existing subscription. The labs have been designed to minimize the resource costs required to complete the hands-on activities.

What you’ll learn

  • Use Azure Data Lake technologies to store and process data with U-SQL jobs
  • Create and use U-SQL catalog objects
  • Extend your data processing scripts with custom C# code
  • Monitor and optimize U-SQL jobs

Syllabus

  • Module 1: Getting Started with Azure Data Lake Analytics
  • Module 2: Using a U-SQL Catalog
  • Module 3: Using C# Functions in U-SQL
  • Module 4: Monitoring and Optimizing U-SQL Jobs

Processing Big Data with Azure HDInsight (Option 2)

Summary

Length
5 weeks (3 to 5 hours per week)
Level
Intermediate
Language
English

About this course

More and more organizations are taking on the challenge of analyzing big data. This course teaches you how to use the Hadoop technologies in Microsoft Azure HDInsight to build batch processing solutions that cleanse and reshape data for analysis. In this five-week course, you’ll learn how to use technologies like Hive, Pig, Oozie, and Sqoop with Hadoop in HDInsight; and how to work with HDInsight clusters from Windows, Linux, and Mac OSX client computers.

NOTE: To complete the hands-on elements in this course, you will require an Azure subscription and a Windows, Linux, or Mac OS X client computer. You can sign up for a free Azure trial subscription (a valid credit card is required for verification, but you will not be charged for Azure services). Note that the free trial is not available in all regions. It is possible to complete the course and earn a certificate without completing the hands-on practices.

What you’ll learn

  • Provision an HDInsight cluster.
  • Connect to an HDInsight cluster, upload data, and run MapReduce jobs.
  • Use Hive to store and process data.
  • Process data using Pig.
  • Use custom Python user-defined functions from Hive and Pig.
  • Define and run workflows for data processing using Oozie.
  • Transfer data between HDInsight and databases using Sqoop.
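
The batch-processing model underlying Hadoop jobs like these is MapReduce. Here is a minimal sketch of its map → shuffle → reduce shape, written as plain Python functions rather than Hadoop Streaming itself:

```python
# The classic MapReduce "word count", as plain Python: a mapper emits
# (key, value) pairs, a shuffle phase groups values by key, and a reducer
# aggregates each group (a conceptual sketch, not Hadoop Streaming).
from collections import defaultdict

def mapper(line):
    # Emit one ('word', 1) pair per word in the input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

lines = ["big data big insights", "data at rest"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'at': 1, 'rest': 1}
```

Hive and Pig compile their higher-level queries and scripts down to jobs with this same shape.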

Prerequisites

  • Familiarity with database concepts and basic SQL query syntax
  • Familiarity with programming fundamentals
  • A willingness to learn actively and persevere

Process Big Data in Motion (2 training options available)

Processing Real-Time Data Streams in Azure (Option 1)

Summary

Length
4 weeks (3 to 4 hours per week)
Level
Advanced
Language
English

About this course

Want to capture and process real-time data in the cloud? This data analysis course teaches you how to use Microsoft Azure technologies like Event Hubs, IoT Hubs, and Stream Analytics to build real-time Internet-of-Things (IoT) solutions at scale.

Note: To complete this course, you will need a Microsoft Azure subscription. You can sign up for a free trial subscription at http://azure.microsoft.com, or you can use your existing subscription. The labs have been designed to minimize the resource costs required to complete the hands-on activities.

What you’ll learn

  • Capturing real-time data with Azure Event Hubs and IoT Hubs
  • Processing real-time data with Azure Stream Analytics
  • Aggregating data in temporal windows
  • Monitoring streaming solutions in Azure
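
The simplest of those temporal windows is the tumbling window: events are bucketed into fixed, non-overlapping intervals and aggregated per bucket. A minimal Python sketch of the idea (Stream Analytics itself expresses this declaratively in its SQL-like query language; the sensor readings are invented):

```python
# Tumbling-window aggregation: fixed-size, non-overlapping time buckets,
# summed per bucket (illustrative sketch of the windowing concept only).
from collections import defaultdict

def tumbling_window(events, size_seconds):
    """events: (timestamp_seconds, value) pairs -> {window_start: sum}."""
    windows = defaultdict(float)
    for ts, value in events:
        window_start = ts - (ts % size_seconds)  # bucket the event falls in
        windows[window_start] += value
    return dict(windows)

readings = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 1.0), (12, 7.0)]
print(tumbling_window(readings, 5))  # {0: 3.0, 5: 5.0, 10: 7.0}
```

Hopping and sliding windows refine the same idea by letting buckets overlap.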

Syllabus

  • Module 1: Getting Started with Azure Event Hubs and IoT Hubs
  • Module 2: Using Azure Stream Analytics
  • Module 3: Aggregating Data in Temporal Windows
  • Module 4: Monitoring a Streaming Solution

Processing Real-Time Data with Azure HDInsight (Option 2)

Summary

Length
4 weeks (2 to 3 hours per week)
Level
Intermediate
Language
English

About this course

In this four-week course, you’ll learn how to implement low-latency and streaming Big Data solutions using Hadoop technologies like HBase, Storm, and Spark on Microsoft Azure HDInsight.

Note: To complete the hands-on elements in this course, you will require an Azure subscription and a Windows, Linux, or Mac OS X client computer. You can sign up for a free Azure trial subscription (a valid credit card is required for verification, but you will not be charged for Azure services). Note that the free trial is not available in all regions. It is possible to complete the course and earn a certificate without completing the hands-on practices.

This course is the second in a series that explores big data and advanced analytics techniques with HDInsight; and builds on the batch processing techniques learned in DAT202.1x: Processing Big Data with Hadoop in Azure HDInsight.

What you’ll learn

  • HBase to implement low-latency NoSQL data stores.
  • Storm to implement real-time streaming analytics solutions.
  • Spark for high-performance interactive data analysis.

Prerequisites

  • Familiarity with Hadoop clusters and Hive in HDInsight
  • Familiarity with database concepts and basic SQL query syntax
  • Familiarity with basic programming constructs (for example, variables, loops, conditional logic). Experience with Java or C# is useful but not essential
  • A willingness to learn actively and persevere when troubleshooting technical problems is essential

Syllabus

  • Module 1: Using HBase for NoSQL Data
  • Module 2: Using Storm for Streaming Data
  • Module 3: Using Spark for Interactive Analysis
  • Module 4: Final Exam

Orchestrating Big Data with Azure Data Factory

Summary

Length
4 weeks (3 to 4 hours per week)
Level
Advanced
Language
English

About this course

Need to schedule and manage big data workflows? This data analysis course teaches you how to use Azure Data Factory to coordinate the movement and transformation of data using technologies such as Hadoop, SQL, and Azure Data Lake Analytics. You will learn how to create data pipelines that group activities together to perform a task.

Note: To complete this course, you will need a Microsoft Azure subscription. You can sign up for a free trial subscription at http://azure.microsoft.com, or you can use your existing subscription. The labs have been designed to minimize the resource costs required to complete the hands-on activities.

What you’ll learn

  • Creating data workflows with Azure Data Factory
  • Scheduling data pipelines to orchestrate big data processes
  • Applying data transformations in a pipeline with Hive or U-SQL
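
The pipeline idea can be illustrated with a toy sketch: a pipeline groups activities, and each activity consumes what the previous one produced. This is purely conceptual; Data Factory pipelines are defined as JSON and executed by the service, and the ingest/transform/publish activities here are invented:

```python
# Toy pipeline: an ordered group of activities, each feeding the next
# (conceptual only -- not how Azure Data Factory is actually programmed).
def ingest(_):
    # Pretend to pull raw CSV lines from a source store.
    return ["2017-01-01,42", "2017-01-02,17"]

def transform(rows):
    # Parse CSV lines into (date, value) records.
    return [(date, int(value)) for date, value in
            (row.split(",") for row in rows)]

def publish(records):
    # Pretend to load the records into a sink and report what happened.
    return {"rows_published": len(records)}

pipeline = [ingest, transform, publish]

def run_pipeline(activities):
    data = None
    for activity in activities:  # each activity consumes the previous output
        data = activity(data)
    return data

print(run_pipeline(pipeline))  # {'rows_published': 2}
```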

Syllabus

  • Module 1: Getting Started with Azure Data Factory
  • Module 2: Scheduling Pipelines
  • Module 3: Transforming Data in Pipelines

Build Big Data Analysis Solutions (3 training options available)

Developing Big Data Solutions with Azure Machine Learning (Option 1)

Learn how to build predictive solutions for big data using Microsoft Azure Machine Learning.

Summary

Length
4 weeks (3 to 4 hours per week)
Level
Intermediate
Language
English

About this course

The past can often be the key to predicting the future. Big data from historical sources is a valuable resource for identifying trends and building machine learning models that apply statistical patterns and predict future outcomes.

This course introduces Azure Machine Learning, and explores techniques and considerations for using it to build models from big data sources, and to integrate predictive insights into big data processing workflows.

What you’ll learn

  • How to create predictive web services with Azure Machine Learning
  • How to work with big data sources in Azure Machine Learning
  • How to integrate Azure Machine Learning into big data batch processing pipelines
  • How to integrate Azure Machine Learning into real-time big data processing solutions

Prerequisites

This course assumes some knowledge of:

  • Building data processing pipelines with Azure Data Factory
  • Building real-time data processing solutions with Azure Stream Analytics

Course Syllabus

  • Module 1: Introduction to Azure Machine Learning
  • Module 2: Building Predictive Models with Azure Machine Learning
  • Module 3: Operationalizing Machine Learning Models
  • Module 4: Using Azure Machine Learning in Big Data Solutions

Analyzing Big Data with Microsoft R (Option 2)

Learn how to use Microsoft R Server to analyze large datasets using R, one of the most powerful programming languages.

Summary

Length
4 weeks (2 to 4 hours per week)
Level
Intermediate
Language
English

About this course

The open-source programming language R has long been popular, particularly in academia, for data processing and statistical analysis. Among R’s strengths are that it is a succinct programming language with an extensive repository of third-party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to go very quickly from raw data to summaries, charts, and even full-blown reports. However, one deficiency of R is that it traditionally uses a lot of memory: it loads a copy of the data in its entirety as a data.frame object, and processing the data often involves making further copies (sometimes referred to as copy-on-modify). This is one reason R has been adopted more slowly in industry than in academia.

The main component of Microsoft R Server (MRS) is the RevoScaleR package, an R library that offers a set of functionalities for processing large datasets without having to load them into memory all at once. RevoScaleR offers a rich set of distributed statistical and machine learning algorithms, which are added to over time. Finally, RevoScaleR also offers a mechanism for taking code developed on a laptop and deploying it on a remote server such as SQL Server or Spark (where the infrastructure is very different under the hood) with minimal effort.

In this course, we will show you how to use MRS to run an analysis on a large dataset and provide examples of how to deploy it on a Spark cluster or a SQL Server database. Upon completion, you will know how to use R for big-data problems. Since RevoScaleR is an R package, we assume that course participants are familiar with R. A solid understanding of R data structures (vectors, matrices, lists, data frames, environments) is required. Familiarity with third-party packages such as dplyr is also helpful.

What you’ll learn

You will learn how to use MRS to read, process, and analyze large datasets, including how to:

  • Read data from flat files into R’s data frame object, investigate the structure of the dataset and make corrections, and store prepared datasets for later use
  • Prepare and transform the data
  • Calculate essential summary statistics, do cross-tabulation, write your own summary functions, and visualize data with the ggplot2 package
  • Build predictive models, evaluate and compare models, and generate predictions on new data

Prerequisites

  • Familiarity with R

Implementing Predictive Analytics with Spark in Azure HDInsight (Option 3)

Learn how to use Spark in Microsoft Azure HDInsight to create predictive analytics and machine learning solutions.

Summary

Length
6 weeks (3 to 4 hours per week)
Level
Intermediate
Language
English

About this course

Are you ready for big data science? In this course, learn how to implement predictive analytics solutions for big data using Apache Spark in Microsoft Azure HDInsight. See how to work with Scala or Python to cleanse and transform data and build machine learning models with Spark ML (the machine learning library in Spark).

Note: To complete the hands-on elements in this course, you will require an Azure subscription and a Windows client computer. You can sign up for a free Azure trial subscription (a valid credit card is required for verification, but you will not be charged for Azure services). Note that the free trial is not available in all regions.

What you’ll learn

  • Use Spark to explore data and prepare it for modeling
  • Build supervised machine learning models
  • Evaluate and optimize models
  • Build recommenders and unsupervised machine learning models

Prerequisites

  • Familiarity with Azure HDInsight.
  • Familiarity with databases and SQL.
  • Some programming experience.
  • A willingness to learn actively in a self-paced manner.

Course Syllabus

  • Introduction to Data Science with Spark

    Get started with Spark clusters in Azure HDInsight, and use Spark to run Python or Scala code to work with data.

  • Getting Started with Machine Learning

    Learn how to build classification and regression models using the Spark ML library.

  • Evaluating Machine Learning Models

    Learn how to evaluate supervised learning models, and how to optimize model parameters.

  • Recommenders and Unsupervised Models

    Learn how to build recommenders and clustering models using Spark ML.

Microsoft Professional Capstone: Big Data

The capstone project is offered directly by Microsoft and runs once per quarter: in January, April, July, and October.

Enroll for the full MPP track here in the month before the one in which the capstone starts, using the Microsoft account you used to register on Azure Academy, so that your progress is synced.

To have your progress synced with Azure Academy and to be eligible for the capstone project, you must have a Certificate of Completion for each of the 9 required courses from Azure Academy.

Need help?

If you have questions about our courses, check our FAQs or get in touch with us here.