Determine purpose and scope of big data analysis


This is the first phase of any data analysis project. In this topic we will look at

  • how organisational requirements can be determined
  • how to identify internal and external big data sources to be analysed 
  • how to establish and confirm parameters to be applied in the analysis.

In this topic, you will also be introduced to some of the key foundation skills you will need to have as a data analyst to determine the purpose and scope of a big data analysis project successfully.

Determine organisational requirements

To determine organisational requirements, analysts must first identify the purpose and scope of a big data analysis project. This involves:

  • learning the business domain
  • assessing the resources available to support the project
  • framing the problem.

Learn the business domain

Analysts need to learn key aspects of business operations (including industry terminology and jargon) to:

  • understand the data generated from the business
  • determine how much of this domain knowledge needs to be implemented in the data models to be created during the analysis process.

For example, if the business is in the transport industry, you need to be familiar with terminology and business processes specific to that industry, such as mileage, logistics and delivery times.

Foundation skill: Learning

Analysts must have the ability to learn new industry-related information: each data analysis project differs greatly depending on the industry and the type of business operations involved. Analysts therefore need learning skills that enable them to modify their behaviour following exposure to new information in each project.

Assess resources available to support the project

Analysts need to assess the resources available to support the project. This includes evaluating the technology, tools, systems, data and people available for the project. To obtain this knowledge analysts need to:

  • access various sources of information made available to them by the organisation (e.g. various types of documentation, policies, procedures, legislative requirements and guidelines)
  • participate in kick-off meetings to gain key information that sets out the purpose of the project
  • identify and collaborate with key stakeholders of the project (e.g. CDO, supervisors, department managers, other project team members etc.)
  • take an inventory of the resources required for the project to identify any gaps.

Foundation skill: Reading


Analysts must have the ability to read, understand and interpret information from various sources: each data analysis project is based on different information that relates to the business. This information is obtained from various sources such as policies, procedures, legislation, instructions, specifications and guidelines. Analysts therefore need reading skills that enable them to work through a variety of information sources and interpret the information accurately to complete their work.

Frame the problem

The fundamental purpose of analytics is to help managers solve problems and make decisions.

Framing the problem is critical to the success of the project. This step helps to identify the scope and purpose of the project. When framing the problem, analysts are required to:

  • recognise the specific problems that need to be solved in the analysis
  • identify the main objectives of the project
    • identify what needs to be achieved in business terms
    • identify what needs to be done to meet those needs
  • clearly define the problem – this requires analysts to write down the problem statement and key requirements of the project in an appropriate organisational document and share it with key stakeholders
  • consider the objectives and the success criteria for the project:
    • What is the team attempting to achieve by doing the analysis?
    • What will be considered “good enough” as an outcome of the analysis?
    • These answers are critical to document and share with the project team and key stakeholders.

Analysts are required to clearly document the scope of the project so that it can be referred to at different stages of the analysis to ensure all analysis tasks are performed within scope.

Foundation skill: Writing

Analysts must have the ability to use clear, specific and industry-related terminology to document the problem statement and requirements of the project: the entire analysis is based on the problem statement and project requirements documented at this stage. Analysts therefore need writing skills that enable them to record industry-specific information clearly.

Knowledge check

Complete the following five (5) tasks. Click the arrows to navigate between the tasks.

Identify internal and external sources of big data

A diagram depicting examples of internal and external data sources

As part of the first phase of a data analysis project, the analyst team need to identify the kinds of data they will need to solve the business problems. The team will need access to raw data or datasets from various sources to be able to conduct data analysis tasks.

Internal sources

Internal sources of big data are those generated, owned and controlled by your organisation. Typically, internal data sources are easier to collect and are more relevant to an organisation.

Some examples of internal data sources include:

  • Transactional data and point of sale (POS) information
    • which includes both current and historical data relating to the organisation’s own purchases and its customers’ purchasing trends.
  • Customer relationship management system (CRM)
    • this holds details of customers, which can be further analysed to understand where customers are located geographically and their company affiliations.
  • Internal documents
    • Today, internal documents are mostly in digitised form (PDFs, Word Documents, Spreadsheets etc.), and they provide a wealth of information about business activities, policies and procedures.
  • Archives
    • Historical data can contain a wealth of information on how the business used to operate and help further investigate patterns or to resolve issues.
  • Device sensors
    • Any Internet of Things (IoT) device used within the organisation can generate data on network connectivity as it relates to the business. For example, a transport company can use sensor data received from its vehicles to source valuable information such as mileage, fuel consumption and travel routes.

External sources

External sources of big data are those generated outside of your organisation.

Data obtained from external sources may include:

  • public data (e.g. publicly available data for machine learning, official statistics, forecasts)
  • unstructured data (e.g. social media posts)
  • other types of data gathered by third-party organisations (e.g. peak body data, government data, data collected from Google).

Your organisation may obtain permission to use these sources but does not own or control them.

Consider legislative requirements

It is important that analysts consider legislative requirements when obtaining and using datasets from internal as well as external sources. Analysts must ensure that the data they have received is legitimate and that it complies with data protection and privacy laws as they relate to specific industries, business types, individuals, countries, regions, states and territories.

Refer to the information we’ve already covered under topic 2.1 Legislative requirements for analysing big data that outlines examples of specific legislation that apply when analysing big data.

Consider policies and procedures

It is important that analysts consider the organisation’s specific policies and procedures when obtaining data from internal sources.

Legislative requirements as part of company policy

An organisation’s policy document may outline specific data protection and privacy legislation that need to be complied with. Consider the following sample policy from ACE Finance that specifically states the legislative requirement for obtaining internal data.

Obtaining data from internal sources

If data is obtained from the organisation’s internal systems for analysis purposes, the following must be considered:

  • According to ACE Finance’s data protection and privacy policy, all records related to customers should be de-identified.
  • Except for the ‘Customer ID’ and ‘Segment’ fields, all other identifiable customer data (such as name, phone, email, address etc.) should not be loaded onto the analytics platform, nor recorded in any form of documentation prepared throughout the analysis process. However, the unique customer identifier (‘Customer ID’) and its association with the sales records, along with any broad categorisations (e.g. ‘Segment’), should be retained for analysis.

Obtaining data from external sources

When obtaining data from external sources for analysis purposes, the following must be considered.

According to ACE Finance’s data protection and privacy policy and the Australian Privacy Principles:

  • any external data that contains personal information of individuals should be de-identified
  • check that any third-party service providers have good privacy practices that ensure the accuracy of the information they provide
  • in situations where personal information needs to be officially collected:
    • carefully consider whether uses and disclosures of personal information in the third-party data are compatible with the original purpose of collection
    • check that third parties from which personal information is collected have implemented appropriate practices, procedures and systems to ensure the quality of personal information.

Procedures and guidelines

Organisational procedure documents include specific guidelines on:

  • how to access internal data
  • specific details of how data flows from internal systems into databases (repositories) from which data can be obtained
  • guidelines on which specific external data sources are recommended for use for analysing data.

Additionally, refer to the information we’ve already covered under topic 2.9 Organisational policies and procedures that outlines examples of specific policies and procedures that apply when analysing big data.

Knowledge check

Complete the following two (2) tasks. Click the arrows to navigate between the tasks.

Identify the structure of datasets

The data obtained from internal and/or external sources must be evaluated to identify its structure. This will help the analyst team to:

  • identify the categories of data types and formats in the available datasets
  • identify the amount of data they need for the analysis
  • identify any gaps in the data required for the analysis
  • identify the type of tools, statistical models and visualisations required for the analysis.

Taxonomy of data types

Before identifying the structure of datasets, it is important to understand the taxonomy of data types. This involves understanding the difference between numeric and categorical variables as well as understanding the levels of measurements.

Formats of structured data

There are two basic types of structured data: numeric and categorical.

These two types can be further identified as follows.

A diagram depicting Formats of structured data

Time series data – a collection of observations of a variable made at different points in time, usually at fixed intervals such as daily, weekly, monthly or annually.

Levels of measurement

It is also important to understand the levels of measurement. Data can be qualitative or quantitative, and is further characterised by four levels:

  • Nominal
  • Ordinal
  • Interval
  • Ratio

Qualitative data can be nominal or ordinal, and quantitative data can be interval or ratio.
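
For example, each level can be illustrated as follows (an illustrative mapping, not tied to a specific dataset):

  • Nominal – a ‘Region’ field (East, North, South, West): categories with no inherent order
  • Ordinal – a satisfaction rating (low, medium, high): ordered categories without fixed distances between them
  • Interval – temperature in °C or calendar dates: ordered values with equal intervals but no true zero
  • Ratio – weight in kg or a sales amount: equal intervals with a true zero, so ratios are meaningful.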

To learn more about the levels of measurement, watch the following video.

Data dictionary

To help analysts better understand the datasets, it is recommended to create and maintain a document called a ‘data dictionary’. This will help everyone in the analyst team better understand the data types and formats of each field in the dataset.

It is also important to maintain and update the data dictionary during and after the analysis as it provides a good source of reference for what type of data is used in the analysis project.

A data dictionary would typically include information such as the following.

  • Field/column name – a list of the field/column names of the entire dataset.
  • Reference table(s) – there can be one or more reference tables in a dataset, so it is important to indicate the name of the relevant reference table for each field/column.
  • Data type – the data type for each field/column, identified by its category and/or level of measurement (e.g. numerical, categorical, nominal, ordinal, ratio, interval).
  • Format – the format for each field/column, indicated by its appropriate type (e.g. text, date, whole number, currency, decimal number).
  • Description – a brief description of what is presented in each data field in the dataset.

Following is an example of a data dictionary.

| Field/column name | Reference table(s) | Data type | Format | Description |
| Employee ID | Employee | Numerical | Whole Number | A unique identifier for each employee. |
| Employee Name | Employee | Categorical | Text | The first and last names of employees. |
| Date Joined | Employee | Interval | Date | The date the employee joined the organisation. |

Watch the following video to understand the importance of creating a data dictionary.

Calculating the shape and size of data

When calculating the shape and size of data, it is important to first identify the number of ‘variables’ and ‘observations’ in a dataset. The following example shows data with 3 variables and 15 observations.

| Weight (kg) | Length (cm) | Region |
| 290 | 30 | East |
| 296 | 35 | East |
| 299 | 34 | East |
| 300 | 34 | East |
| 305 | 38 | East |
| 307 | 40 | North |
| 311 | 46 | North |
| 315 | 45 | North |
| 325 | 49 | North |
| 339 | 48 | North |
| 340 | 55 | South |
| 355 | 58 | South |
| 357 | 55 | West |
| 359 | 57 | West |
| 361 | 59 | West |

Each column (Weight, Length, Region) is a variable, and each row is an observation.
Adapted from What is an Observation in Statistics? by Statology 2020

Excel formulas and functions can be used to easily calculate the required values (including the number of variables and observations) that provide a basic idea of the size, shape and contents of the dataset.

Excel formulas

One of the main functions of Excel is to perform calculations, and the most basic approach is to enter formulas manually. This is done by selecting the cell where you want the result and writing a formula. Formulas start with an equals sign followed by a mathematical expression. Refer to the article Overview of formulas in Excel for some simple examples of how to create Excel formulas.
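
For instance, here is a minimal sketch of manually entered formulas (assuming numeric values in cells A2 and B2):

  =A2+B2          adds the values in cells A2 and B2
  =A2-B2          subtracts B2 from A2
  =(A2-B2)/B2     expresses the difference as a proportion of B2
  =A2*10%         multiplies the value in A2 by 10%

Pressing Enter evaluates the formula; the result is displayed in the cell while the formula itself remains visible in the formula bar.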

Watch the following video to understand how to use Excel formulas.

Excel functions

Excel has a library of functions that can be applied to cells, columns and rows of data. Functions can also apply across sheets. There are hundreds of functions, and to make them easier to utilise, they are grouped into categories, for example, financial, logical or mathematical. The functions apply to individual cells or groups of cells.
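
As a brief illustration (assuming a column of numbers in cells A2:A16), some commonly used functions include:

  =SUM(A2:A16)        adds all values in the range
  =AVERAGE(A2:A16)    returns the mean of the range
  =MIN(A2:A16)        returns the smallest value
  =MAX(A2:A16)        returns the largest value
  =IF(A2>300,"High","Low")   a logical function that returns one of two labels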

The following video shows some of the functions available in Excel.

Using formulas and functions to determine the shape of a dataset

Determining the shape of a dataset involves identifying the following regarding the dataset’s fact table:

  • the exact number of observations/records
  • the number of fields/columns
  • the total number of cells
  • the number of missing/blank cells.

Excel formulas and functions can be used to determine each of these very easily. Explore how each of the functions listed below is used in Excel (a worked example follows the list) and answer the following question.

  • COUNT
  • COUNTA
  • COUNTBLANK
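
As a worked sketch, these functions can be applied to the 15-observation example shown earlier (assuming the data sits in cells A1:C16, with the header row in row 1):

  =COUNTA(A1:C1)        counts the header cells, giving the number of fields/columns (3)
  =COUNT(A2:A16)        counts the numeric cells in one column, giving the number of observations (15)
  =ROWS(A2:C16)*COLUMNS(A2:C16)   gives the total number of cells in the data range (45)
  =COUNTBLANK(A2:C16)   counts empty cells, revealing missing values (0 in this example)
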
Knowledge check

Establish parameters for the analysis

To establish the parameters for an analysis, one needs to:

  • understand the organisational requirements or the business problem that needs to be solved by the analysis. This will set out the scope and limitations for the analysis
  • explore the dataset to find out information about the observations and variables in the dataset to understand the type of data in the dataset
  • determine which variables are the most appropriate for the analysis considering the business requirement and the type of data available in the dataset.

Data exploration

Once the data analysis requirements are gathered and the scope has been determined during project kick-off, the analysts should conduct a preliminary examination of the datasets by performing data exploration tasks.

Organisations may have their recommended way of documenting the results of the data exploration activities. As part of data exploration analysts need to identify the parameters required for the analysis and then summarise these parameters using appropriate visualisations for better understanding of the dataset.

What is a parameter?

Parameters refer to any variables or metrics that are required to conduct the analysis of any given dataset. Parameters should be defined based on the business requirement for the analysis.

  • Variables – include values required for filtering and categorising data in the analysis (e.g. customer segments, product categories).
  • Key measures – include details of the calculated measures required to obtain the numerical data for the analysis (e.g. total sales, average income, profit margin).

Note: Specific options available when determining essential variables (e.g. ‘Data Type’, ‘Format’ and ‘Data Category’ settings) and key measures (query statements, functions and formulas) can vary based on the analytics platform or tools used for the analysis. Since you will be using Microsoft Power BI Desktop as the platform for conducting big data analysis in this module, the format specifications will relate to what is available in Microsoft Power BI.

Determining essential variables

Analysts must ensure that the variables selected are appropriate to meet the business requirements of the analysis and are within the project scope.

The identified variables must then be documented using the organisation’s preferred method (e.g. document template).

When documenting the details of variables, the following need to be recorded.

  • Variable name – may correspond to a field name in the dataset or a custom variable name.
  • Existing field or new field? – identifies whether the field chosen as the variable already exists in the dataset or is a new field that needs to be created.
  • Data type and format – should be specified using the types available in the Power BI platform. For the correct data type and format specifications, refer to Data types in Power BI Desktop - Power BI | Microsoft Docs.
  • Data category – should be specified using the categories available in the Power BI platform. For the correct data category specifications, refer to Data categorization in Power BI Desktop - Power BI | Microsoft Docs.
  • Sort by column – define the sort order of the variable by choosing an appropriate column name that will provide the desired result. For more information on sorting one column by another, refer to Sort one column by another column in Power BI - Power BI | Microsoft Docs.

Here’s an example of how you can identify essential variables for a given data analysis requirement.

Organisational requirement for analysis: ACE Finance wants to identify the number of ‘Total Customers’ that ACE Finance had served over the past few years by ‘Month-Year’ per ‘State’.

Method: In the statement that describes the organisational requirement for analysis, pay close attention to the quoted words. These will give you a clue as to what the essential variables required for the analysis are.

Here’s the basic thought process of how analysts should break down the requirement to identify the essential variables.

  • To identify total customers, analysts can use the ‘Customer ID’ field and get a count of all the unique values. Therefore, ‘Customer ID’ can be identified as one of the essential variables.
  • To categorise the total customer data by ‘Month-Year’, a new field will need to be created if not already available in the dataset. However, a ‘Month-Year’ field (Jan-2022, Feb-2022, Mar-2022) will only be sorted in alphabetical order (that is, Feb-2022, Jan-2022, Mar-2022), which is not ideal for the analysis. Therefore, it is important to create a numerical field made up of the ‘Year’ and the ‘Month’ number (e.g. 202201, 202202, 202203) that can be used to sort the ‘Month-Year’ field.
  • The ‘State’ field is also required to further categorise the total customer data. There is a special data category in Power BI called ‘State or Province’ that can be specified as the ‘Data Category’ for this field, and it can be sorted by the ‘Postal Code’ column, which we assume exists in the dataset.

The following table lists some examples of how to document essential variables required for an analysis.

| Variable name | Existing field or New field? | Data Type and Format | Data Category | Sort by column |
| Customer ID | Existing field | Numerical | Uncategorised | Customer ID |
| Year Month Number | New field | Whole Number | Uncategorised | Date |
| Month-Year | New field | Date, mmm-yyyy | Uncategorised | Year Month Number |
| State | Existing field | Text | State or Province | Postal Code |

Determining key measures

Analysts must determine the key measures required and ensure that these measures align with the business requirement for the analysis.

The identified key measures must then be documented using the organisation’s preferred method (e.g. document template).

When documenting the details of key measures, the following needs to be recorded.

  • Measure name – indicate the name of the calculated measure.
  • DAX formula – indicate the DAX formula to be used to calculate the measure.
  • Format – indicate the format in which the calculated value should be displayed (e.g. whole number, currency).

Note: For more information on DAX measures, formula and format specifications refer to Measures in Power BI Desktop - Power BI | Microsoft Docs.

Here’s an example of how you can identify key measures for a given data analysis requirement.

Organisational requirement for analysis: ACE Finance wants to identify the number of ‘Total Customers’ that ACE Finance had served over the past few years by ‘Month-Year’ per ‘State’.

Method: In the statement that describes the organisational requirement for analysis, pay close attention to the quoted words. These will give you a clue as to what the key measures required for the analysis are.

Here’s the basic thought process of how analysts should break down the requirement to identify the key measures.

  • To identify total customers, analysts can count the distinct values in the ‘Customer ID’ field.
  • To create the new ‘Month-Year’ field, a new calculated column can be created using the ‘Month’ and ‘Year’ fields. We assume here that the dataset includes a ‘Date’ table containing these fields.
  • To create the new ‘Year Month Number’ field required for sorting the ‘Month-Year’ field, another calculated column can be created using a mathematical formula on the ‘Date’ table’s ‘Year’ and ‘Month Number’ fields, as shown in the following table.

The following table lists some examples of how to document the key measures required for an analysis.

| Measure Name | DAX Formula | Format |
| Total Customers | DISTINCTCOUNT('Customer'[Customer ID]) | Whole Number |
| Month-Year | 'Date'[Month] & "-" & 'Date'[Year] | Text |
| Year Month Number | 'Date'[Year]*100 + 'Date'[Month Number] | Whole Number |
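
As a minimal sketch of how these could be defined in Power BI Desktop (assuming the model contains the ‘Customer’ and ‘Date’ tables referenced above): ‘Total Customers’ is entered as a measure, while ‘Month-Year’ and ‘Year Month Number’ are entered as calculated columns so they can be used for categorising and sorting.

  -- Measure:
  Total Customers = DISTINCTCOUNT('Customer'[Customer ID])

  -- Calculated columns on the 'Date' table:
  Month-Year = 'Date'[Month] & "-" & 'Date'[Year]
  Year Month Number = 'Date'[Year] * 100 + 'Date'[Month Number]

Setting ‘Year Month Number’ as the ‘Sort by column’ for ‘Month-Year’ then ensures the months display in chronological rather than alphabetical order.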

Summarise and visualise established parameters

Visualisation methods used during data exploration help to produce graphs and charts, which can be:

  • 1-dimensional
  • 2-dimensional
  • 3-dimensional

Let’s further explore how PivotTables, PivotCharts and slicers are used for data exploration.

PivotTables

A PivotTable is a summary of selected data. It does not change any data; it pivots the view to show statistical insights. The PivotTable tool in Excel creates a data summary of selected columns and rows. Its speed and ease of use make this tool a popular starting point when analysing datasets.

Watch the following video to understand how to use PivotTables in Excel to explore categorical data in a dataset.

PivotCharts

PivotCharts are helpful to visualise data summarised in a PivotTable.

Watch this video from Microsoft to learn how to create a PivotChart to visualise the data in a PivotTable.

Using slicers to filter data

Excel has other simple-to-use features to assist data analysis. Report filters can be used to exclude certain data, and slicers are a simple filtering tool that performs a similar function.

PivotCharts give a visualisation of the selected data. The chart updates automatically as filters are applied, providing analysts with insights into trends and relationships.

Watch this video from Microsoft to learn how to use slicers to filter data in a PivotTable.

Complete the following practical activity

Download the sample dataset.

Organisational requirement for the analysis: ACE Finance wants to identify the number of ‘Total Invoices’ processed and ‘Total Sales’ (calculated per invoice), by the number of ‘Total Customers’ in each ‘Country’ per ‘Month-Year’.

Hint: Before calculating the ‘Total Sales’, consider creating a new calculated column called ‘Sales’ to calculate the sales amount per product.

Q1: Use the following table format to document the essential variables required for the analysis.

| Variable name | Existing field or New field? | Data Type and Format | Data Category | Sort by column |
|  |  |  |  |  |
|  |  |  |  |  |
|  |  |  |  |  |
|  |  |  |  |  |

Q2: Use the following table format to document the key measures required for the analysis.

| Measure Name | DAX Formula | Format |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |

Q3: Create a PivotTable to summarise the information from the established parameters.

Q4: Create a PivotChart to visualise the summarised data in the PivotTable.

Q5: Use slicers to filter data in the PivotChart visual.

Check your answers.

The answer tables for Question 1 and 2 are as follows.

Q1: Answer

| Variable name | Existing field or New field? | Data Type and Format | Data Category | Sort by column |
| Country | Existing field | Text | Country | Country |
| Sales | New field | Decimal Number, Currency | Uncategorised | Sales |
| Year Month Number | New field | Whole Number | Uncategorised | Date |
| Month-Year | Existing field | Text, mmm-yyyy | Uncategorised | Year Month Number |

Q2: Answer

| Measure Name | DAX Formula | Format |
| Total Invoices | DISTINCTCOUNT('Online Retail'[InvoiceNo]) | Whole Number |
| Sales | 'Online Retail'[UnitPrice] * 'Online Retail'[Quantity] | Currency, Decimal Number |
| Total Sales | SUM('Online Retail'[Sales]) | Currency, Decimal Number |
| Total Customers | DISTINCTCOUNT('Online Retail'[CustomerID]) | Whole Number |
| Year Month Number | 'Date'[Year]*100 + 'Date'[Month Number] | Whole Number |
Confirm parameters for the analysis

Why do we need to confirm parameters for the analysis?

The parameters established in the previous steps (including variables and key measures) have the potential to be incorrect or incomplete due to various factors. For example, there may be:

  • misinterpretations of the business requirements
  • changes in the business requirements
  • new additions to the previously stated requirements
  • new systems or tools used for the analysis
  • changes in the datasets received
  • new datasets to be used for the analysis that were previously not available.

The purpose of confirming the parameters established by the analysts is to ensure that their interpretation of the requirement is accurate and complete before moving forward with the analysis tasks.

How to confirm parameters?

Once the parameters have been established and data validation performed, this information needs to be confirmed with appropriate personnel associated with the analysis project, such as:

  • Project supervisor
  • Subject matter experts involved in the project
  • Relevant department managers
  • Other stakeholders, if involved in the project.

Confirmation needs to be formally requested, and evidence of this confirmation should be retained as a valuable reference throughout the analysis project. Some examples of how formal confirmation of established parameters can be obtained are as follows.

  • Face-to-face discussions and meetings – meeting minutes can be recorded and kept as supporting evidence of the confirmation and feedback received from the stakeholders.
  • Email requests and responses – email responses from stakeholders can be kept as formal records and supporting evidence of confirming parameters for the analysis.

Regardless of the communication method used, it is important to:

  • use effective and clear language to communicate your ideas
  • ask the right questions (e.g. open questions)
    • to seek advice
    • to request confirmation of the established parameters
  • provide evidence of the established parameters – this can be in the form of:
    • Word documents
    • Excel spreadsheets with numerical information and visualisations.

For example, let’s investigate the standard protocols that should be followed when consulting your supervisor via email.

Email etiquette

When writing an email, ensure that you:

  • address the email to the correct recipient(s)
  • use the standard email template as recommended by your organisation
  • include the following main elements within the structure of the email:
    • Subject-line
    • Openers (greeting)
    • Body
    • Closings (sign-off)
  • include all necessary information regarding the established parameters
  • check for any errors in grammar, spelling and punctuation

Refer to How to Write a Proper Email: Make the Right Impression | Grammarly to gain more insights and tips on using proper email etiquette.

Watch the following video to learn the recommended protocols that you need to consider when writing a formal email.

Your role: You are working as a Data Analyst at ACE Finance.

Task: Draft an email to your supervisor ‘Andrew Walsh’, requesting confirmation for the parameters you’ve established for the data analysis project.

Note: Assume that you are including an attachment of the document you’ve created to record the details of variables, key measures, PivotTables and PivotCharts with this email. Ensure that you mention information regarding this attachment in the body of the email for your supervisor’s attention.

Topic summary

Congratulations on completing your learning for this topic, ‘Determine purpose and scope of big data analysis’.

In this topic you have learnt the following fundamental concepts behind analysing big data.

  • Determine organisational requirements for big data analysis
  • Identify internal and external sources of big data to be analysed
  • Legislative requirements relating to identifying internal and external sources of big data
  • Organisational policies and procedures relating to identifying internal and external sources of big data
  • Common tools to analyse big data, including features and functions of Excel software
  • Establish and confirm parameters to be applied in analysis.

Check your learning

The final activity for this topic is a set of questions that will help you prepare for your formal assessment.

End of topic quiz

Knowledge check

Complete the following five (5) tasks. Click the arrows to navigate between the tasks.

Assessments

Now that you have completed the basic knowledge required for this module, you are ready to complete the following two assessment events:

  • Assessment 2 (Short Answer Questions)
  • Assessment 3 (Project)

What’s next?

Next, we will dive into how data analysis activities are carried out and how initial trends and relationships in captured big data are analysed.
