Considering any legislative requirements should be the starting point for any big data project. This helps to specify certain boundaries for the project and identify any limitations for using specific types of data.
Data protection and privacy concerns
Following are some of the issues that need to be considered by businesses when working with big data. 4
Expand each item below to see more details.
Regulatory requirements dictate that personal data must be processed for specified and lawful purposes and that the processing must be adequate, relevant and not excessive.
When dealing with confidential and sensitive information included in the big data being tested, businesses need to be transparent on how the information is handled and protected and demonstrate that confidentiality is maintained at all stages.
Identifying personal information
Watch the following video to understand how the Australian Public Service (APS) is committed to protecting the personal information of Australians. Pay close attention to how you can identify ‘personal information’ and how this information should be handled.
After watching the video, answer the following questions.
When carrying out data analytics activities in Australia, there are various data protection and privacy laws and regulations that businesses need to comply with when testing big data sources.
This is often done by: 5
- using a ‘privacy-by-design’ approach – this involves embedding privacy into the project, instead of making it an afterthought
- considering social responsibility in data analytics projects – this involves having an ethics-based approach to ensure that the processing of personal information as part of an organisation’s data analytics is carried out in a fair, transparent, responsible and ethical manner
- conducting Privacy Impact Assessments (PIAs) – for the purpose of identifying risks and making appropriate recommendations.
Watch the following video to understand why Privacy Impact Assessments should be carried out and when they should be carried out.
Refer to the APS Big Data Strategy document published by the Office of the Australian Information Commissioner (OAIC) which outlines how to build privacy into data analytic activities. 6
Complete the following task based on the information from the APS Big Data Strategy document.
Data protection, privacy laws and regulations
Privacy Act in Australia
The Privacy Act 1988 makes provisions to protect the privacy of individuals in Australia. Therefore, businesses that deal with the personal information of individuals in Australia must ensure that they comply with this legislation.
Watch the following video to understand what ‘personal information’ is and how it must be protected according to the Privacy Act in Australia.
After watching the video, answer the following question.
Australian Privacy Principles (APPs)
Under the Privacy Act 1988 the Australian Privacy Principles (APPs) provide guidelines for the collection, management and use of personal information.
Refer to the Guide to data analytics and the Australian Privacy Principles published by the Office of the Australian Information Commissioner (OAIC) for more detailed information about how the APPs specifically apply to data analytics activities, especially ‘data integration’ as it relates to the topic covered in this module.
Data integration refers to the process of bringing together multiple datasets to provide a new dataset. 5
Watch the following video to understand more about the APPs. Pay attention to those specific APPs that are relevant to data integration activities (e.g. testing big data collected from various sources).
General Data Protection Regulation (GDPR)
The European Union General Data Protection Regulation (the GDPR) contains data protection requirements that came into effect on 25 May 2018. According to the GDPR:
Australian businesses of any size may need to comply if they have an establishment in the EU, if they offer goods and services in the EU, or if they monitor the behaviour of individuals in the EU. 7
For more information about GDPR, refer to the GDPR Australia How to Comply with General Data Protection Regulation.
Watch the following video to understand what GDPR is and how it impacts privacy and data protection requirements for businesses that operate in Australia.
Privacy laws and regulations for specific industries
In addition to the Privacy Act, the APS Big Data Strategy document published by the Office of the Australian Information Commissioner (OAIC) outlines a number of other legislative controls that apply to businesses that operate in various industries. 6
Some examples are given below.
- Health Records and Information Privacy Act 2002 (HRIP Act)
- Telecommunications Act 1997 (Part 13)
- Telecommunications (Interception and Access) Act 1979
Data protection and privacy laws of states and territories
Here are some examples of data protection and privacy legislation as it applies to different states and territories in Australia.
- Information Privacy Act 2009 (Qld)
- Public Sector (Data Sharing) Act 2016 (SA)
- Privacy and Data Protection Act 2014 (Vic)
- Personal Information Protection Act 2004 (Tas)
- Privacy and Personal Information Protection Act 1998 (NSW)
The final activity for this topic is a set of questions that will help you prepare for your formal assessment.
Big data sources can be of a variety of types. These data sources can be categorised based on the:
- format of the raw data
- features of the systems that generate the raw data
- where the data is generated from (e.g. either internally or externally)
- whether it has a time-dimension attached to it or not (e.g. either transactional or non-transactional).
Now let’s take a closer look at these categories of big data sources.
Formats of common big data sources
Big data sources generally fall into one of three main types:
- Structured
- Unstructured
- Semi-structured
Structured
Structured data refers to data that is identifiable and organised in a defined way. 6 Therefore, this type of data is easier to search and analyse, as its elements are addressable for effective analysis.
Structured data is usually stored in a database containing one or more tables. Tables contain one or more columns that specify what type of data should be stored in each column. The actual data records are then stored within the rows of a table.
To make things clear, consider the following terms and their definitions.10
- Database – a container that stores organised data.
- Table – a structured list of data of a specific type (e.g. Customer table, Order table, Suppliers table).
- Column – a single field in a table. All tables are made up of one or more columns (e.g. in a Customer table, the column names could be Customer ID, Customer Name, Phone and Email).
- Row – a record in a table.
- Datatype – the type of data allowed/restricted in a column (e.g. characters, numbers etc.).
- Schema – information about database and table layout and properties.
- Key column(s) – a column (or set of columns) whose values uniquely identify every row in a table.
Consider the following Customer table as an example of how structured data can be stored in a database.
Notice that:
- the table column headings (e.g. Customer ID, Customer Name, Phone, Email) indicate what type of information about customers is stored in the table
- the table rows contain records of specific customer details
- the table contains a key column (Customer ID) to uniquely identify each record in the table.
Customer ID | Customer Name | Phone | Email |
---|---|---|---|
2658 | Fiona Black | 0426 142 866 | fiona.black@mail.com |
1952 | Benjamin Low | 0496 100 200 | benlow@mail.com |
2587 | Grace Murray | 0423 654 258 | gmurray@mail.com |
9542 | John Kelly | 0438 962 800 | kellyjohn@mail.com |
The following image shows an example of a database schema or model.
Notice how the key columns are used to form relationships or links with other tables in the schema.
Some of the identifiable features of structured data are that:8
- it is organised into a well-designed format (e.g. tabular)
- it is generally stored in relational databases (e.g. SQL, MySQL, Oracle) and enterprise management systems (e.g. ERP, CRM)
- it is less flexible and less scalable as it is confined to its pre-defined structure
- it is easier to query and process this data to obtain meaningful results
- it allows for parallel processing of data and is often used to capture transactional data generated from multi-tasking systems such as point-of-sale terminals, retail systems etc.
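Because you will be writing DAX queries later in this module, the sketch below shows how a small structured table like the Customer example above could be declared inline as a DAX calculated table using the DATATABLE function. The table name, column names and the two sample rows simply mirror the example; in practice, structured data like this would live in a relational database rather than being typed in by hand.

-- A minimal, illustrative sketch only: part of the Customer table above, declared inline in DAX
Customer =
DATATABLE (
    "Customer ID", INTEGER,    -- key column
    "Customer Name", STRING,
    "Phone", STRING,
    "Email", STRING,
    {
        { 2658, "Fiona Black", "0426 142 866", "fiona.black@mail.com" },
        { 1952, "Benjamin Low", "0496 100 200", "benlow@mail.com" }
    }
)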
Unstructured
Unstructured data refers to data that has little identifiable structure 6 and cannot be contained in a row and column database.
Unstructured data can be textual (e.g. email body, log files, documents, social media posts) or non-textual (e.g. audio, video, images).
Some of the identifiable features of unstructured data are that: 8
- it does not have a pre-defined data format
- it is usually stored in a data lake*
- it does not have a specific format or structure
- it cannot be stored in traditional relational databases
- it does not contain any identifiable tags or keywords
- it is harder and more complex to query and process this data in order to obtain meaningful results
- it requires a specialised skillset to be able to analyse and model this type of data.
* A data lake is a centralised repository that allows general storage of all types of data from any source.
Semi-structured
Semi-structured data refers to data that does not conform to a formal structure based on standardised data models. 6
Semi-structured data often appears to be unstructured at first glance. Therefore, it is often hard to clearly differentiate between these two types. 9
However, some of the identifiable features of semi-structured data are that: 8
- it is not as fully formatted as structured data
- it is organised up to some extent (e.g. CSV, XML, JSON, HTML), but not at the level of structured data
- it is not stored using tabular formats or in traditional relational databases
- it is more flexible and scalable than structured data, but less flexible and scalable than unstructured data
- it is best managed in NoSQL databases
- it contains identifiable tags or keywords that help to query and process the data more easily than unstructured data, although not as easily as structured data.
The following images show examples of XML, JSON and CSV files. Notice the use of various tags and keywords used to provide some form of structure for the data contained in these files.
An example of an XML file
<customers>
  <customer>
    <customerid>2658</customerid>
    <customername>Fiona Black</customername>
    <phone>042 614 2866</phone>
    <email>fiona.black@mail.com</email>
  </customer>
  <customer>
    <customerid>1952</customerid>
    <customername>Benjamin Low</customername>
    <phone>049 610 0200</phone>
    <email>benlow@mail.com</email>
  </customer>
  <customer>
    <customerid>2587</customerid>
    <customername>Grace Murray</customername>
    <phone>042 365 4258</phone>
    <email>gmurray@mail.com</email>
  </customer>
  <customer>
    <customerid>9542</customerid>
    <customername>John Kelly</customername>
    <phone>043 896 2800</phone>
    <email>kellyjohn@mail.com</email>
  </customer>
</customers>
An example of a JSON file
{ "Customers": [ { "Customer_ID": "2658", "Customer_Name": "Fiona Black", "Phone": "042 614 2866", "Email": "fiona.black@mail.com" }, { "Customer_ID": "1952", "Customer_Name": "Benjamin Low", "Phone": "049 610 0200", "Email": "benlow@mail.com" }, { "Customer_ID": "2587", "Customer_Name": "Grace Murray", "Phone": "042 365 4258", "Email": "gmurray@mail.com" }, { "Customer_ID": "9542", "Customer_Name": "John Kelly", "Phone": "043 896 2800", "Email": "kellyjohn@mail.com" }, ] }
An example of a CSV file
Customer_ID, Customer_Name, Phone, Email
2658, "Fiona Black", 042 614 2866, fiona.black@mail.com
1952, "Benjamin Low", 049 610 0200, benlow@mail.com
2587, "Grace Murray", 042 365 4258, gmurray@mail.com
9542, "John Kelly", 043 896 2800, kellyjohn@mail.com
Watch the following video to understand more about big data types.
After watching the video, complete the following activities.
Features of common big data sources
Big data sources may be defined as follows:
- Batched – This is data collected in ‘batches’ over time, then fed into a system for testing
- Real-time – This is data fed into the system piece-by-piece, and tested in real-time
- Interactive – This allows analysts to manipulate the data during collection and testing.
Let us look into the features of each of these types of big data sources.
Expand each item below to see more details.
Data sourced from batch processing systems typically involves very large datasets. Often testers need to wait for hours or days until a certain amount of raw data is collected before processing that data. This type of data source is characterised as non-time-sensitive, as computations performed on the data require significant time to generate results.
The common procedure when working with batched sources is to schedule extract, transform and load (ETL) jobs to run at a set time or when data reaches a certain threshold.
Examples: Payroll data, billing data, customer orders, archived data.
This type of big data source typically deals with structured or semi-structured data formats.
Raw data from this source:
- includes smaller chunks of data that are changing/added to the system rapidly
- is processed immediately (within seconds) using simple calculations
- uses in-memory computing for fast processing
- typically deals with unstructured data formats.
Examples: Radar systems, Twitter feeds, Bank ATMs.
Raw data from this source:
- is used to find patterns in data
- provides the ability to work with more data
- is processed within minutes and with low latency
- can scale reliably when new data becomes available
- can be explored using various platforms that can run ad-hoc queries.
Internal and external big data sources
It is important that you are able to identify internal and external sources of big data for testing.
Expand each item below to see more details.
Internal sources of big data are those generated, owned and controlled by your organisation. Typically, internal data sources are easier to collect and are more relevant to an organisation. 13
Some examples of internal data sources include:
- Transactional data and point of sale (POS) information – which includes both current and historical data that relates to the organisation’s own purchases and details of its customers’ purchasing trends.
- Customer relationship management system (CRM) – this holds details of customers which can be further analysed to understand where customers are located geographically and their company affiliations.
- Internal documents – Today, internal documents are mostly in digitised form (PDFs, Word Documents, Spreadsheets etc.) and they provide a wealth of information about business activities, policies and procedures.
- Archives – Historical data can contain a wealth of information on how the business used to operate and can help to investigate patterns further or to resolve issues.
- Device sensors – Any Internet of Things (IoT) devices used within the organisation can generate data on network connectivity as it relates to the business. For example, a transport company can use sensor data received from its vehicles to source valuable information such as mileage, fuel consumption, travel route etc.
External sources of big data are those generated outside of your organisation.
Data obtained from external sources may include:
- public data (e.g. publicly available data for machine learning, official statistics, forecasts)
- unstructured data (e.g. social media posts)
- other types of data gathered by third-party organisations (e.g. Peak body data, Government data, data collected from Google)
Your organisation may obtain permission to use these sources, but does not own or control them.
Transactional and non-transactional big data sources
It is important that you are able to differentiate between transactional and non-transactional data sources.
Expand each item below to see more details.
Data sources that have a time dimension and become historical once the operation is complete.
Important facts about transactional data are as follows: 11
- They consist of records that relate to any type of financial transactions related to a business
- They are generated by various systems and applications such as point-of-sale servers, ATMs, payment gateways etc
- Each record contains a transaction ID that is accompanied by a series of items that are part of the transaction
- They lose their relevance over time and become historical once a transaction is completed. Therefore, processing and making sense of this type of data needs to be done quickly in order to maintain a competitive edge
- Used to understand peak transaction volumes, peak ingestion rates, and peak data arrival rates
- For example transactional data may contain:
- sales order data (e.g. revenue, cost, discounts, products purchased, quantity)
- financial data (e.g. purchases, sales, payments, claims)
- account transfer data (e.g. withdrawals, credits, debits)
- work related employee data (e.g. time-in, time-out, hours worked).
Refer to the article What is Transactional Data? | TIBCO Software which explains transactional data in more detail.
Data sources that are relevant to an organisation for a longer duration.
Example: customer details, product details.
Important facts about non-transactional data are as follows: 12
- Non-transactional data is relevant to an organisation over a longer period of time than transactional data.
- For example, non-transactional data may contain customer details (e.g. name, address, preferences, contact details), product details (e.g. product ID, product name, category), supplier details, account information etc.
The final activity for this topic is a set of questions that will help you prepare for your formal assessment.
To write queries and scripts for big data testing you will need to use some form of programming language. Most applications and big data platforms available today support multiple programming languages with in-built scripting editors to make writing scripts and queries easier, especially for those testers who do not have previous experience with programming languages.
It is also important to note that the type of programming language used in a big data testing scenario would also depend on the type, source and structure of the raw data that needs to be processed.
For example, consider the following scenarios and the choice of programming language:
- If the data is sourced from a traditional relational database containing transactional data that has a well-defined structure, the most basic and standard programming language you would need to use is the Structured Query Language (SQL).
- If the raw big data is semi-structured or unstructured, then a programming language such as Scala can be used. Scala is identified as one of the top open-source programming languages used for big data projects.
There are many other programming languages used for big data testing and analysis such as Python, R, Java etc.
However, for the purpose of this module, you will be introduced mainly to DAX (Data Analysis Expressions) and the Power Query Formula Language (informally known as ‘M code’), as you will be working with these languages to perform big data testing tasks using the Microsoft Power BI platform from the next topic onwards.
Power Query Formula Language (M code)
The first step when testing big data samples is to import data into a data model in Power BI. At this stage, the Query Editor in Power BI provides a powerful graphical interface that allows you to perform complex data modifications without having to look into the M code that the Query Editor is building behind the scenes.
Power Query is used to perform the extract, transform, and load (ETL) processing of data. This can be done using the graphical user interface in Power BI without having to worry about writing complex M code.
To see the M code behind a query in Power BI, we can use the Advanced Editor in the Query Editor.
In this course, we will not go into detail about how to write Power Query formulas as it can be easily done via the graphical user interface in Power BI, which you will get to experience in the next topic.
Most importantly you need to understand and learn the basic industry protocols and procedures to write DAX queries and scripts for big data testing. So let’s begin.
DAX rules and requirements
DAX formulas are similar to those used in Excel tables. However, there are some key differences.
DAX formulas are often written as calculations and come in three forms.15 A brief sketch of all three forms follows the items below.
Expand each item below to see more details.
Measures – used to perform aggregations (e.g. calculate totals, count distinct values etc.)
Calculated columns – used to add additional columns to tables in a dataset
Calculated tables – used to create a new table with the required columns.
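The following minimal sketch shows what each of the three forms could look like. The Orders and Customers tables and their Sales and State columns are assumed names used purely for illustration.

-- Measure: an aggregation evaluated over a whole column (or the current filter context)
Total Sales = SUM ( Orders[Sales] )

-- Calculated column: evaluated row by row within the Orders table
Sales Including GST = Orders[Sales] * 1.1

-- Calculated table: a new table derived from an existing one
NSW Customers = FILTER ( Customers, Customers[State] = "NSW" )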
Let’s understand the basic rules that you need to follow when writing DAX formulas and specific object naming requirements.
Basic rules
When writing DAX formulas for calculated columns or measures, it is important to follow these general rules or protocols.
- A DAX formula always starts with an equals (=) sign.
- After the equals sign comes the expression, which may include elements such as statements, functions, operators, values etc. It is important to note that expressions:
  - are always read from left to right
  - use parentheses to control the order in which the elements are grouped.
- A DAX function always references a complete column, measure or table. This is commonly known as the fully qualified name of that object. For example:
  - The fully qualified name of a column refers to the table name, followed by the column name in square brackets (e.g. 'AUS Order'[Products]).
  - The fully qualified name of a measure refers to the table name, followed by the measure name in square brackets (e.g. 'AUS Order'[Profit]).
Object naming requirements
- All object names are case-insensitive (e.g. the table names 'SALES' and 'Sales' would represent the same table).
  - Note that the term object here refers to tables, columns, measures etc.
- The names of objects must be unique within their current context. For example:
  - measure names must be unique within a model
  - column names must be unique within a table
  - table names must be unique within a database.
- Leading or trailing spaces that are enclosed by name delimiters, brackets or single apostrophes are valid in the names of tables, columns and measures.
- If the name of a table contains any of the following, you must enclose the table name in single quotation marks:
  - spaces (e.g. 'Order Detail')
  - reserved keywords (e.g. 'Analysis Services' cannot be used as a table name unless it is enclosed in quotation marks)
  - disallowed characters .,;':/\*|?&%$!+=()[]{}<>
  - characters outside the ANSI alphanumeric character range.
Take a look at the following examples of object names used within DAX formulas.
Example | Object type | Comments on the naming requirement |
---|---|---|
Customer | Table name | This table name does not contain any spaces or other special characters. Therefore, the name does not need to be enclosed in quotation marks. |
'AUS Orders' | Table name | This table name contains a space, therefore it needs to be enclosed in single quotation marks. |
'Discount%' | Table name | This table name contains a special character. Therefore it needs to be enclosed in single quotation marks. |
Orders[Cost] | Fully qualified column name | The table name precedes the column name and the column name is enclosed in square brackets. |
Orders[Profit] | Fully qualified measure name | The table name precedes the measure name and the measure name is enclosed in square brackets. |
[Cost] | Unqualified column name | This is just the column name in square brackets. It can be used in certain situations in formulas, such as when referencing a column from within the same table (for example, in a calculated column of that table). |
'AUS Order'[Cost] | Fully qualified column in a table with spaces | The table name contains spaces, so it must be enclosed in single quotation marks. |
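As a small illustration of qualified and unqualified references together, the following calculated column is assumed to be defined inside a hypothetical Orders table that has Sales and Cost columns:

-- Orders[Sales] is fully qualified; [Cost] is unqualified and resolves to the Cost column of the same Orders table
Margin = Orders[Sales] - [Cost]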
Operators
There are different types of operators used when writing DAX formulas. Some of the frequently used DAX operators are listed in the following table.16
Operator type | Metacharacter symbols | Meaning |
---|---|---|
Arithmetic operators | + | Addition |
Arithmetic operators | - | Subtraction |
Arithmetic operators | / | Division |
Arithmetic operators | * | Multiplication |
Comparison operators | = | Equal to |
Comparison operators | == | Strict equal to |
Comparison operators | > | Greater than |
Comparison operators | < | Less than |
Comparison operators | <= | Less than or equal to |
Comparison operators | >= | Greater than or equal to |
Comparison operators | <> | Not equal to |
Text concatenation operator | & | Connects two values to produce one continuous text value |
Logical operators | && | AND condition |
Logical operators | || | OR condition |
Parenthesis operator | () | Used for precedence order and grouping of arguments |
For more information on DAX operators, refer to the DAX operators - DAX | Microsoft Docs.
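To see several of these operators working together, here are two purely illustrative calculated columns; the Orders table and its Sales, Cost, Quantity and Product Name columns are assumed names:

-- Arithmetic, parentheses, comparison and logical AND operators combined
Order Category = IF ( ( Orders[Sales] - Orders[Cost] ) / Orders[Sales] >= 0.5 && Orders[Quantity] > 1, "High margin", "Standard" )

-- Text concatenation operator
Order Label = Orders[Product Name] & " x " & Orders[Quantity]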
Now that you are aware of the general rules for writing DAX formulas, let’s look into the structure and basic syntax of a DAX formula.
DAX formula syntax
Before you can write your own DAX functions you need to be familiar with the various elements that make up a DAX formula. 14
Following are the basic elements that make up a DAX measure.
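The measure being described here, reconstructed from the element descriptions below, is:

Total Costs = SUM ( Orders[CostAmount] )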
Simply, what this DAX statement does is create a measure named Total Costs that calculates (=) the SUM() of the values in the CostAmount column of the Orders table.
Let’s understand each of the elements in this DAX measure.
- The name of the new measure (e.g. Total Costs).
- The beginning of the formula, indicated by an equals sign operator (=).
- A DAX function (e.g. SUM()), used to add all values in a given table’s column (e.g. Orders[CostAmount]).
- Parentheses (), which surround an expression that contains one or more arguments. Most functions require at least one argument. An argument passes a value to a function.
- The name of the table that the formula applies to (e.g. Orders).
- The name of the column that the formula applies to (e.g. CostAmount).
Together, the calculated measure name (to the left of the equals sign) and the DAX formula (to the right) make up the complete measure definition.
Let us take a look at some examples of how the DAX formula syntax can be used to create a calculated column, measure and table. In each of these examples, notice how the syntax rules are followed: each definition consists of a DAX column, measure or table name (e.g. Profit, Total Profit, Customer) followed by the DAX formula itself, as sketched below.
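As a rough sketch only (the Orders table and its Sales, Cost and Customer ID columns are assumed names used purely for illustration), definitions using these three example names could look like this:

-- Calculated column named Profit (added to the assumed Orders table)
Profit = Orders[Sales] - Orders[Cost]

-- Measure named Total Profit
Total Profit = SUM ( Orders[Profit] )

-- Calculated table named Customer (a one-column table of unique customer IDs)
Customer = DISTINCT ( Orders[Customer ID] )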
DAX Functions
DAX functions help to perform commonly used data calculations on data models, databases, tables etc.
A variety of DAX functions are built into the DAX language and can be used for various purposes when writing DAX formulas. It is important to note that each DAX function has specific syntax rules that must be followed to ensure the script executes without any errors.
For more information on DAX functions and their specific syntax rules, refer to the DAX function reference - DAX | Microsoft Docs.
The following table explores and provides examples of some of the DAX functions, their purpose and how they can be used when writing queries.
Function type | DAX functions | What it does | Example |
---|---|---|---|
Filter functions | FILTER() | Used to get an output of a table that contains only the filtered rows. | The following example filters all customer records from the Customer table whose Location is "NSW". Sample expression: FILTER(Customer, Customer[Location] = "NSW") |
Aggregation functions | SUM() | Used to add all the numbers that are in a column. | The following example adds all the numbers in the Cost column of the Sales table. Sample expression: SUM(Sales[Cost]) |
Aggregation functions | DISTINCTCOUNT() | Used to return the number of distinct values in a given column. | The following example returns the number of distinct values in the Category column of the Products table. Sample expression: DISTINCTCOUNT(Products[Category]) |
Table manipulation functions | SELECTCOLUMNS() | Used to create a new table made up of selected (and calculated) columns from a given table. | The following example creates a new table called Orders from the 'Company Data' table. Sample expression: Orders = SELECTCOLUMNS('Company Data', "Order ID", [Order_ID], "Revenue", [Sales], "Cost", [Cost], "Profit", [Sales] - [Cost]) |
Table manipulation functions | EXCEPT() | Used to return the rows of one table minus all the rows of another table. | The following example returns a table (Table3) containing the rows of Table2 that do not appear in Table1. Sample expression: Table3 = EXCEPT(Table2, Table1) |
Table manipulation functions | DISTINCT(Table[Column]) | Used to return a one-column table that contains only the distinct values (no duplicates) from the specified column. | The following example returns a one-column table containing only the distinct values from the Category column of the Products table. Sample expression: DISTINCT(Products[Category]) |
Table manipulation functions | DISTINCT(Table) | Used to return only the unique rows of a table by removing duplicate rows. | The following example returns a new table that contains only the distinct rows from the Orders table. Sample expression: DISTINCT(Orders) |
DAX Statements
DAX statements are used to query information from structured data sources such as tables, databases and models. 17
There are different types of statements used for various purposes when writing DAX formulas. It is important to note that each statement type has specific syntax rules that must be followed to ensure the script executes without any errors.
For more information on DAX statements and their specific syntax rules, refer to the following documentation from Microsoft Statements(DAX) - DAX | Microsoft Docs.
Let’s explore a few examples of DAX statements.
Statement | Description | Example |
---|---|---|
EVALUATE() | This statement is required to execute a DAX query. | EVALUATE(Customer) |
ORDER BY | This is an optional statement that defines one or more expressions used to sort the results of a DAX query. Sorting can be specified either as ascending (ASC) order or descending (DESC) order. | ORDER BY Customer[Location] DESC |
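Putting the two statements together, a complete DAX query over the Customer table from the examples above would look like this:

EVALUATE
    Customer
ORDER BY Customer[Location] DESC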
Putting it all together
Watch the following video from Microsoft Power BI which demonstrates how the statements, functions, values, operators etc. you’ve learnt in this topic work together in DAX queries and the output it generates in Power BI Desktop.
Note: You will learn to use Power BI Desktop in the next topic. Therefore, your focus at this point should be on learning the syntax and the basic rules for using DAX scripts.
In the following series of examples and activities, you will see how statements, functions, values, operators etc. all work together in DAX queries.
For each question within the following activities, make an attempt to write DAX queries yourself before looking at the solutions and answers provided.
Consider the following 'Customers' table:
Customer ID | Firstname | Lastname | Country | State | Company | Customer Address |
---|---|---|---|---|---|---|
1010 | Roosevelt | Romero | US | TXS | Dropellet Intl | RRomero@Dropellet.com |
1011 | Melody | Cruz | US | MIN | Moderock Intl | MCruz@Moderock.com |
1012 | Francis | Moreno | US | MCH | Claster Intl | FMoreno@Claster.com |
1013 | Andy | Jordan | AUS | NSW | Tuttadit Pvt Ltd. | AJordan@Tuttadit.com |
1012 | Jeffry | Warren | AUS | QLD | Ybuwyn Pvt Ltd. | JWarren@Ybuwyn.com |
1015 | Cameron | Rhodes | AUS | NSW | Tribop Pvt Ltd | CRhodes@Tribop.com |
From the Customers table, you want to create the following new table called 'Customer Summary'.
Customer ID | Customer Name | Customer Address |
---|---|---|
1010 | Roosevelt Romero | RRomero@Dropellet.com |
1011 | Melody Cruz | MCruz@Moderock.com |
1012 | Francis Moreno | FMoreno@Claster.com |
1013 | Andy Jordan | AJordan@Tuttadit.com |
1012 | Jeffry Warren | JWarren@Ybuwyn.com |
1015 | Cameron Rhodes | CRhodes@Tribop.com |
- What DAX statement would you use to do this?
- Create a script that lists all customer records that belong to the country code "AUS". The script should also list the result in descending order of the Customer ID.
- What DAX function would you use to create the following table called 'US Customers', which lists only the customer records that belong to the country code "US"?

US Customers
Customer ID | Firstname | Lastname | Country | State | Company | Customer Address |
---|---|---|---|---|---|---|
1010 | Roosevelt | Romero | US | TXS | Dropellet Intl | RRomero@Dropellet.com |
1011 | Melody | Cruz | US | MIN | Moderock Intl | MCruz@Moderock.com |
1012 | Francis | Moreno | US | MCH | Claster Intl | FMoreno@Claster.com |

- What DAX function would you use to create another table called 'All except US Customers', which lists all customer records from the Customers table except for those records in the 'US Customers' table?
- Customer Summary =
  SELECTCOLUMNS(Customers,
      "Customer ID", [Customer ID],
      "Customer Name", [Firstname] & " " & [Lastname],
      "Customer Address", [Customer Address]
  )
- EVALUATE
  FILTER(Customers, Customers[Country] = "AUS")
  ORDER BY Customers[Customer ID] DESC
- US Customers = FILTER(Customers, Customers[Country] = "US")
- All Except US Customers = EXCEPT(Customers, 'US Customers')
Consider the following Orders table.
Order ID | Product ID | Product Name | Sales | Cost | Quantity |
---|---|---|---|---|---|
001 | 101002 | Colored Pencils | $ 3.50 | $ 1.00 | 5 |
001 | 101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 | 2 |
002 | 101001 | Crayons | $ 7.50 | $ 3.00 | 1 |
002 | 101003 | Dustless Chalk Sticks | $ 2.50 | $ 1.50 | 2 |
002 | 101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 | 1 |
003 | 101003 | Dustless Chalk Sticks | $ 2.50 | $ 1.50 | 10 |
003 | 101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 | 10 |
- Create a measure called Profit that adds up the Sales price for all orders.
- Create a measure called 'Distinct Products' that counts the number of different items in the Product ID column.
- What DAX statement would you use to create a new table called Products that includes only the Product ID, Product Name, Sales and Cost columns, as follows?

Products
Product ID | Product Name | Sales | Cost |
---|---|---|---|
101002 | Colored Pencils | $ 3.50 | $ 1.00 |
101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 |
101001 | Crayons | $ 7.50 | $ 3.00 |
101003 | Dustless Chalk Sticks | $ 2.50 | $ 1.50 |
101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 |
101003 | Dustless Chalk Sticks | $ 2.50 | $ 1.50 |
101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 |

- Create another DAX calculated table called 'Distinct Products' that only includes unique product records and no duplicates. This should result in a table output as follows.

Distinct Products
Product ID | Product Name | Sales | Cost |
---|---|---|---|
101001 | Crayons | $ 7.50 | $ 3.00 |
101002 | Colored Pencils | $ 3.50 | $ 1.00 |
101003 | Dustless Chalk Sticks | $ 2.50 | $ 1.50 |
101004 | Flourescent Highlighters | $ 5.00 | $ 2.50 |
- Profit = SUM(Orders[Sales])
- Distinct Products = DISTINCTCOUNT(Orders[Product ID])
- Products = SELECTCOLUMNS(Orders,
      "Product ID", [Product ID],
      "Product Name", [Product Name],
      "Sales", [Sales],
      "Cost", [Cost]
  )
- Distinct Products = DISTINCT(Products)
Additional resources
Refer to the following sources to learn more about DAX syntax and how it can be used.
- DAX Guide
- Data Analysis Expressions (DAX) Reference - DAX | Microsoft Docs
- DAX Resource Center - TechNet Articles - United States (English) - TechNet Wiki (microsoft.com)
Adding comments in DAX scripts
Adding comments will not affect the performance of your scripts/queries, but it will make it easier for others to understand the written code and make changes later on if needed.
The following examples show the different methods of commenting in DAX.
Example 1: In-line comments
Orders = SELECTCOLUMNS('Company Data', -- this is an inline comment
    "Customer ID", [Customer ID],
    "Customer Name", [Customer Name],
    "Customer Address", [Customer Address]
)
Method: Start your comment with two hyphens/dashes (--).
See example 1. The comment is embedded in-line using the hyphens, indicating that anything after the hyphens on that line is comment text.
Example 2: Single-line comments
Orders = SELECTCOLUMNS('Company Data',
    "Customer ID", [Customer ID], // this is a single-line comment
    "Customer Name", [Customer Name],
    "Customer Address", [Customer Address]
)
Method: Begin the comment with the // symbol.
See example 2. The // symbol indicates that everything after it on that line is comment text.
Example 3: Commenting-out multiple lines of code
Orders = SELECTCOLUMNS('Company Data',
    "Customer ID", [Customer ID]
    /* this is a multi-line comment
    ,"Customer Name", [Customer Name],
    "Customer Address", [Customer Address] */
)
Method: Place the code in between the /* and */ symbols.
See example 3. The two enclosed lines of the script won’t execute because they are commented out.
The final activity for this topic is a set of questions that will help you prepare for your formal assessment.
To be able to work with and analyse big data, the data needs to be:
- first, gathered from a variety of sources (i.e. extracted)
- second, organised together (i.e. transformed)
- third, loaded onto a single centralised repository (i.e. loaded).
These basic processes involved in testing and preparing big data are commonly known as extract, transform and load (ETL).
Note: Throughout this module, you will come across this term ETL often. Therefore, it is important that you understand what it means and why it is important in big data testing.
Watch the following video to understand the ETL process and its benefits.
After watching the video, answer the following questions.
Data validation requires conducting routine checks to ensure that data is correct, meaningful and free of errors.
The following protocols are used to support this data validation process.
- Testing methodologies
- Test scripting
Let’s try to understand each of these data validation protocols in detail.
Testing methodologies
The methodology of big data testing mainly involves three stages18:
- Data staging validation
- MapReduce validation
- Output validation
Data staging validation
This involves testing whether the data extracted from various sources are uploaded to the analytics platform correctly.
At this stage, you will also need to determine the need to test the complete set of data or a representative sample dataset.
Some of the validation checks performed in this stage include:
- verifying that the data is obtained from the correct source system (e.g. files, database, CRM etc.)
- verifying that the data obtained from the source is accurate and not corrupted in any way (e.g. by performing checksums, hashes, counts of records, format checking etc.)
- verifying that the data files required for testing are uploaded or imported into the analytical platform correctly, without any errors or missing data
- verifying that the source data synchronises with the data that is uploaded into the analytical platform.
These validation checks are commonly performed during extract, transform load (ETL) testing.
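As a simple illustration of this kind of check, the following DAX measure compares the number of records loaded into the analytical platform against a known record count from the source system. The 'Staged Orders' table name and the expected count are assumptions used purely for illustration; in practice the expected figure would come from the source extract.

Row Count Check =
VAR ExpectedRows = 10000    -- placeholder: the record count reported by the source system
VAR LoadedRows = COUNTROWS ( 'Staged Orders' )
RETURN
    IF ( LoadedRows = ExpectedRows, "PASS", "FAIL: " & LoadedRows & " rows loaded" )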
MapReduce (process) validation
This involves compressing the large data set into a more practical, aggregated and compact data sample.
Some of the validation checks performed in this stage include: 19
- verifying the processing of data and the business logic
- aligning the dataset to relevant parts of the organisation (e.g. source to target mapping)
- validating the MapReduce* process to ensure the correct generation of the “key-value” pairs
- validating the data after the MapReduce process
- validating the data aggregation and segregation rules implemented
- validating that any anomalies in the data are resolved
- validating the generated data output to ensure it is correct, valid and that all reporting requirements are met.
* MapReduce – Originally this term came from an algorithm called MapReduce, invented by Google, that divides a task into small parts and assigns them to many computers. The results are later collected in one place and integrated to form the result dataset. This process involves two steps: the Map task and the Reduce task. 20
- The Map task takes a set of data and converts it into another set of data, where the individual elements are broken down into tuples (key-value pairs)
- The Reduce task takes the output from the Map as an input and combines those data tuples into a smaller set of tuples.
Watch the following video to learn more about the MapReduce process.
Output validation
This involves moving the previously validated data out of the processing platform and loading it into the target repository system. It also ensures that the reports generated from the loaded data are displayed as per organisational requirements.
Some of the validation checks performed in this stage include:
- validating the dashboard report model to ensure correct visualisations are used to display the required data
- validating that the data transformation rules are applied correctly and results are as expected
- validating the integrity of the data to ensure it is loaded into the target system successfully
- validating the output against the specific source system data to ensure there is no data corruption
- validating the reports against the required data and ensuring that all indicators are displayed correctly.
Test scripting
How and when validation checks are performed
You have previously learnt that the process of validating data involves routine checks.
One approach that you can use to perform these checks at each stage of the testing process would be to: 21
- manually test the output against the source data to check it is correct
- manually validate that the key metrics are calculating as expected.
It is important to understand that each time you release a new version of the big data test solution, this needs to be validated. However, carrying out these validations manually would be a complex, time consuming and burdensome task.
Automated testing
A more practical and efficient approach for carrying out validation checks is to use automated testing (also known as test scripting)21.
Running test queries and scripts on the data at each stage to ensure data quality and correctness can help detect errors that otherwise would not be obvious in manual tests.
Once these test scripts are written they can be easily manipulated and reused to perform tests for other similar test scenarios.
These data validation scripts can be written using a variety of scripting languages, depending on what the tools and platforms in use support.
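For example, in the DAX environment used later in this module, a reusable validation query might look like the following sketch. The Customers table and its Customer ID column are assumed names; the query returns a single row of check results and can be re-run each time a new version of the solution is released.

EVALUATE
ROW (
    "Blank Customer IDs", COUNTROWS ( FILTER ( Customers, ISBLANK ( Customers[Customer ID] ) ) ),
    "Duplicate Customer IDs", COUNTROWS ( Customers ) - DISTINCTCOUNT ( Customers[Customer ID] )
)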
The final activity for this sub-topic will help you prepare for your formal assessment.
There are various protocols and techniques used when performance testing the output from big data sources.
Testing big data throughput
What is throughput?
Throughput refers to the number of transactions produced over time during a test.
Transactions, in this case, refer to the queries and scripts executed by the data analysis platform in the background to generate reports.
It is important to first set a throughput goal, i.e. a specific number of requests per hour that the application needs to be able to handle.
Testing throughput
The Power BI Performance Analyser helps to monitor the performance of the elements of a report, such as its visuals, DAX formulas and queries. Once the Performance Analyser has measured the processing time required to run queries and load visuals in the report, this information can be analysed to identify any issues and to further investigate any delays that affect data throughput.
Read through Microsoft’s guidelines on Power BI Performance Analyser to find out about the protocols and techniques involved when using this tool to do the following:
- Capture performance information
- Compute the durations
- Refresh visuals
- Refresh an individual visual’s performance data
- Save performance information
We will discuss more on how you can analyse the performance data that you save from the Performance Analyser tool later on in a different topic.
The final activity for this sub-topic will help you prepare for your formal assessment.
It is evident that during big data sample testing you will find a variety of issues in each big data validation phase.
All of these issues would then need to be processed and reported appropriately. There are various protocols and techniques that are followed and used in organisations in order to process and report identified issues.
Let’s explore some of these protocols and techniques.
Report issues related to raw big data
In situations where an issue is related to the raw big data sources, these need to be communicated directly with the client or other relevant stakeholders who are involved in the project.
In a situation where the testers have found major issues with the big data sample received for testing and it is not feasible for the testing process to progress, this needs to be communicated immediately to the client. Various communication protocols can be used in this scenario to inform the client, such as email or meetings (either face-to-face or online).
It is important to note that organisations have their own standard communication protocols and reporting methods as outlined in their policies and procedure documents (e.g. email formats, procedures for client contact etc.). Therefore, as a general practice, the formal report created by the testers should also follow organisational standard reporting formats.
Report issues related to systems/platforms
If there is an issue with the testing platform (e.g. performance issues, bugs, service unavailability), this needs to be raised with the relevant technical support personnel. Generally, in an organisational setup, this is first addressed by the internal technical support team, who would then escalate the issue to the vendor's support team.
If testers are directly dealing with the vendor there may be vendor-specific communication protocols that they would have to follow in order to report issues.
For example, if there’s an issue with the services used in Microsoft Power BI, this should be reported to Microsoft support using the communication protocol specified by the vendor. Refer to Microsoft’s guidelines on this process at Power BI Desktop diagnostics collection - Power BI | Microsoft Docs
Use ticketing systems to track and manage issues
Organisations use various ticketing systems (e.g. JIRA, ServiceNow) so that they can:
- track each reported issue in detail
- pass it onto the correct person who can solve the problem
- have a documented trail of evidence related to the issue and all steps taken to solve the issue
- track the time taken to solve the problem
- have visibility of the number of issues found in the project.
The protocols and techniques used when reporting an issue are to:
- create a new ticket in the system (this will generate a unique ticket ID for the issue)
- describe what the issue is and include all necessary details relevant to the issue (e.g. how the issue was found, at what stage of testing was it found, etc.)
- categorise the issue according to its priority or urgency (high, medium, low)
- assign the report to the appropriate person who will have the expertise to look into the issue further or direct it to the client, third party contact etc. as appropriate
- have some indication of the status of the issue (e.g. open, pending, in progress, verified, solved).
Whichever ticketing system an organisation uses, it will usually follow the same general protocols and techniques for processing and reporting issues during the big data sample testing process.
Generate formal test results reports
During and after the testing of big data samples, the results of the conducted tests can be collated into formal reports. These reports can then be presented and shared with clients, relevant stakeholders and management.
A formal test results report should include:
- relevant evidence of all tests carried out on the big data sample
- details of the raw big data sample
- sampling strategies used
- details of tools and platforms used
- validation tests performed and their results at each stage of the testing process using test case scenario documents and results sheets
- recommendations and suggested solutions for issues
- final outcome of the sample tests.
It is also important to use clear, specific and industry-related terminology to represent test results.
It is important to note that organisations have their own standard report formats and protocols for generating reports such as report templates, graph visualisation templates, style guides etc. Therefore, as a general practice, the formal report created by the testers should also follow organisational standard reporting formats.
The final activity for this sub-topic will help you prepare for your formal assessment.
You must work within your organisation’s policies and procedures for big data testing. Consider this:
- A ‘policy’ is a document which explains why a task should be undertaken
- A ‘procedure’ is a document which explains how the task should be completed
Policies and procedures for big data testing are formulated around the following key processes.
For this unit, you must be particularly familiar with the following policies and procedures.
Assembling and obtaining raw big data
This is the first step in the process of testing big data samples. Raw big data is obtained from internal and/or external sources. It is then assembled into a software program (e.g. Apache Hadoop, Power BI Desktop etc) for validation and testing. You should follow your organisation’s policies and procedures when undertaking these tasks.
Policy for obtaining raw big data | Helps to identify the internal or external sources of raw big data required for the job, any legislative requirements related to the data samples and relevant procedures to acquire the raw big data. |
---|---|
Procedure for assembling raw big data | Provides guidelines on how to compile the big data sample for testing from the previously acquired raw dataset. |
Isolating sub-standard data
Procedure for isolating sub-standard data | Provides guidelines on how to separate those types of data that do not meet the quality standards set by the organisation and are unsuitable for further use (e.g. in analysis). You should isolate this data – that is, remove it from the dataset. |
---|---|
Procedure for correcting data acquisition paths | Provides guidelines and steps on how to change, update and adjust the data source settings. |
Testing data sources and storing test results
Procedure for storing test results and associated support evidence policy | Provides guidelines on how and where to securely save the output and outcomes of the sample data tested. This also includes information on how to store all documents that contain proof of tests performed on the data set so it is secure and can be retrieved when needed. |
---|---|
Procedure for testing transactional data sources | Provides guidelines on how to test data sources that have a time dimension and become historical once the operation is complete. |
Procedure for testing non-transactional data sources | Provides guidelines on how to test data sources that are relevant to an organisation for a longer duration. |
Data cleansing
Data cleansing is done soon after extract, transform and load (ETL) testing – these policies and procedures explain how to validate the big data sample to ensure it is correct.
ETL Testing Procedure | Involves a set of procedures for extracting data from different sources, then transforming the data and loading the data into the target system. Often done using software tools. |
---|---|
Data Cleansing Policy | Provides guidelines on how to eliminate bad records, fix any formatting issues, perform de-duplication, and correct any inconsistencies present in the data after the data is collected. |
Output Quality Assurance Policy | Provides best practices and guidelines to ensure that the test results of the big data sample are presented clearly and consistently using standard representations (e.g charts, graphs etc). May include guidelines to ensure reports can be generated in a variety of formats that support different devices. |
The final activity for this sub-topic will help you prepare for your formal assessment.
Topic Summary
Congratulations on completing your learning for this topic, Big data sample testing fundamentals.
In this topic you have learnt the following fundamental concepts behind big data testing.
- Legislative requirements
- Features and formats of various big data sources
- Writing queries and scripts
- Big data validation protocols
- Performance testing
- Processing and reporting issues
- Organisational policies and procedures
Check your learning
The final activity for this topic is a set of questions that will help you prepare for your formal assessment.
Assessments
Now that you have covered the basic knowledge required for this module, you are ready to complete the following two assessment events:
- Assessment 1 (Online Quiz)
- Assessment 2 (Short answer questions).
What’s next?
Next, we will dive into the more practical tasks related to validating assembled or obtained big data samples.