
Mini-glossary: Big data terms you should know

These 20 terms are essential to learn if your IT job requires you to work with big data in any capacity.


When it comes to assembling a list of key big data terms, it makes sense to identify terms that everyone needs to know — whether they are highly technical big data practitioners, or corporate executives who confine their big data interests to dashboard reports. These 20 big data terms hit the mark.


Analytics

The discipline of using software-based algorithms and statistics to uncover meaning from data.


Algorithm

A mathematical formula placed in a software program that performs an analysis on a dataset. The algorithm often consists of multiple calculation steps. Its goal is to operate on data in order to answer a particular question or solve a particular problem.

Behavioral analytics

An analytics methodology that uses data collected about users' behavior to understand intent and predict future actions.

Big data

Data that is not system of record data, and that meets one or more of the following criteria: it comes in extremely large datasets that exceed the size of system of record datasets; it comes from diverse sources, including but not limited to machine-generated data, internet-generated data, computer log data, data from social media sources, or graphics and voice-based data.

Business intelligence (BI)

A set of methodologies and tools that analyze, report, manage, and deliver information that is relevant to the business, and that includes dashboards and query/reporting tools similar to those found in analytics. One key difference between analytics and BI is that analytics uses statistical and mathematical data analysis to predict future outcomes. In contrast, BI analyzes historical data to provide insights and trend information.

Clickstream analytics

The analysis of users' online activity based on the items that users click on a web page.
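
As a rough illustration of what clickstream analysis can look like in practice, the Python sketch below counts clicks per page element and reconstructs one user's click path. The event data, the `(user, element)` layout, and the function names are all hypothetical and chosen only for this example; real clickstream tools work with far richer event records.

```python
from collections import Counter

# Hypothetical click events as (user_id, element_clicked) pairs.
# Illustrative data only; not taken from any real system.
clicks = [
    ("u1", "home/banner"),
    ("u2", "home/banner"),
    ("u1", "products/buy-button"),
    ("u3", "home/search"),
    ("u2", "products/buy-button"),
    ("u1", "home/banner"),
]


def clicks_per_element(events):
    """Count how many times each page element was clicked."""
    return Counter(element for _user, element in events)


def path_for_user(events, user_id):
    """Return the elements one user clicked, in the order they occurred."""
    return [element for user, element in events if user == user_id]


counts = clicks_per_element(clicks)
print(counts.most_common(2))
print(path_for_user(clicks, "u1"))
```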


Dashboard

A graphic report on a desktop or mobile device that gives managers and others quick summaries of activity status. This high-level graphic report often features a green light (all operations are normal), a yellow alert (there is some operational impact), or a red alert (there is an operational stoppage). This "eyeshot" visibility of events and operations enables employees to track operations status, and to quickly drill down into details whenever needed.
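
The green/yellow/red convention described above can be sketched as a small rule in Python. The function name and its two inputs (counts of impacted and stopped operations) are assumptions made for illustration; an actual dashboard product would define its own thresholds.

```python
def status_light(impacted_operations, stopped_operations):
    """Map operational counts to a traffic-light status.

    red:    at least one operation has stopped
    yellow: no stoppage, but at least one operation is impacted
    green:  all operations are normal
    """
    if stopped_operations > 0:
        return "red"
    if impacted_operations > 0:
        return "yellow"
    return "green"


print(status_light(impacted_operations=0, stopped_operations=0))
print(status_light(impacted_operations=2, stopped_operations=0))
print(status_light(impacted_operations=2, stopped_operations=1))
```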

Data aggregation

The collection of data from multiple and diverse sources with the intention of bringing all of this data together into a common data repository for the purposes of reporting and analysis.

Data analyst

A person responsible for working with end business users to define the types of analytics reports needed in the business, and then capturing, modeling, preparing, and cleaning the required data for the purpose of developing analytics reports on this data that business users can act on.

Data analytics

The science of examining data with software-based queries and algorithms with the goal of drawing conclusions about that information for business decision making.

Data governance

A set of data management policies and practices defined to ensure that data availability, usability, quality, integrity, and security are maintained.

Data mining

An analytic process where data is "mined" or explored, with the goal of uncovering potentially meaningful data patterns or relationships.

Data repository

A central data storage area.

Data scientist

An expert in computer science, mathematics, statistics, and/or data visualization who develops complex algorithms and data models for the purpose of solving highly complex problems.

ETL (extract, transform, and load)

ETL enables companies to take data from one database and move it to another database. ETL is accomplished by extracting data from the source database, transforming the data into a format that the target database can use, and then loading the transformed data into the target database. The ETL process enables companies to move data in and out of different data storage areas to create new combinations of data for analytics queries and reports.
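
The three ETL steps can be sketched in a few lines of Python using the standard library's `sqlite3` module. This is a minimal illustration, not a production pipeline: the two in-memory databases, the `orders` and `orders_report` tables, and their columns are hypothetical and exist only for this example.

```python
import sqlite3


def extract(source):
    """Extract: read the raw rows from the source database."""
    return source.execute("SELECT id, name, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transform: tidy customer names and convert cents to dollars."""
    return [(row_id, name.strip().title(), cents / 100) for row_id, name, cents in rows]


def load(target, rows):
    """Load: write the transformed rows into the target database."""
    target.executemany(
        "INSERT INTO orders_report (id, customer, amount_dollars) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()


# Hypothetical source database with inconsistent name formatting.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 399)],
)

# Hypothetical target database with its own schema.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE orders_report (id INTEGER, customer TEXT, amount_dollars REAL)"
)

load(target, transform(extract(source)))
result = target.execute("SELECT * FROM orders_report ORDER BY id").fetchall()
print(result)
```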


Hadoop

Administered by the Apache Software Foundation, Hadoop is a batch processing software framework that enables the distributed processing of large data sets across clusters of computers.


HANA

A software/hardware in-memory computing platform from SAP designed to process high-volume transactions and real-time analytics.

Legacy system

An established computer system, application, or technology that continues to be used because of the value it provides to the enterprise.


MapReduce

A big data batch processing framework that breaks up a data analysis problem into pieces that are then mapped and distributed across multiple computers on the same network or cluster, or across a grid of disparate and possibly geographically separated systems. The results of the analytics performed on these pieces are then collected and combined into a distilled or "reduced" report.
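
The map-then-reduce idea can be illustrated with a toy Python sketch that runs on a single machine. It is not the Hadoop API and distributes nothing; the log lines, the chunking, and the function names are hypothetical, and each chunk simply stands in for the piece of data that one computer in a cluster would handle.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (log_level, 1) pair for every line in one chunk."""
    return [(line.split()[0], 1) for line in chunk]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Each inner list stands in for the data held by one machine (illustrative only).
chunks = [
    ["ERROR disk full", "INFO started"],
    ["INFO stopped", "ERROR timeout", "INFO started"],
]

totals = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(totals)
```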

System of record (SOR) data

Data that is typically found in fixed record lengths, with at least one field in the data record serving as a data key or access field. System of record data makes up company transaction files, such as orders that are entered, parts that are shipped, bills that are sent, and records of customer names and addresses.

What terms would you add?

Is there an essential big data term that we missed, or is there one that always trips you up? Tell us which terms you'd add to this mini-glossary.


Note: TechRepublic and ZDNet are CBS Interactive properties.

About Mary Shacklett

Mary E. Shacklett is president of Transworld Data, a technology research and market development firm. Prior to founding the company, Mary was Senior Vice President of Marketing and Technology at TCCU, Inc., a financial services firm; Vice President o...
