Beginner

Big Data

Aicademy

A-Level Computer Science · AQA 7517 · 6 min

4.11.1 Big Data

What is Big Data?

Big Data refers to datasets so large, fast-arriving, or varied in structure that traditional database systems cannot process them effectively. It is characterised by three properties, often called the three Vs:

Property	Description	Example
Volume	Data is too large for a single server	All tweets ever posted; genomic sequences for millions of patients
Velocity	Data arrives as a continuous, near-real-time stream	Stock market ticks; IoT sensor readings; social media feeds
Variety	Data has diverse formats — structured, unstructured, and semi-structured	Text posts, video files, GPS coordinates, purchase records

A single application may exhibit all three: a social media platform generates high-volume, high-velocity data in many formats (text, images, video, location data, interaction logs).

Why Relational Databases Are Unsuitable

Traditional relational databases assume:

Data has a fixed schema: columns are defined before data is inserted
Data is structured: every row conforms to the same format
Queries run on a single server (or a small cluster)

Big Data breaks these assumptions:

Relational assumption	Big Data reality
Fixed schema (table columns pre-defined)	Unstructured data (tweets have no fixed schema)
Structured rows	Mixed formats: text, images, JSON, binary
Single-server queries	Data volume requires distributed processing across many servers
Consistent rows	Streaming data arrives before a schema can be imposed
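
The schema mismatch in the table can be illustrated with a small sketch. The column names in `SCHEMA` and the sample records are made up for illustration; a real relational database enforces its schema through table definitions rather than a Python function.

```python
# Hypothetical fixed schema: every row must supply exactly these columns.
SCHEMA = ("user_id", "text", "posted_at")

def fits_schema(record: dict) -> bool:
    """Return True only if the record has exactly the schema's columns."""
    return set(record) == set(SCHEMA)

# Made-up incoming records in mixed shapes, as a varied feed might deliver them.
incoming = [
    {"user_id": 1, "text": "hello", "posted_at": "2024-01-01T09:00:00"},
    {"user_id": 2, "video_url": "clip.mp4", "duration_s": 12},
    {"user_id": 3, "lat": 51.5, "lon": -0.12, "posted_at": "2024-01-01T09:00:02"},
]

accepted = [r for r in incoming if fits_schema(r)]
rejected = [r for r in incoming if not fits_schema(r)]
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Only the first record matches the pre-defined columns; the video and GPS records would need either a schema change or a more flexible store.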

Solution: distributed processing frameworks split data across many machines and process it in parallel. Each node handles a subset of the data, and results are aggregated. This allows processing of datasets far too large for any single machine.
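
The split, process, aggregate idea can be sketched on one machine. This is an illustration only: threads stand in for separate servers, and the names `split_into_shards` and `process_shard` are hypothetical rather than part of any real framework.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_shards(data, shard_count):
    """Deal the items round-robin into shard_count lists."""
    return [data[i::shard_count] for i in range(shard_count)]

def process_shard(shard):
    """Work done by one 'node': here, sum the readings in its shard."""
    return sum(shard)

readings = list(range(1, 101))           # made-up data: the integers 1 to 100
shards = split_into_shards(readings, 4)  # each shard goes to one worker

# Threads stand in for separate machines in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_shard, shards))

total = sum(partial_results)             # aggregate the partial results
print(total)
```

Each worker sees only its own shard, so no worker needs the whole dataset in memory; only the small partial results are brought together at the end.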

Machine Learning for Pattern Discovery

Big Data's value comes from discovering patterns that are hidden in very large, unstructured datasets. Traditional SQL queries answer questions you already know to ask. Machine learning finds patterns you didn't know to look for.

Why machine learning is needed:

Unstructured data (text, images, audio) cannot be queried meaningfully with SQL, which relies on data being organised into defined columns
The patterns are too subtle or complex for hand-coded rules
The volume is too large for manual inspection

Examples:

Recommending products based on millions of purchase histories
Detecting fraudulent transactions by finding anomalies in billions of records
Identifying cancer markers in medical imaging datasets of hundreds of thousands of scans
Predicting traffic patterns from millions of GPS traces

Machine learning algorithms identify statistical patterns in training data and generalise them to new data — without being explicitly programmed with rules for each case.
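
The idea of learning from training data and then generalising to unseen data can be shown with a deliberately tiny statistical sketch. It is far simpler than real machine learning models, and the transaction amounts, the `fit` and `is_anomaly` names, and the threshold of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev

def fit(training_amounts):
    """'Learn' what normal looks like: the mean and spread of past amounts."""
    return mean(training_amounts), stdev(training_amounts)

def is_anomaly(amount, model, threshold=3.0):
    """Flag an amount more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    return abs(amount - mu) > threshold * sigma

# Made-up training data: typical transaction amounts in pounds.
training = [12.0, 15.5, 9.99, 14.0, 11.25, 13.4, 10.8, 12.6]
model = fit(training)

print(is_anomaly(13.0, model))    # an unseen but typical amount
print(is_anomaly(950.0, model))   # an unseen, extreme amount
```

No rule such as "amounts over £100 are suspicious" was written by hand; the boundary between normal and unusual comes from the training data itself.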

Functional Programming for Big Data

Functional programming is well-suited to distributed Big Data processing because its core properties eliminate classes of bugs that are catastrophic in distributed systems:

FP property	Why it matters for Big Data
Immutable data	Data is never modified in place; transforms produce new datasets → no accidental overwrites or race conditions when running on many parallel nodes
Stateless functions	A function's output depends only on its input, not on shared state → functions can safely run in parallel on different machines without side effects
Higher-order functions	`map`, `filter`, `reduce` express distributed patterns cleanly: map applies a function to the data held on every node; reduce combines the partial results into one answer
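
The three properties in the table can be seen together in a short Python sketch. The sensor readings and the `-999.0` "no data" marker are hypothetical values chosen for illustration.

```python
from functools import reduce

# An immutable tuple of made-up sensor readings in degrees Celsius.
# -999.0 is a hypothetical "no data" marker.
readings_c = (18.5, 21.0, -999.0, 19.5)

def to_fahrenheit(c):
    """Stateless function: the result depends only on the argument."""
    return c * 9 / 5 + 32

valid = tuple(filter(lambda c: c > -100, readings_c))      # new tuple, original untouched
converted = tuple(map(to_fahrenheit, valid))               # another new tuple
total_f = reduce(lambda acc, f: acc + f, converted, 0.0)   # fold down to one value

print(converted, total_f)
print(readings_c)   # the original data is unchanged
```

Because `to_fahrenheit` reads no shared state and `readings_c` is never modified, the conversion could be applied to different portions of the data in any order, or at the same time, and give the same result.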

The MapReduce paradigm (used in Hadoop and similar systems) directly implements functional programming concepts: map a function across distributed data shards, then reduce the partial results into a final answer.
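
A word count is a common way to illustrate the map and reduce steps. The sketch below is plain Python on one machine, not Hadoop's API; the `shards` data and the `map_shard` and `merge_counts` names are hypothetical.

```python
from collections import Counter
from functools import reduce

# Made-up shards: each would live on a different machine in a real cluster.
shards = [
    ["big data is big", "data arrives fast"],
    ["fast data is varied"],
]

def map_shard(lines):
    """Map step: count the words within one shard, independently of the others."""
    return Counter(word for line in lines for word in line.split())

def merge_counts(a, b):
    """Reduce step: combine two partial counts into a new Counter."""
    return a + b

partial_counts = [map_shard(shard) for shard in shards]        # one result per shard
word_counts = reduce(merge_counts, partial_counts, Counter())  # single final answer

print(word_counts["data"], word_counts["big"])
```

The map step never needs to see another shard's data, which is what allows it to run on many machines at once; only the compact partial counts are passed to the reduce step.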

Fact-Based Model and Graph Schema

Fact-based model

Traditional relational databases update records in place (an UPDATE statement overwrites old data). For Big Data analytics, a fact-based model is often used instead:

Each new piece of information is stored as an immutable fact appended to the dataset
Facts are never updated or deleted — only additions are made
The "current state" is derived by querying all relevant facts

Example: instead of updating a customer's address (losing the old address), a new fact (CustomerID, Address, Timestamp) is appended. All past addresses are preserved for audit and analysis.

This approach is aligned with functional programming's immutability principle.
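
The customer address example can be sketched as an append-only list of facts. The `AddressFact` type, the function names, and the sample addresses are assumptions made for illustration, not a real database API.

```python
from typing import NamedTuple

class AddressFact(NamedTuple):
    """One immutable fact: this customer had this address as of this time."""
    customer_id: int
    address: str
    timestamp: str   # ISO 8601 strings sort chronologically

facts = []   # append-only store; existing entries are never changed or removed

def record_address(customer_id, address, timestamp):
    """Append a new fact instead of overwriting the old one."""
    facts.append(AddressFact(customer_id, address, timestamp))

def current_address(customer_id):
    """Derive the current state: the most recent fact for this customer."""
    history = [f for f in facts if f.customer_id == customer_id]
    return max(history, key=lambda f: f.timestamp).address if history else None

record_address(42, "1 Old Road", "2021-03-01T10:00:00")
record_address(42, "7 New Street", "2023-06-15T09:30:00")

print(current_address(42))   # derived from the facts, not stored separately
print(len(facts))            # both addresses are still held
```

The equivalent relational UPDATE would leave only one row for customer 42, so the earlier address would no longer be available for audit or analysis.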

Graph schema

A graph schema represents a dataset using three elements:

Nodes: entities (a person, a product, a location)
Edges: relationships between nodes (Alice BOUGHT a Laptop; Bob FOLLOWS Alice)
Properties: attributes of nodes or edges (the Laptop has price=£999; Bob FOLLOWS Alice since 2022-01-01)

Graph schemas are more flexible than relational tables — new relationship types can be added without altering a schema. They are particularly powerful for social networks, knowledge graphs, and recommendation systems where relationships are first-class data.
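
The three elements can be sketched with ordinary Python dictionaries and tuples. This is a minimal illustration, not how a graph database stores data; the `neighbours` helper and the REVIEWED relationship are hypothetical additions.

```python
# Nodes: each entity maps to a dictionary of its properties.
nodes = {
    "Alice": {"type": "Person"},
    "Bob": {"type": "Person"},
    "Laptop": {"type": "Product", "price": 999},
}

# Edges: (source, relationship, target, properties of the edge).
edges = [
    ("Alice", "BOUGHT", "Laptop", {}),
    ("Bob", "FOLLOWS", "Alice", {"since": "2022-01-01"}),
]

def neighbours(source, relation):
    """Return every node reached from `source` by an edge of type `relation`."""
    return [t for s, r, t, _ in edges if s == source and r == relation]

# A new relationship type is just another edge; nothing else has to change.
edges.append(("Bob", "REVIEWED", "Laptop", {"stars": 5}))

print(neighbours("Bob", "FOLLOWS"))
print(neighbours("Bob", "REVIEWED"))
```

Adding REVIEWED needed no change to the existing nodes or edges, whereas a relational design would typically need a new table or column for the new relationship.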

Common Exam Mistakes

1. Stating relational databases "can't handle" Big Data

Relational databases are unsuitable, not technically incapable. The problems are scale (volume requires distribution), schema rigidity (variety requires flexible structure), and real-time processing (velocity requires stream handling). A well-managed relational DB can handle moderate scale; it becomes impractical at Big Data's extremes.

2. Confusing the three Vs

Volume = size (how much). Velocity = speed (how fast it arrives). Variety = formats (how different). Each is a distinct problem requiring different solutions. Answers that conflate them lose marks.

3. Omitting immutability from the functional programming explanation

The key properties that make functional programming suitable for Big Data are immutability, statelessness, and higher-order functions. Simply saying "functional programming is faster" or "it supports parallel processing" without explaining WHY (immutable + stateless → safe parallelism) is an incomplete answer.

4. Describing a fact-based model as the same as a relational database

A relational database typically updates records in place. A fact-based model appends immutable facts and never modifies them. The distinction is update versus append: this is a meaningful conceptual difference, not just an implementation detail.