Beginner

Aicademy · A-Level Computer Science · AQA 7517 · 6 min
4.11.1 Big Data

What is Big Data?

Big Data refers to datasets so large, fast-arriving, or varied in structure that traditional database systems cannot process them effectively. It is characterised by three properties:

| Property | Description | Example |
| --- | --- | --- |
| Volume | Data is too large for a single server | All tweets ever posted; genomic sequences for millions of patients |
| Velocity | Data arrives as a continuous, near-real-time stream | Stock market ticks; IoT sensor readings; social media feeds |
| Variety | Data has diverse formats: structured, unstructured, and semi-structured | Text posts, video files, GPS coordinates, purchase records |

A single application may exhibit all three: a social media platform generates high-volume, high-velocity data in many formats (text, images, video, location data, interaction logs).

Why Relational Databases Are Unsuitable

Traditional relational databases assume:

  • Data has a fixed schema: columns are defined before data is inserted
  • Data is structured: every row conforms to the same format
  • Queries run on a single server (or a small cluster)

Big Data breaks each of these assumptions:

| Relational assumption | Big Data reality |
| --- | --- |
| Fixed schema (table columns pre-defined) | Unstructured data (tweets have no fixed schema) |
| Structured rows | Mixed formats: text, images, JSON, binary |
| Single-server queries | Data volume requires distributed processing across many servers |
| Consistent rows | Streaming data arrives before a schema can be imposed |

Solution: distributed processing frameworks split data across many machines and process it in parallel. Each node handles a subset of the data, and results are aggregated. This allows processing of datasets far too large for any single machine.
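The split, process, and aggregate pattern described above can be sketched in a few lines. This is a minimal illustration rather than a real framework: worker threads stand in for separate machines, and the function names and the `readings` data are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_nodes):
    """Divide the dataset into roughly equal chunks, one per simulated node."""
    chunk_size = max(1, -(-len(data) // num_nodes))  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one node: here, a partial sum."""
    return sum(chunk)


def distributed_sum(data, num_nodes=4):
    chunks = split_into_chunks(data, num_nodes)
    # Each worker thread stands in for a separate machine (an assumption).
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    # Aggregate the partial results into the final answer.
    return sum(partial_results)


readings = list(range(1, 1001))  # hypothetical sensor readings
total = distributed_sum(readings)
```

No node ever needs the whole dataset: each sees only its own chunk, and only the small partial results are brought together at the end.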

Machine Learning for Pattern Discovery

Big Data's value comes from discovering patterns that are hidden in very large, unstructured datasets. Traditional SQL queries find facts you already know to ask about. Machine learning finds patterns you didn't know to look for.

Why machine learning is needed:

  • Unstructured data (text, images, audio) does not fit the rows and columns that SQL queries operate on
  • The patterns are too subtle or complex for hand-coded rules
  • The volume is too large for manual inspection

Examples:

  • Recommending products based on millions of purchase histories
  • Detecting fraudulent transactions by finding anomalies in billions of records
  • Identifying cancer markers in medical imaging datasets of hundreds of thousands of scans
  • Predicting traffic patterns from millions of GPS traces

Machine learning algorithms identify statistical patterns in training data and generalise them to new data — without being explicitly programmed with rules for each case.
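As a rough illustration of learning from training data and then generalising, the sketch below fits a very simple statistical model (mean and standard deviation) to past transaction amounts and uses it to flag unusual new ones. It is a deliberately tiny stand-in for a real machine learning algorithm; the amounts, the function names, and the three-standard-deviation threshold are all assumptions.

```python
from statistics import mean, stdev


def train(amounts):
    """Learn a simple statistical model (mean and spread) from training data."""
    return {"mean": mean(amounts), "stdev": stdev(amounts)}


def is_anomaly(model, amount, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    z_score = abs(amount - model["mean"]) / model["stdev"]
    return z_score > threshold


# Hypothetical past transaction amounts (in pounds) used as training data.
training_amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 26.0]
model = train(training_amounts)

# The model is applied to transactions it has never seen before.
new_transactions = [22.5, 950.0]
flags = [is_anomaly(model, amount) for amount in new_transactions]
```

Nobody wrote a rule saying "950 is suspicious": the boundary between normal and unusual comes from the training data itself.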

Functional Programming for Big Data

Functional programming is well-suited to distributed Big Data processing because its core properties eliminate classes of bugs that are catastrophic in distributed systems:

| FP property | Why it matters for Big Data |
| --- | --- |
| Immutable data | Data is never modified in place; transforms produce new datasets → no accidental overwrites or race conditions when running on many parallel nodes |
| Stateless functions | A function's output depends only on its input, not on shared state → functions can safely run in parallel on different machines without side effects |
| Higher-order functions | map, filter, reduce express distributed patterns cleanly: map the function across all nodes; reduce the results back |
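The three properties can be seen together in a short sketch. The temperature readings and the `to_fahrenheit` function are made up for illustration; the point is only that the original data is never changed and that each function depends on nothing but its argument.

```python
from functools import reduce

# An immutable dataset: a tuple cannot be modified in place.
readings = (12, 7, 30, 18, 5)  # hypothetical temperatures in Celsius


def to_fahrenheit(celsius):
    """Stateless: the result depends only on the argument."""
    return celsius * 9 / 5 + 32


# map and filter build new datasets; `readings` is left untouched.
converted = tuple(map(to_fahrenheit, readings))
warm = tuple(filter(lambda f: f > 60, converted))

# reduce combines the remaining values into a single result.
total = reduce(lambda acc, f: acc + f, warm, 0)
```

Because `to_fahrenheit` reads and writes no shared state, it would give the same answers whether the readings were converted on one machine or spread across many.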

The MapReduce paradigm (used in Hadoop and similar systems) directly implements functional programming concepts: map a function across distributed data shards, then reduce the partial results into a final answer.
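A word count is the usual way to illustrate this. The sketch below imitates the idea in plain Python and does not use Hadoop or any real framework: the `shards` list, `map_shard`, and `merge_counts` are hypothetical names, and each inner list stands in for data held on a different machine.

```python
from collections import Counter
from functools import reduce

# Hypothetical shards: each inner list would live on a different machine.
shards = [
    ["big data is big", "data arrives fast"],
    ["fast data is varied"],
]


def map_shard(lines):
    """Map step: count words within one shard, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into a new Counter."""
    return left + right


partial_counts = list(map(map_shard, shards))
word_counts = reduce(merge_counts, partial_counts, Counter())
```

The map step needs no communication between shards, so it can run everywhere at once; only the compact partial counts have to be sent on for the reduce step.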

Fact-Based Model and Graph Schema

Fact-based model

Traditional relational databases update records in place (an UPDATE statement overwrites old data). For Big Data analytics, a fact-based model is often used instead:

  • Each new piece of information is stored as an immutable fact appended to the dataset
  • Facts are never updated or deleted — only additions are made
  • The "current state" is derived by querying all relevant facts

Example: instead of updating a customer's address (losing the old address), a new fact (CustomerID, Address, Timestamp) is appended. All past addresses are preserved for audit and analysis.

This approach is aligned with functional programming's immutability principle.
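The address example can be sketched as an append-only log of facts. This is one possible way to model it, under the assumption that timestamps are ISO-format date strings (which sort correctly as text); the addresses and helper names are invented for illustration.

```python
from collections import namedtuple

# A fact is an immutable record with the fields from the example above.
Fact = namedtuple("Fact", ["customer_id", "address", "timestamp"])


def append_fact(facts, fact):
    """Return a new fact log with the fact added; the old log is unchanged."""
    return facts + (fact,)


def current_address(facts, customer_id):
    """Derive the current state: the most recent address fact for the customer."""
    relevant = [f for f in facts if f.customer_id == customer_id]
    if not relevant:
        return None
    return max(relevant, key=lambda f: f.timestamp).address


facts = ()
facts = append_fact(facts, Fact(1, "12 Mill Lane", "2021-03-01"))  # hypothetical data
facts = append_fact(facts, Fact(1, "4 Oak Road", "2023-07-15"))

# Every past address is still available for audit and analysis.
history = [f.address for f in facts if f.customer_id == 1]
```

Here `current_address(facts, 1)` is computed from the facts rather than stored, so moving house adds a record instead of overwriting one.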

Graph schema

A graph schema represents a dataset using three elements:

  • Nodes: entities (a person, a product, a location)
  • Edges: relationships between nodes (Alice BOUGHT a Laptop; Bob FOLLOWS Alice)
  • Properties: attributes of nodes or edges (the Laptop has price=£999; Bob FOLLOWS Alice since 2022-01-01)

Graph schemas are more flexible than relational tables — new relationship types can be added without altering a schema. They are particularly powerful for social networks, knowledge graphs, and recommendation systems where relationships are first-class data.
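A graph of this kind can be sketched with ordinary dictionaries and lists. This is an illustration of the idea rather than how any particular graph database stores data; the direction of the FOLLOWS edge follows the Edges bullet, and the VIEWED relationship and the `neighbours` helper are assumptions added for the example.

```python
# Nodes: entities, each with a dictionary of properties.
nodes = {
    "Alice": {"type": "Person"},
    "Bob": {"type": "Person"},
    "Laptop": {"type": "Product", "price": 999},
}

# Edges: (source, relationship, target, properties).
edges = [
    ("Alice", "BOUGHT", "Laptop", {}),
    ("Bob", "FOLLOWS", "Alice", {"since": "2022-01-01"}),
]


def neighbours(source, relationship):
    """Return the nodes reached from `source` along edges of one relationship type."""
    return [target for (src, rel, target, _props) in edges
            if src == source and rel == relationship]


# A new relationship type (hypothetical) needs no schema change: append an edge.
edges.append(("Bob", "VIEWED", "Laptop", {}))
```

Adding VIEWED required no new table or column, which is the flexibility the paragraph above describes.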

Common Exam Mistakes

1. Stating relational databases "can't handle" Big Data

Relational databases are unsuitable — not technically incapable. The problems are scale (volume requires distribution), schema rigidity (variety requires flexible structure), and real-time processing (velocity requires stream handling). A well-managed relational DB can handle moderate scale; it fails at Big Data's extremes.

2. Confusing the three Vs

Volume = size (how much). Velocity = speed (how fast it arrives). Variety = formats (how different). Each is a distinct problem requiring different solutions. Answers that conflate them lose marks.

3. Omitting immutability from the functional programming explanation

The key properties that make functional programming suitable for Big Data are immutability, statelessness, and higher-order functions. Simply saying "functional programming is faster" or "it supports parallel processing" without explaining WHY (immutable + stateless → safe parallelism) is an incomplete answer.

4. Describing a fact-based model as the same as a relational database

A relational database updates records in place. A fact-based model appends immutable facts and never modifies them. The distinction is update versus append: this is a meaningful conceptual difference, not just an implementation detail.
