Big Data
What is Big Data?
Big Data refers to datasets so large, fast-arriving, or varied in structure that traditional database systems cannot process them effectively. It is characterised by three properties:
| Property | Description | Example |
|---|---|---|
| Volume | Data is too large for a single server | All tweets ever posted; genomic sequences for millions of patients |
| Velocity | Data arrives as a continuous, near-real-time stream | Stock market ticks; IoT sensor readings; social media feeds |
| Variety | Data has diverse formats — structured, unstructured, and semi-structured | Text posts, video files, GPS coordinates, purchase records |
A single application may exhibit all three: a social media platform generates high-volume, high-velocity data in many formats (text, images, video, location data, interaction logs).
Why Relational Databases Are Unsuitable
Traditional relational databases assume:
- Data has a fixed schema: columns are defined before data is inserted
- Data is structured: every row conforms to the same format
- Queries run on a single server (or a small cluster)
Big Data breaks each of these assumptions:
| Relational assumption | Big Data reality |
|---|---|
| Fixed schema (table columns pre-defined) | Unstructured data (tweets have no fixed schema) |
| Structured rows | Mixed formats: text, images, JSON, binary |
| Single-server queries | Data volume requires distributed processing across many servers |
| Consistent rows | Streaming data arrives before a schema can be imposed |
Solution: distributed processing frameworks split data across many machines and process it in parallel. Each node handles a subset of the data, and results are aggregated. This allows processing of datasets far too large for any single machine.
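The split, process, and aggregate pattern described above can be sketched on a single machine. This is a minimal illustration, not a real framework: threads stand in for separate servers, and the names `split_into_shards`, `process_shard`, and `distributed_mean` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(records, shard_count):
    """Deal records out round-robin so each shard holds a roughly equal subset."""
    return [records[i::shard_count] for i in range(shard_count)]


def process_shard(shard):
    """Work done independently on one node: a partial sum and a partial count."""
    return sum(shard), len(shard)


def distributed_mean(records, shard_count=4):
    """Compute a mean by aggregating partial results from every shard."""
    shards = split_into_shards(records, shard_count)
    # Each worker thread stands in for a separate machine handling one shard.
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(process_shard, shards))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


readings = list(range(1, 101))
mean = distributed_mean(readings)
```

No node ever needs the whole dataset: each returns only a small partial result, and the final answer is assembled from those.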
Machine Learning for Pattern Discovery
Big Data's value comes from discovering patterns that are hidden in very large, unstructured datasets. Traditional SQL queries find facts you already know to ask about. Machine learning finds patterns you didn't know to look for.
Why machine learning is needed:
- Unstructured data (text, images, audio) cannot be queried meaningfully with SQL alone
- The patterns are too subtle or complex for hand-coded rules
- The volume is too large for manual inspection
Examples:
- Recommending products based on millions of purchase histories
- Detecting fraudulent transactions by finding anomalies in billions of records
- Identifying cancer markers in medical imaging datasets of hundreds of thousands of scans
- Predicting traffic patterns from millions of GPS traces
Machine learning algorithms identify statistical patterns in training data and generalise them to new data — without being explicitly programmed with rules for each case.
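To make "learn from training data, then generalise" concrete, here is a toy nearest-centroid classifier. It is a sketch under stated assumptions: the transaction data is invented for illustration, the function names `train` and `predict` are hypothetical, and real fraud detection uses far richer models and features.

```python
from math import dist


def train(examples):
    """Learn one centroid (mean point) per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Label a new point with the label of the closest learned centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical transactions: (amount in pounds, hour of day).
training = [
    ((12.0, 13.0), "normal"),
    ((30.0, 18.0), "normal"),
    ((18.0, 10.0), "normal"),
    ((950.0, 3.0), "fraud"),
    ((1200.0, 4.0), "fraud"),
]
model = train(training)
label = predict(model, (1000.0, 2.0))  # a transaction not seen during training
```

No rule such as "amounts over £900 at night are suspicious" was written by hand. The boundary comes from the training examples, which is the contrast with hand-coded rules drawn above.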
Functional Programming for Big Data
Functional programming is well-suited to distributed Big Data processing because its core properties eliminate classes of bugs that are catastrophic in distributed systems:
| FP property | Why it matters for Big Data |
|---|---|
| Immutable data | Data is never modified in place; transforms produce new datasets → no accidental overwrites or race conditions when running on many parallel nodes |
| Stateless functions | A function's output depends only on its input, not on shared state → functions can safely run in parallel on different machines without side effects |
| Higher-order functions | map, filter, reduce express distributed patterns cleanly: map the function across all nodes; reduce the results back |
The MapReduce paradigm (used in Hadoop and similar systems) directly implements functional programming concepts: map a function across distributed data shards, then reduce the partial results into a final answer.
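The map-then-reduce idea can be sketched with Python's built-in `map` and `functools.reduce`. This is a single-process illustration of the concept, not Hadoop's API: the shards are small in-memory lists, and `map_shard` and `merge_counts` are hypothetical names.

```python
from collections import Counter
from functools import reduce


def map_shard(lines):
    """Map step: count words in one shard without touching any shared state."""
    return Counter(word.lower() for line in lines for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into a new Counter."""
    return left + right


# Each inner list stands in for the data held on one node.
shards = [
    ["big data is big", "data arrives fast"],
    ["fast data is varied"],
]

partials = list(map(map_shard, shards))             # could run on separate nodes
totals = reduce(merge_counts, partials, Counter())  # aggregate partial results
```

Because `map_shard` depends only on its input and `merge_counts` returns a new value instead of modifying either argument, the shards can be processed in any order, or at the same time, with the same result.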
Fact-Based Model and Graph Schema
Fact-based model
Traditional relational databases update records in place (an UPDATE statement overwrites old data). For Big Data analytics, a fact-based model is often used instead:
- Each new piece of information is stored as an immutable fact appended to the dataset
- Facts are never updated or deleted — only additions are made
- The "current state" is derived by querying all relevant facts
Example: instead of updating a customer's address (losing the old address), a new fact (CustomerID, Address, Timestamp) is appended. All past addresses are preserved for audit and analysis.
This approach is aligned with functional programming's immutability principle.
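A minimal sketch of the customer address example, assuming an in-memory tuple as the fact log. The names `AddressFact`, `append_fact`, `current_address`, and `address_history` are hypothetical, and the addresses are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a fact cannot be modified once created
class AddressFact:
    customer_id: int
    address: str
    timestamp: str  # ISO 8601 date, so string order matches time order


def append_fact(facts, fact):
    """Return a new log with the fact added; the existing log is untouched."""
    return facts + (fact,)


def current_address(facts, customer_id):
    """Derive the current state by querying all relevant facts."""
    relevant = [f for f in facts if f.customer_id == customer_id]
    if not relevant:
        return None
    return max(relevant, key=lambda f: f.timestamp).address


def address_history(facts, customer_id):
    """Every past address is still available, oldest first."""
    ordered = sorted(facts, key=lambda f: f.timestamp)
    return [f.address for f in ordered if f.customer_id == customer_id]


log = ()
log = append_fact(log, AddressFact(1, "1 Old Road", "2021-03-01"))
log = append_fact(log, AddressFact(1, "9 New Street", "2023-07-15"))
```

Here `current_address(log, 1)` gives the most recent address, while `address_history(log, 1)` still returns both. An in-place UPDATE would have discarded the first.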
Graph schema
A graph schema represents a dataset using three elements:
- Nodes: entities (a person, a product, a location)
- Edges: relationships between nodes (Alice BOUGHT a Laptop; Bob FOLLOWS Alice)
- Properties: attributes of nodes or edges (the Laptop has price=£999; Alice FOLLOWS Bob since 2022-01-01)
Graph schemas are more flexible than relational tables — new relationship types can be added without altering a schema. They are particularly powerful for social networks, knowledge graphs, and recommendation systems where relationships are first-class data.
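The three elements can be sketched as a small in-memory structure, reusing the Alice, Bob, and Laptop example. This is an illustrative assumption rather than the API of any graph database: the `Graph` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node name -> properties
    edges: list = field(default_factory=list)  # (source, relationship, target, properties)

    def add_node(self, name, **properties):
        self.nodes[name] = properties

    def add_edge(self, source, relationship, target, **properties):
        self.edges.append((source, relationship, target, properties))

    def neighbours(self, source, relationship):
        """Targets reachable from source along edges of the given relationship."""
        return [t for s, r, t, _ in self.edges if s == source and r == relationship]


graph = Graph()
graph.add_node("Alice")
graph.add_node("Bob")
graph.add_node("Laptop", price=999)
graph.add_edge("Alice", "BOUGHT", "Laptop")
graph.add_edge("Bob", "FOLLOWS", "Alice")
graph.add_edge("Alice", "FOLLOWS", "Bob", since="2022-01-01")
```

A new relationship type, such as a hypothetical `REVIEWED` edge, is just another call to `add_edge`. Nothing equivalent to altering a table definition is needed.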
Common Exam Mistakes
1. Stating relational databases "can't handle" Big Data
Relational databases are unsuitable — not technically incapable. The problems are scale (volume requires distribution), schema rigidity (variety requires flexible structure), and real-time processing (velocity requires stream handling). A well-managed relational DB can handle moderate scale; it fails at Big Data's extremes.
2. Confusing the three Vs
Volume = size (how much). Velocity = speed (how fast it arrives). Variety = formats (how different). Each is a distinct problem requiring different solutions. Answers that conflate them lose marks.
3. Omitting immutability from the functional programming explanation
The key properties that make functional programming suitable for Big Data are immutability, statelessness, and higher-order functions. Simply saying "functional programming is faster" or "it supports parallel processing" without explaining WHY (immutable + stateless → safe parallelism) is an incomplete answer.
4. Describing a fact-based model as the same as a relational database
A relational database updates records in place. A fact-based model appends immutable facts and never modifies them. The distinction is update versus append: a meaningful conceptual difference, not just an implementation detail.