COMP3211 Advanced Databases - 1 Hour Exam Cheatsheet

1. Information Retrieval

Appears every year. Learn the routine.

Precision / Recall

Precision = relevant retrieved / retrieved
Recall = relevant retrieved / total relevant

For ranked answers, scan left to right. Each time a relevant doc appears, update precision and recall. Standard recall levels are usually 0%, 10%, 20%, ... until unattainable.

Term Selection

Stop-word removal: remove common words: the, of, and, in.
Stemming/lemmatisation: reduce variants: stars/star, databases/database.
Case folding: UNIX/unix treated the same unless stated otherwise.

Binary Term Vector

Make one dimension per selected term. Document has 1 if term appears, else 0. Query vector uses same dimensions.

TF-IDF

tf-idf(t,d) = tf(t,d) * idf(t)
tf(t,d) = count(t in d) / total selected terms in d
idf(t) = log2(N / df(t))

If term is absent, weight is 0. If it appears once in a 4-term title, tf=1/4, so weight is idf/4. Use the paper's exact term set after stop-word removal/stemming.

Rocchio Feedback

q' = alpha*q + beta/|Dr| * sum(d in Dr) d - gamma/|Dnr| * sum(d in Dnr) d

Relevant documents push query towards their terms. Irrelevant documents push query away. If constants not given, explain qualitatively or use symbolic alpha/beta/gamma.

2. Access Structures

Core Q1 topic: B/B+ trees, indexes, hashing, block maths.

Record / Block Maths

records per block = floor(usable block bytes / record bytes)
blocks = ceil(tuple count / records per block)

Use 4000 bytes usable when stated. Records do not span blocks.

Encoding Attributes

Choose minimum fixed bytes that can represent distinct values/range. Common rule: bytes needed = smallest b where 2^(8b) >= values.

Dense / Sparse Index

Index benefits: fewer disk accesses for lookup/range queries, can support ordering, joins, uniqueness checks.
Clustering: physically store records near/sorted by search key, so related records need fewer block reads.
Dense: one key/pointer pair for every record in the data file; needed for secondary indexes.
Sparse: one entry per data block; requires ordered/clustered file.
Secondary non-key: usually key -> bucket of record pointers.

B+Tree Capacity

primary leaf capacity lp = floor((b - p) / record_len)
secondary leaf capacity ls = floor((b - p) / (key_len + p))
non-leaf fanout i = floor((b - p) / (key_len + p))
effective capacity = floor(capacity * occupancy), e.g. occupancy 0.6

In the lecture arithmetic, primary-index leaves hold records; secondary-index leaves hold key + record-pointer entries. Non-leaf entries use key + block pointer, with one extra pointer accounted for by b-p.

B+Tree Order n Rules

non-leaf max pointers = n+1, max keys = n
non-leaf min pointers = ceil((n+1)/2), min keys = ceil((n+1)/2)-1
leaf max data pointers = n+1, min data pointers = floor((n+1)/2)
root minimum = 1 key/pointer unless empty

B/B+Tree Operations

Insert into correct leaf. If overflow, split and promote separator.
If parent overflows, split parent recursively.
B+Tree leaves are linked by sequence pointers for efficient range scans.
Deletion: if underfull, borrow from sibling if possible, otherwise merge and update parent.

Extensible Hashing

Directory uses global depth i bits of hash.
Bucket has local depth. On overflow, split bucket.
If local depth equals global depth, double directory first.

Bitmap Index

One bitmap per value/category. Query uses Boolean ops: AND for combined conditions, OR for alternatives, NOT for negation.

3. Relational Algebra + Query Processing

Repeated in 2022-23, 2023-24, 2024-25.

Canonical SQL to RA Tree

Start with Cartesian product of all FROM relations.
Put full WHERE predicate as selection above product.
Put projection at top for SELECT attributes.

Optimise RA Tree

Push selections down as far as possible.
Push projections down, keeping join attributes.
Replace product + join predicate with join.
Do most selective joins/selections early.

Useful Cardinality Estimates

Selection A=c: T(R) / V(R,A)
Range selection: use given fraction or reasonable assumption
Join R.A=S.B: T(R)*T(S) / max(V(R,A), V(S,B))

For key-foreign-key join, result often equals size of foreign-key side if every FK matches.

Nested Loop Join Cost

Block nested loop: B(outer) + ceil(B(outer)/(M-2)) * B(inner)

Choose smaller relation as outer to minimise cost. M is buffer blocks.

Logical vs Physical Plan

Logical: algebraic operators: select, project, join.
Physical: concrete algorithms/access paths: scan, index lookup, nested loop, hash join, sort-merge.

System Catalogue

Stores schemas, indexes, constraints, tuple counts, block counts, distinct value counts. Optimiser needs it for cost/cardinality estimates.

4. Transactions, Concurrency, Logging

Very important in recent papers.

ACID

Atomicity: all or nothing.
Consistency: constraints preserved.
Isolation: concurrent result equivalent to serial execution.
Durability: committed effects survive crashes.

Concurrency Problems

Lost update: two transactions read same old value; later write overwrites earlier update.
Dirty read: reads uncommitted data.
Unrepeatable read: same read gives different values.

Serialisability

Non-serial schedule is OK if equivalent to some serial order. For conflict serialisability, build precedence graph; cycle means not serialisable.

Timestamp Ordering

Read by T allowed if TS(T) >= WTS(X); then set RTS(X)=max(RTS(X), TS(T)).
Write by T allowed if TS(T) >= RTS(X) and TS(T) >= WTS(X); then set WTS(X)=TS(T).
Otherwise abort/restart T.

Undo Logging

Logging mainly supports atomicity and durability; isolation is handled by concurrency control.
Log old values: <T, X, old>.
Log record must reach disk before changed data item.
Commit record written after all changed data has reached disk.
Recovery: undo transactions without commit, scanning backwards.

Redo Logging

Log new values: <T, X, new>.
Commit record reaches disk before changed data pages.
Recovery: redo committed transactions, usually scanning forwards.

Checkpointing

Reduces how far back recovery must scan. Non-quiescent checkpoint records active transactions; recovery starts from checkpoint plus active transactions.

5. Distributed + Parallel Databases

Often appears as a big applied question.

Fragmentation

Distributed DBMS design dimensions: distribution, heterogeneity, autonomy.
Horizontal: split rows using predicates.
Vertical: split columns; include key to reconstruct.
Fragment reduction/localisation: use query predicate to ignore irrelevant fragments.

Parallel Join Patterns

Direct/local join: join matching fragments where they already live.
Broadcast join: send smaller relation/fragment to all nodes holding larger relation fragments.
Repartitioned join: hash/range repartition both relations on join key, then local joins.
Union: collect partial results from workers.

Distributed Join

Ship whole: send R to S site or S to R site; choose lower transfer + join cost.
Semijoin: send projected join attributes, filter remote relation, send only matching tuples back.
Semijoin is bad if projected keys are large or most remote tuples match anyway.

2PC

Coordinator asks participants to vote.
If all vote yes, coordinator commits; otherwise aborts.
Coordinator timeout in WAIT: may abort because a missing vote means it cannot safely commit.
Coordinator timeout in COMMIT/ABORT: resend the global decision to participants without acknowledgements.
Participant timeout in INITIAL: may unilaterally abort. Participant timeout in READY: blocked unless another participant/coordinator knows the decision.
Presumed abort reduces logging/messages for aborts/read-only; presumed commit optimises commits.

Parallel Cost Assumptions

Read the question carefully. Past paper assumptions include free scan/union, aggregation cost Card(R), and indexed join cost using smaller indexed side.

6. Fast Exam Strategy

Question Choice

You usually answer 3 of 4. Pick the questions with the most mechanical marks first:

IR: almost always highly template-based.
Access structures: good if you practise block/B-tree maths.
RA/query processing: good if you can draw trees and show working.
Logging/concurrency: good if you know algorithms exactly.

Common Mark Traps

Do not forget floor for entries per block and ceil for number of blocks.
Always state assumptions if selectivity/range distribution is ambiguous.
For RA, keep join attributes when pushing projections.
For B+Tree, remember leaf sequence pointers support range queries.
For recovery, separate committed from uncommitted transactions before deciding redo/undo.

One-Hour Revision Order

Minutes	Do this
0-10	Memorise IR formulas and Rocchio routine.
10-25	Practise block maths, B+Tree capacity, dense/sparse/secondary indexes.
25-38	Review RA tree rewrite steps and join cost formula.
38-50	Review ACID, timestamp rules, undo/redo recovery.
50-60	Skim distributed/parallel joins, semijoin, 2PC states.

Low Priority Unless Needed

Non-relational databases, decentralised databases, broad intro material. Conjunctive query containment appeared in 2021-22 but is not obvious in the current lecture list.

COMP3211 Advanced Databases - Last-Hour Cheatsheet