Stream Binary Records: Validate, Filter, Copy, and Index
Title: Filter & Compact a Binary Record File to a New File with Sparse Index
Level: Difficult
Concepts: Binary File I/O (fopen, fread, fwrite, fseek, ftell, fflush), fixed-size records, runtime sizeof checks, checksums, streaming without full-file load, sparse indexing, error handling
Scenario
A device logs fixed-size binary records to a file. Each file starts with a 16-byte header containing a 4-byte ASCII magic "LOG1", a 4-byte int version, and 8 bytes reserved (zeros). After the header, there are N records, each exactly 24 bytes in this layout:
offset +0 : long ts_ms (8 bytes; milliseconds since boot)
offset +8 : int type (4 bytes; application-defined)
offset +12: int value (4 bytes)
offset +16: int checksum (4 bytes; computed rule below)
offset +20: int reserved (4 bytes; must be 0)
Checksum rule: checksum = (int)((ts_ms & 0xFFFFFFFF) + type + value).
If any field is invalid (e.g., reserved != 0) or checksum mismatches, the record is invalid.
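The layout and checksum rule above can be sketched as a small validation helper. The function names (`unpack_record`, `expected_checksum`, `record_is_valid`) are illustrative and not part of the task; the sketch assumes the 4-byte `int` and 8-byte `long` sizes the task checks at runtime, and that the file uses host byte order.

```c
#include <stdbool.h>
#include <string.h>

/* Copy the five fields out of a raw 24-byte record (host byte order).
   Offsets follow the layout in the text; assumes sizeof(int) == 4 and
   sizeof(long) == 8, which the task verifies at runtime. */
static void unpack_record(const char *buf, long *ts_ms, int *type,
                          int *value, int *checksum, int *reserved)
{
    memcpy(ts_ms,    buf + 0,  sizeof *ts_ms);
    memcpy(type,     buf + 8,  sizeof *type);
    memcpy(value,    buf + 12, sizeof *value);
    memcpy(checksum, buf + 16, sizeof *checksum);
    memcpy(reserved, buf + 20, sizeof *reserved);
}

/* Checksum rule from the text. With an 8-byte long the sum itself cannot
   overflow; converting a sum above INT_MAX to int is implementation-defined
   in C (it wraps on typical platforms). */
static int expected_checksum(long ts_ms, int type, int value)
{
    return (int)((ts_ms & 0xFFFFFFFF) + type + value);
}

/* A record is valid when reserved is 0 and the stored checksum matches. */
static bool record_is_valid(const char *buf)
{
    long ts_ms;
    int type, value, checksum, reserved;

    unpack_record(buf, &ts_ms, &type, &value, &checksum, &reserved);
    return reserved == 0 && checksum == expected_checksum(ts_ms, type, value);
}
```

Using `memcpy` rather than casting the buffer to a wider pointer type avoids alignment and aliasing problems when reading from a `char` array.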
You must stream the input file, validate each record, optionally filter by type and by time window [t_start_ms, t_end_ms], and write only valid & selected records to the output file (using the same 24-byte format). Additionally, write a separate index file that stores a sparse index: for every K-th written record (e.g., every 100th), write a pair (record_number, output_offset) as ASCII text lines ("rec=<num> off=<offset>\n"). The process must not load the whole file into memory and must handle partial trailing bytes as corruption (ignore incomplete tail).
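The optional filters described above reduce to a single predicate. This is a minimal sketch; `passes_filters` is a hypothetical helper name, and the parameter names mirror the function arguments listed later in this document.

```c
#include <stdbool.h>

/* Returns true when a (already validated) record passes every enabled
   filter. A disabled filter never rejects. The time window is inclusive
   on both ends. */
static bool passes_filters(long ts_ms, int type,
                           bool filter_by_type, int type_eq,
                           bool filter_by_time, long t_start_ms, long t_end_ms)
{
    if (filter_by_type && type != type_eq) {
        return false;
    }
    if (filter_by_time && (ts_ms < t_start_ms || ts_ms > t_end_ms)) {
        return false;
    }
    return true;
}
```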
Platform assumption (checked at runtime):
`sizeof(int) == 4` and `sizeof(long) == 8`. If not, the function returns an error.
Problem Statement
Implement a function that:
- Opens input, output, and index files.
- Validates the header.
- Iterates records with `fread` in 24-byte chunks; discards invalid or out-of-filter records.
- Writes valid, selected records to the output file.
- Emits an ASCII index line for every K-th written record with its sequential record number (starting at 1 in the output stream) and byte offset within the output file (relative to its start).
- Returns counts: total read, valid, and written.
Requirements
- Allowed types only: `int`, `long`, `double`, `char`, `bool`, `enum`, plus pointers/arrays.
- Inputs:
  - `const char *in_path`
  - `const char *out_path`
  - `const char *idx_path`
  - `int version_expected`: expected version in header.
  - `int kth_for_index`: write an index entry every K-th written record (`kth_for_index ≥ 1`).
  - `bool filter_by_type`
  - `int type_eq`: only pass records whose `type == type_eq` when `filter_by_type == true`.
  - `bool filter_by_time`
  - `long t_start_ms`, `long t_end_ms`: inclusive window when `filter_by_time == true`.
- Outputs:
  - `int *out_total_read`: number of full 24-byte records read (including invalid).
  - `int *out_total_valid`: number of records that passed validation (before filters).
  - `int *out_total_written`: number of records written to output after filters.
- Behavior:
  - Header (16 bytes): read and check `magic == "LOG1"`, `version == version_expected`, and reserved 8 bytes all zero.
  - Each record: read exactly 24 bytes; if fewer bytes remain (EOF partial), stop without error.
  - Validate: `reserved == 0` and `checksum == ((int)((ts_ms & 0xFFFFFFFF) + type + value))`.
  - If `filter_by_type`, keep only `type == type_eq`.
  - If `filter_by_time`, keep only `t_start_ms ≤ ts_ms ≤ t_end_ms`.
  - For every kept record, write 24 bytes to `out_path`. For every K-th kept record, write `"rec=<n> off=<offset>\n"` to `idx_path`. In this task the output file has no header, so `offset = (long)24 * (n - 1)`. (If the output started with the same 16-byte header as the input, the offset would instead be `(long)16 + (long)24 * (n - 1)`.)
- Error handling:
  - Any invalid pointers, open failures, runtime `sizeof` mismatch, failed writes/reads (beyond EOF conditions), or invalid parameters → return `-1` (and do not modify the output counters).
  - On success, write all three output counters and return `0`.
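The index rule in the Behavior list (a line for every K-th written record, with a headerless offset) can be sketched as follows. `output_offset` and `maybe_write_index` are hypothetical helper names, and `n` is the 1-based count of records written so far.

```c
#include <stdio.h>

/* Byte offset of the n-th written record (n starts at 1) in an output
   file that has no header. */
static long output_offset(int n)
{
    return (long)24 * (n - 1);
}

/* Writes "rec=<n> off=<offset>\n" when n is a multiple of k.
   Returns 0 on success (including when nothing had to be written)
   and -1 if the write failed. Assumes k >= 1. */
static int maybe_write_index(FILE *idx, int n, int k)
{
    if (n % k != 0) {
        return 0;
    }
    if (fprintf(idx, "rec=%d off=%ld\n", n, output_offset(n)) < 0) {
        return -1;
    }
    return 0;
}
```

With `k = 2`, records 2 and 4 produce index lines with offsets 24 and 72 respectively.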
Function Details
- Name: `filter_compact_binary_log`
- Arguments:
  - `const char *in_path`
  - `const char *out_path`
  - `const char *idx_path`
  - `int version_expected`
  - `int kth_for_index`
  - `bool filter_by_type`
  - `int type_eq`
  - `bool filter_by_time`
  - `long t_start_ms`
  - `long t_end_ms`
  - `int *out_total_read`
  - `int *out_total_valid`
  - `int *out_total_written`
- Return Value:
  - `int`: `0` on success; `-1` on invalid input or any I/O failure.
- Description:
Stream the input log, record-by-record. Validate header and each record, apply filters, write selected records to the output file (no header), and emit sparse index lines to the index file on every K-th written record with the recorded output offset. The function must not leak resources; ensure all opened files are closed on every return path.
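One common way to close every opened file on every return path is a single cleanup label. The skeleton below is only a sketch of that pattern under the task's open modes; `open_process_close` is a hypothetical name, and the header check and record loop are left as a placeholder comment.

```c
#include <stdio.h>

/* Opens the three files, runs the (omitted) processing, and closes
   whatever was opened regardless of where a failure occurred.
   Returns 0 on success, -1 on any failure. */
static int open_process_close(const char *in_path, const char *out_path,
                              const char *idx_path)
{
    int rc = -1;
    FILE *in = NULL;
    FILE *out = NULL;
    FILE *idx = NULL;

    in = fopen(in_path, "rb");
    if (in == NULL) goto cleanup;
    out = fopen(out_path, "wb");
    if (out == NULL) goto cleanup;
    idx = fopen(idx_path, "wb");
    if (idx == NULL) goto cleanup;

    /* ... header validation and the record loop would go here;
       on any failure they would also `goto cleanup` with rc still -1 ... */

    rc = 0;

cleanup:
    /* fclose flushes buffered data, so a failed close of a written file
       is treated as a write failure. */
    if (idx != NULL && fclose(idx) != 0) rc = -1;
    if (out != NULL && fclose(out) != 0) rc = -1;
    if (in != NULL) fclose(in);
    return rc;
}
```

Because the input is opened first, a missing input file fails before the output and index files are created or truncated.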
Solution Approach
- Validate pointers and parameters (`kth_for_index ≥ 1`, and if `filter_by_time`, ensure `t_start_ms ≤ t_end_ms`).
- Check runtime sizes: `if (sizeof(int) != 4 || sizeof(long) != 8) return -1;`.
- Open `in_path` in `"rb"`, `out_path` in `"wb"`, and `idx_path` in `"wb"`.
- Read and validate header:
  - 4 bytes magic: must be `'L','O','G','1'`.
  - 4 bytes `int version`: equals `version_expected`.
  - 8 bytes reserved: all zero.
- Loop:
  - Read 24 bytes; if `fread` returns less than 24 due to EOF and no error, stop cleanly.
  - Unpack fields in host endianness (assume same writer/reader platform per teaching scope).
  - Validate `reserved == 0` and checksum rule.
  - If filters enabled, check `type` and `ts_ms` window.
  - If passing, `fwrite` the 24 bytes to output; increment written count.
  - If `(written % kth_for_index) == 0`, compute `offset = (long)24 * (written - 1)` and write an ASCII line to `idx_path`.
- On exit, set output counters, close all files, return `0`.
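The header step of this approach could look like the sketch below. `check_header` is a hypothetical helper name; it reads the version in host byte order, matching the same-platform assumption stated in the loop step.

```c
#include <stdio.h>
#include <string.h>

/* Reads the next 16 bytes of `in` and checks them against the header
   rules: magic "LOG1", version equal to version_expected, and 8 reserved
   zero bytes. Returns 0 if the header is valid, -1 otherwise (including
   a short or failed read). */
static int check_header(FILE *in, int version_expected)
{
    char hdr[16];
    int version;
    int i;

    if (fread(hdr, 1, sizeof hdr, in) != sizeof hdr) {
        return -1;
    }
    if (memcmp(hdr, "LOG1", 4) != 0) {
        return -1;
    }
    memcpy(&version, hdr + 4, sizeof version);
    if (version != version_expected) {
        return -1;
    }
    for (i = 8; i < 16; i++) {
        if (hdr[i] != 0) {
            return -1;
        }
    }
    return 0;
}
```

Comparing the magic with `memcmp` over exactly 4 bytes avoids relying on a terminating NUL, which the on-disk header does not contain.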
Tasks to Perform
- Validate all string and counter pointers are non-`NULL`. Validate `kth_for_index ≥ 1`. If `filter_by_time`, ensure `t_start_ms ≤ t_end_ms`.
- Verify runtime sizes: `sizeof(int)==4`, `sizeof(long)==8`; otherwise return `-1`.
- Open input/output/index files; on any open failure, return `-1`.
- Read and validate the 16-byte header (`"LOG1"`, matching `version_expected`, reserved zeros).
- Initialize counters: `total_read=0`, `total_valid=0`, `total_written=0`.
- While true:
  - Attempt to read 24 bytes. If fewer than 24 bytes were read and EOF was reached, break; if error, return `-1`.
  - Parse fields and validate `reserved` and `checksum`.
  - If valid, increment `total_valid`.
  - Apply filters; if accepted:
    - Write the raw 24 bytes to output; on failure, return `-1`.
    - Increment `total_written`. If `total_written % kth_for_index == 0`, write the ASCII index line with computed `offset`.
  - Increment `total_read` for each full 24-byte chunk consumed (regardless of validity).
- Flush and close all files. Set `*out_total_read`, `*out_total_valid`, `*out_total_written`. Return `0`.
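The read step of the loop has to tell a clean or truncated end of file apart from a real read error. A minimal sketch, assuming the stream is already positioned after the header; `count_full_records` is a hypothetical name, and validation, filtering, and writing are reduced to a placeholder comment.

```c
#include <stdio.h>

/* Consumes 24-byte records until end of file.
   Returns 0 on EOF (a partial trailing record is ignored) and -1 on a
   read error. *total_read receives the number of full records only, and
   is left untouched on failure. */
static int count_full_records(FILE *in, int *total_read)
{
    char rec[24];
    int n = 0;

    for (;;) {
        long got = (long)fread(rec, 1, sizeof rec, in);
        if (got == (long)sizeof rec) {
            /* ... validate, filter, and write the record here ... */
            n++;
            continue;
        }
        if (ferror(in)) {
            return -1;          /* genuine I/O failure */
        }
        break;                  /* EOF: got == 0, or a truncated tail */
    }
    *total_read = n;
    return 0;
}
```

Reading with an element size of 1 and a count of 24 makes `fread` report how many bytes of a truncated tail were present, and checking `ferror` after a short read is what separates "file ended" from "read failed".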
Test Cases
| # | Inputs / Precondition | Expected Output | Notes |
|---|---|---|---|
| 1 | Valid header; 3 valid records; no filters; `kth_for_index=2` | `ret=0`; `total_read=3`; `total_valid=3`; `total_written=3`; idx has one entry, for rec=2 (off=24) | Basic path |
| 2 | Valid header; records with some bad checksum/reserved | `ret=0`; `total_read=N`; `total_valid=M<N`; `total_written=M` | Invalid records dropped; counts reflect this |
| 3 | Type filter on `type_eq=5` | Only records where `type==5` are written; index lines at the 5th, 10th, … written record (assuming `kth_for_index=5`) | Filter by type |
| 4 | Time filter `[t_start_ms, t_end_ms]` | Only records within window kept | Filter by time |
| 5 | Partial trailing bytes (file truncated mid-record) | `ret=0`; `total_read` excludes partial; no error | Graceful EOF |
| 6 | Wrong header magic or version | `ret=-1` | Header validation fails |
| 7 | `kth_for_index=0` or `sizeof` mismatch at runtime | `ret=-1` | Parameter/runtime guard |
| 8 | Write failure (e.g., disk full) | `ret=-1` | Robust error propagation |
| 9 | Mixed valid records; check index offsets | Index lines: `rec=K off=24*(K-1)` with no header in output | Offset rule |
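To exercise the test cases above, an input file has to be produced first. The helper below is one possible fixture generator, not part of the required solution; `write_sample_log` and the field values it generates (`ts_ms = 1000 * i`, `type = i % 3`, `value = i`) are arbitrary choices for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Writes a 16-byte header followed by n valid 24-byte records in host
   byte order. Returns 0 on success, -1 on any open/write/close failure.
   Assumes sizeof(int) == 4 and sizeof(long) == 8. */
static int write_sample_log(const char *path, int version, int n)
{
    char hdr[16] = {0};
    char rec[24];
    int rc = 0;
    int i;
    FILE *f = fopen(path, "wb");

    if (f == NULL) {
        return -1;
    }
    memcpy(hdr, "LOG1", 4);
    memcpy(hdr + 4, &version, sizeof version);
    if (fwrite(hdr, 1, sizeof hdr, f) != sizeof hdr) {
        rc = -1;
    }
    for (i = 0; rc == 0 && i < n; i++) {
        long ts_ms = 1000L * i;
        int type = i % 3;
        int value = i;
        int checksum = (int)((ts_ms & 0xFFFFFFFF) + type + value);
        int reserved = 0;

        memset(rec, 0, sizeof rec);
        memcpy(rec + 0,  &ts_ms,    sizeof ts_ms);
        memcpy(rec + 8,  &type,     sizeof type);
        memcpy(rec + 12, &value,    sizeof value);
        memcpy(rec + 16, &checksum, sizeof checksum);
        memcpy(rec + 20, &reserved, sizeof reserved);
        if (fwrite(rec, 1, sizeof rec, f) != sizeof rec) {
            rc = -1;
        }
    }
    if (fclose(f) != 0) {
        rc = -1;
    }
    return rc;
}
```

A file written with `n = 3` is 16 + 3 × 24 = 88 bytes long. Appending a few stray bytes to it gives an input for the truncated-tail case, and overwriting one record's checksum or reserved field gives an input for the invalid-record case.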