What is the maximum file size I can upload to S3, GCS, or Azure Blob?

The limits vary by provider but are all in the terabyte range: - **AWS S3**: Maximum object size is 5 TB. Multipart upload supports up to 10,000 parts, with each part between 5 MB and 5 GB. So a 5 TB file would use 10,000 parts of ~500 MB each. - **Google Cloud Storage**: Maximum object size is 5 TB. Resumable uploads support objects up to 5 TB with no limit on the number of chunks. - **Azure Blob Storage**: Maximum block blob size is approximately 190.7 TB (with 50,000 blocks of up to 4,000 MB each). In practice, most uploads stay well under this limit. For ML workloads, the practical limit is usually network bandwidth and time, not the storage provider's size limit. A 100 GB dataset on a 100 Mbps upload connection takes about 2.2 hours -- on a 1 Gbps connection, about 13 minutes with parallel multipart upload.

Should I use presigned URLs or proxy the upload through my server?

**Use presigned URLs (or SAS tokens) for production systems.** The reasons are compelling: 1. **Scalability**: Your server handles only lightweight JSON requests, not file bytes. A single small API server can coordinate thousands of concurrent uploads. 2. **Cost**: No compute cost for the transfer -- the cloud provider handles it. 3. **Reliability**: Cloud storage upload endpoints are highly available. Your application server is an additional point of failure. 4. **Simplicity at scale**: No need to manage disk space, file descriptors, or upload timeouts on your server. Proxy upload is only appropriate for prototypes, small internal tools, or scenarios where you absolutely must inspect/transform file content during upload (rare in ML pipelines). If your files are under 10 MB and your upload volume is fewer than 100 per day, proxy upload is fine. Beyond that, switch to presigned URLs.

How do I handle file format detection -- should I trust the file extension?

**Never trust file extensions.** A file named `dataset.csv` could be a ZIP archive, a binary executable, or a corrupted JPEG. File extensions are a user-provided label with no technical enforcement. Instead, use **magic byte detection**. Every file format has characteristic bytes at the beginning (or specific offsets) that identify it: - **CSV**: No magic bytes, but starts with printable ASCII/UTF-8 characters (detect by exclusion) - **JSON**: Starts with `{`, `[`, or whitespace followed by one of these - **Parquet**: Magic bytes `PAR1` at the start and end of the file - **PNG**: `\x89PNG\r\n\x1a\n` - **JPEG**: `\xff\xd8\xff` - **ZIP**: `PK\x03\x04` - **gzip**: `\x1f\x8b` In Python, use `python-magic` (a libmagic wrapper) to detect format: `magic.Magic(mime=True).from_file('dataset.csv')` returns the actual MIME type. In your post-upload processor, always compare the detected format against the declared format and reject mismatches.

How does schema inference work for CSV files, and how reliable is it?

CSV schema inference works by **sampling rows and inferring column types** using a type lattice. Here is how most implementations (including PyArrow and pandas) handle it: 1. Read the first $n$ rows (typically 1,000-10,000) as strings 2. For each column, try to parse values in order of specificity: boolean -> integer -> float -> datetime -> string 3. The inferred type is the most specific type that successfully parses all sampled values 4. If any value is null/empty, the column is marked as nullable Reliability depends on sample size and data quality: - **High reliability**: Columns with consistent types across all rows. A column that's always integers will be correctly inferred as integer. - **Moderate reliability**: Columns where the type changes deep in the file. If row 1-1000 has integers but row 50,000 has `N/A`, sampling only 1,000 rows will miss this. Increase sample size or sample from multiple positions in the file. - **Low reliability**: Columns with ambiguous values. Is `01` an integer, a string, or a zero-padded code? Is `2024-01-15` a date or a string? These require explicit schema declarations. For critical ML pipelines, **never rely solely on inference**. Use inferred schemas as a starting point, then register and enforce explicit schemas. Parquet files avoid this problem entirely because the schema is embedded in the file.

What is the tus protocol and when should I use it instead of S3 multipart upload?

The **tus** protocol (https://tus.io) is an open standard for resumable file uploads over HTTP. Unlike S3 multipart upload, which is AWS-specific, tus is cloud-agnostic and works with any backend that implements the protocol. **How it works**: The client creates an upload resource on the server, then sends file data in chunks via PATCH requests. The server tracks the upload offset. If the connection drops, the client asks the server for the current offset and resumes from that point. **Use tus when**: - You want cloud-agnostic upload that works with S3, GCS, Azure, or local storage - You need a standard protocol that client libraries exist for (JavaScript, Python, Java, Go, iOS, Android) - You are building a user-facing upload UI and want to use Uppy (which has first-class tus integration) - Your upload server needs to perform custom validation or transformation during upload **Use S3 multipart upload when**: - You are committed to AWS and want to minimize moving parts - You need maximum throughput (direct to S3 without an intermediary server) - You want to use presigned URLs for direct browser-to-S3 upload In practice, many teams use **tus for user-facing uploads** (web UI, annotation platform) and **S3 multipart for programmatic bulk uploads** (CLI tools, automated data pipelines).

How should I handle virus scanning for uploaded ML datasets?

Virus scanning is a non-negotiable step in any file upload pipeline, even for ML datasets from trusted internal sources. Here is the standard approach: **Architecture**: Deploy ClamAV as a daemon (`clamd`) in a Docker container. Your post-upload processor sends files to ClamAV for scanning. Clean files are moved to the approved storage bucket; infected files are quarantined in a separate, access-restricted bucket. **Timing**: Always scan **before** any processing. The file should move through three states: `uploaded` (raw, in staging bucket) -> `scanned` (ClamAV cleared) -> `processed` (schema inferred, metadata recorded, ready for pipeline). Never let unscanned files reach the processing pipeline. **Performance**: ClamAV processes files at approximately 25-50 MB/s per core. For a 1 GB CSV, that is 20-40 seconds. For bulk uploads, run multiple scanner workers in parallel. Cloud-native alternatives like Microsoft Defender for Storage and AWS GuardDuty S3 Protection provide managed scanning without self-hosting ClamAV. **Special considerations for ML**: Beyond traditional malware, ML pipelines face unique threats. Python **pickle files** can execute arbitrary code when loaded -- never accept pickle uploads from external users. Serialized model files (`.pt`, `.h5`) can also contain malicious payloads. For these formats, use safe loading modes (`torch.load(map_location='cpu', weights_only=True)`) and consider sandboxed execution environments. **Cost**: ClamAV is free and open-source. A dedicated scanning instance (c5.large on AWS) costs approximately $62/month (~INR 5,200/month) and can handle about 2 TB/day of scanning throughput.

How do I handle file uploads in a multi-tenant ML platform?

Multi-tenancy in file upload requires isolation at three levels: **storage**, **processing**, and **metadata**. **Storage isolation**: Use per-tenant prefixes in your cloud storage bucket: `s3://ml-uploads/{tenant_id}/{upload_id}/{filename}`. This allows you to set per-tenant IAM policies, storage quotas, and lifecycle rules. For strict isolation requirements (e.g., healthcare or financial data), use separate buckets per tenant. **Processing isolation**: Each uploaded file must carry tenant context through the entire post-processing pipeline. Presigned URLs should encode the tenant ID in the object key. Post-upload processors should validate that the file was uploaded to the correct tenant path. Schema registries should be tenant-scoped so that tenant A's CSV schema doesn't interfere with tenant B's. **Metadata isolation**: Upload records in your metadata store must include tenant ID as a partition key. Queries for listing uploads, checking quotas, and retrieving schemas must always filter by tenant. This is straightforward with PostgreSQL row-level security or a tenant ID column with enforced filtering. **Quotas and rate limiting**: Each tenant should have configurable limits on: storage volume (GB), upload frequency (uploads per hour), file size (max MB per file), and concurrent uploads. These prevent a single tenant from monopolizing upload resources. For an Indian SaaS platform, a typical free tier might be 5 GB storage / 100 uploads per month, while an enterprise tier might be 1 TB / unlimited uploads.

What are the cost implications of file upload at scale in Indian cloud regions?

Let us break down the costs for a file upload pipeline deployed in AWS Mumbai (ap-south-1), which is the most common choice for India-based ML platforms: **Storage costs (S3 Standard)**: - $0.025/GB/month (~INR 2.1/GB/month) for the first 50 TB - A team uploading 100 GB/day accumulates ~3 TB/month, costing ~$75/month (~INR 6,300/month) **Transfer costs**: - Data transfer IN to S3 is **free** (uploads cost nothing for transfer) - Data transfer OUT (for processing) within the same region is **free** - Cross-region transfer: $0.086/GB (~INR 7.2/GB) **API request costs**: - PUT requests (uploads): $0.005 per 1,000 requests - Multipart upload with 16 MB chunks for a 100 GB file = ~6,250 PUT requests = ~$0.03 **Post-processing compute**: - A c5.xlarge for virus scanning + schema inference: ~$0.17/hour (~INR 14.3/hour) - Processing 100 GB/day requires roughly 2-3 hours of c5.xlarge time: ~$0.50/day (~INR 42/day) **Total estimated cost for 100 GB/day**: - Storage: INR 6,300/month - Compute: INR 1,260/month - API requests: negligible - **Total: ~INR 7,560/month (~$90/month)** This is remarkably cost-effective. The biggest cost risk is **storage accumulation** -- if you never delete old uploads, costs grow linearly forever. Implement lifecycle policies (move to S3 Glacier after 90 days, delete after 1 year) to keep long-term costs manageable.

Data Ingestion

File Upload in Machine Learning

Every machine learning system starts with data, and the vast majority of that data arrives as files -- CSV exports from business databases, JSON dumps from APIs, Parquet snapshots from data warehouses, and image archives from annotation platforms. The file upload block is the front door of your ML pipeline: it governs how raw data physically enters your system from the outside world.

File upload sounds deceptively simple -- accept a file, write it to storage, done. But at production scale, it becomes a rich engineering problem. A 50 GB training dataset cannot travel over a single HTTP connection without corruption risk. A user uploading 100,000 labeled images needs resumability so that a network blip at image 87,432 doesn't force a restart. A CSV with inconsistent column types needs schema inference before downstream transforms can trust it. And every uploaded file is a potential attack vector that must be scanned before it touches your processing pipeline.

This guide covers the complete lifecycle of file upload in ML systems: from the browser or CLI initiating the transfer, through multipart chunking and cloud storage integration, to format detection, schema inference, virus scanning, and handoff to downstream processing blocks. Whether you are building a data labeling platform in Bengaluru or a training data pipeline at a hyperscaler, the patterns here apply universally.

Concept Snapshot

What It Is: A system component that accepts data files from external sources, transfers them reliably into cloud or local storage, validates their integrity and format, and makes them available for downstream ML pipeline processing.
Category: Data Ingestion
Complexity: Intermediate
Inputs / Outputs: Inputs: raw files (CSV, JSON, Parquet, images, audio, video) from browsers, CLIs, or automated systems. Outputs: validated files in object storage with metadata records (file size, format, schema, checksum, upload status).
System Placement: Sits at the very beginning of the ML pipeline, upstream of data validation, document loaders, and feature engineering. It is the first point of contact between external data and your system.
Also Known As: data upload, file ingestion, bulk data import, dataset upload, data drop zone
Typical Users: ML Engineers, Data Scientists, Data Annotators, Platform Engineers, Backend Engineers
Prerequisites: HTTP protocol basics, Cloud object storage (S3, GCS, Azure Blob), File formats (CSV, JSON, Parquet), Basic security concepts
Key Terms: multipart uploadchunked transferresumable uploadpresigned URLschema inferencecontent-type detectionchecksum verificationvirus scanningSAS token

Why This Concept Exists

The Problem: Data Doesn't Magically Appear in Your Pipeline

ML practitioners spend an enormous amount of time acquiring and preparing data. According to a widely cited survey, data scientists spend 60-80% of their time on data preparation, and a significant chunk of that begins with simply getting data into the system. Without a robust file upload mechanism, teams resort to ad-hoc solutions -- SCP-ing files to servers, emailing CSVs, sharing Google Drive links -- none of which provide the auditability, validation, or reliability that production ML systems demand.

The core problem is deceptively layered. First, there is the transport problem: files range from kilobytes to terabytes, and HTTP was not designed for multi-gigabyte transfers over unreliable networks. A single dropped connection on a 20 GB upload means starting over unless you have chunking and resumability. Second, there is the trust problem: every uploaded file could contain malware, corrupted data, or schema violations that will silently poison your training pipeline. Third, there is the metadata problem: an uploaded file is useless without context -- who uploaded it, when, what format, what schema, which project it belongs to.

Historical Context: From FTP to Cloud-Native Upload

In the early days, data ingestion meant FTP servers and batch file transfers scheduled via cron jobs. Data engineers would drop files into designated directories and downstream scripts would pick them up. This worked for small-scale analytics but broke down as ML systems demanded higher throughput, real-time processing, and multi-tenant isolation.

The shift to cloud object storage (AWS S3 in 2006, Google Cloud Storage in 2010, Azure Blob Storage in 2010) changed the game fundamentally. Instead of managing disk space on physical servers, teams could upload directly to virtually unlimited storage. But this introduced new challenges: managing credentials, handling multipart uploads for large files, and ensuring consistency in distributed storage.

The Modern Pattern: Presigned URLs and Direct Upload

Today's best practice is the presigned URL pattern (or SAS token in Azure). The client requests a time-limited, pre-authorized URL from the backend, then uploads directly to cloud storage without the file ever touching your application server. This eliminates the application server as a bottleneck, reduces bandwidth costs, and scales horizontally by default because the cloud provider handles the heavy lifting.

Key Insight: A well-designed file upload system is not just a file transfer mechanism -- it is the trust boundary between the outside world and your ML pipeline. Everything downstream assumes the upload block has done its job: the file is intact, the format is valid, the content is safe, and the metadata is recorded.

Core Intuition & Mental Model

The Mental Model: A Secure Mailroom

Think of the file upload block as the mailroom of a large organization. Packages (files) arrive from the outside world. The mailroom's job is to: (1) accept the delivery reliably, even if the courier needs multiple trips for a large shipment, (2) scan for hazards before letting anything into the building, (3) log what arrived, from whom, and when, (4) route the package to the correct department. A mailroom that just dumps packages in a pile is technically accepting deliveries but is failing at its actual job.

Similarly, a file upload endpoint that simply writes bytes to disk without validation, scanning, or metadata tracking is technically functional but operationally useless. The real value is in the guarantees it provides to everything downstream.

Why Chunking Changes Everything

Here is the single most important architectural insight: never upload large files as a single HTTP request. Instead, split the file into chunks (typically 5-100 MB each) and upload them independently. This gives you three superpowers:

Resumability -- if chunk 47 of 200 fails, retry only chunk 47, not the entire file
Parallelism -- upload multiple chunks simultaneously, saturating your bandwidth
Progress tracking -- show the user a meaningful progress bar, not a spinning wheel

AWS S3 multipart upload, for example, splits files into parts of 5 MB to 5 GB each, supports up to 10,000 parts, and allows parallel upload of parts. A 50 GB file that takes 45 minutes as a single upload can complete in under 10 minutes with parallel multipart upload -- roughly a 4x speedup.

The Trust Boundary Principle

Every file upload creates a trust boundary crossing. Data produced by external users, third-party systems, or even internal teams in different departments must be treated as untrusted until validated. This means: verify checksums to ensure integrity, detect the actual file format (not just the extension), scan for malware, validate the schema against expectations, and only then mark the file as ready for processing. Skipping any of these steps is how data poisoning attacks and silent pipeline failures happen.

Technical Foundations

Transfer Throughput Model

The effective throughput of a file upload can be modeled as:

$T_{\text{eff}} = \frac{S}{\frac{S}{B \cdot P} + L_{\text{init}} + N_{\text{chunks}} \cdot L_{\text{per-chunk}}}$

where $S$ is the total file size in bytes, $B$ is the available bandwidth per connection in bytes/sec, $P$ is the number of parallel upload connections (capped by the client's upload bandwidth and server-side concurrency limits), $L_{\text{init}}$ is the initial handshake latency (presigned URL generation, multipart upload initiation), $N_{\text{chunks}} = \lceil S / C \rceil$ is the number of chunks for chunk size $C$ , and $L_{\text{per-chunk}}$ is the per-chunk overhead (HTTP headers, checksum verification).

Optimal Chunk Size

The optimal chunk size $C^*$ balances two competing forces:

$C^* = \arg\min_C \left[ \frac{S}{B \cdot \min(P, \lceil S/C \rceil)} + \lceil S/C \rceil \cdot L_{\text{per-chunk}} + p_{\text{fail}} \cdot \frac{C}{B} \right]$

The first term is the parallel transfer time, the second is the total per-chunk overhead, and the third is the expected retry cost where $p_{\text{fail}}$ is the per-chunk failure probability. Smaller chunks reduce retry cost but increase overhead; larger chunks reduce overhead but increase retry cost.

In practice, chunk sizes of 16-64 MB are optimal for most network conditions. AWS recommends a minimum of 5 MB per part, and empirical benchmarks show that 16 MB chunks with 4-8 parallel connections maximize throughput on typical cloud networks.

Checksum Verification

Data integrity is verified using checksums at multiple levels:

Per-chunk checksum: Each chunk is hashed (typically MD5 or SHA-256) before upload. The server verifies the hash upon receipt: $H(\text{chunk}_i) = h_i$ .
Whole-file checksum: After all chunks are assembled, the complete file hash is verified: $H(\text{file}) = h_{\text{expected}}$ .
ETag verification: S3 multipart uploads produce a composite ETag: MD5(MD5(part1) || MD5(part2) || ... || MD5(partN))-N.

Schema Inference Complexity

For tabular files (CSV, JSON, Parquet), schema inference involves sampling $m$ rows and inferring the type of each column $j$ :

$\text{type}(j) = \text{MostSpecificType}\left(\bigcup_{i=1}^{m} \text{InferType}(\text{row}_i[j])\right)$

The type lattice follows: null < bool < int < float < string, where each type subsumes all types above it. If row 1 has an integer in column $j$ and row 500 has a float, the inferred type is float. Parquet files embed their schema in the file footer, so inference is $O(1)$ -- just read the metadata. CSV files require scanning, making inference $O(m \cdot d)$ where $d$ is the number of columns.

Internal Architecture

A production file upload system consists of five layers: a client-side upload manager that handles chunking, retry logic, and progress tracking; an API gateway that authenticates requests and generates presigned URLs; a cloud storage backend that receives and persists file chunks; a post-upload processor that validates, scans, and extracts metadata; and a metadata store that records upload state, schema information, and lineage.

File Upload in ML Systems Architecture — A directed flow starting from 'Client (Browser/CLI)' which requests a presigned URL from 'API Gat...

The key architectural decision is the direct-to-storage upload pattern. The client never sends file bytes through the application server. Instead, the API server acts as an authorization and coordination layer that issues presigned URLs, tracks upload progress, and triggers post-upload processing. This means your API server can be a tiny instance -- it handles only lightweight JSON requests, never gigabytes of file data.

The post-upload processor is triggered by cloud storage events (S3 event notifications, GCS Pub/Sub triggers, or Azure Event Grid). It runs as a serverless function or background worker that performs virus scanning, format validation, schema inference, and metadata extraction. Only after all checks pass is the file marked as "ready" and made available to downstream pipeline stages.

Key Components

Client Upload Manager

Runs in the browser or CLI. Splits files into chunks, manages parallel uploads, tracks progress, handles retries on failure, and computes per-chunk checksums. Libraries like Uppy (browser), tus-js-client, or boto3 (Python CLI) implement this logic.

API Gateway / Upload Coordinator

Authenticates the upload request, validates file metadata (name, expected size, format), generates presigned URLs or SAS tokens for direct cloud storage upload, tracks multipart upload state (initiated, in-progress, completed, failed), and triggers post-upload processing upon completion.

Cloud Object Storage

The actual storage backend (AWS S3, Google Cloud Storage, or Azure Blob Storage). Receives file chunks via presigned URLs, assembles them into complete objects, and emits upload completion events. Provides durability (11 nines for S3), versioning, and lifecycle policies.

Post-Upload Processor

Triggered by storage events. Performs: (1) virus/malware scanning using ClamAV or cloud-native scanners, (2) file format detection using magic bytes, (3) schema inference for tabular files, (4) checksum verification against client-provided hashes, (5) thumbnail generation for images. Quarantines files that fail any check.

Metadata Store

Records upload metadata: file ID, original filename, size, format, schema, uploader identity, upload timestamp, checksum, scan status, and processing status. Typically backed by a relational database (PostgreSQL) or document store. This is the source of truth for what data exists in the system.

Virus Scanner

Scans uploaded files for known malware signatures before they enter the processing pipeline. ClamAV is the standard open-source option. Cloud providers offer managed alternatives: AWS uses Amazon Macie and GuardDuty, Azure provides Microsoft Defender for Storage. Files are quarantined in a separate bucket until cleared.

Schema Registry

Stores and versions the inferred or declared schemas for uploaded datasets. When a new CSV is uploaded, its inferred schema is compared against the expected schema for that dataset. Schema drift (new columns, type changes) triggers alerts or automatic migration. Often implemented using Apache Avro schema registry or a custom metadata table.

Data Flow

Here is the complete data flow for a file upload:

Step 1 - Initiation: The client sends an upload request to the API gateway with file metadata (name, size, content type, target dataset). The gateway authenticates the user, validates quotas, and creates an upload record in the metadata store.

Step 2 - Presigned URL Generation: The gateway generates presigned URLs (one per chunk for multipart upload) with a configurable expiry (typically 1-4 hours). For S3, this uses create_multipart_upload + generate_presigned_url for each part. For Azure, a SAS token with write-only permissions is generated.

Step 3 - Direct Upload: The client uploads chunks directly to cloud storage using the presigned URLs. Chunks can be uploaded in parallel (4-8 concurrent connections is typical). Each chunk includes a Content-MD5 header for integrity verification.

Step 4 - Completion: After all chunks are uploaded, the client notifies the API gateway, which calls complete_multipart_upload on S3 (or commits the block list on Azure). The storage backend assembles the final object.

Step 5 - Post-Processing: The storage completion event triggers the post-upload processor. It downloads the file to a temporary workspace, runs virus scanning, detects the format, infers the schema (for tabular data), and updates the metadata store.

Step 6 - Handoff: If all checks pass, the file status is updated to "ready" and a notification is sent to downstream pipeline stages (data validation, document loader, or feature engineering). If any check fails, the file is quarantined and the uploader is notified.

A directed flow starting from 'Client (Browser/CLI)' which requests a presigned URL from 'API Gateway', receives the URL, then uploads chunks directly to 'Cloud Storage (S3/GCS/Azure Blob)'. Cloud Storage emits an event to 'Post-Upload Processor' which fans out to 'Virus Scanner', 'Schema Registry', and 'Metadata Store'. After processing, a ready notification flows to 'Downstream Pipeline'.

How to Implement

Two Core Implementation Patterns

File upload implementations split into two camps based on where the file bytes flow:

Pattern A: Proxy Upload -- the file passes through your application server, which then writes it to storage. Simple to implement, works for small files (<100 MB), but creates a bottleneck because every byte of every upload consumes your server's CPU, memory, and bandwidth. A single 10 GB upload can saturate a small EC2 instance. This pattern is appropriate only for prototypes or low-volume internal tools.

Pattern B: Direct-to-Storage Upload -- the client uploads directly to cloud storage using presigned URLs or SAS tokens. Your application server only handles lightweight API calls for authorization and coordination. This scales horizontally by default because the cloud provider absorbs the transfer bandwidth. This is the production-grade pattern used by virtually every serious data platform.

For ML-specific workloads, there is an additional consideration: bulk upload. When a data scientist needs to upload 500,000 labeled images or a 200 GB Parquet dataset, neither the browser nor a single HTTP connection is practical. The implementation must support CLI-based bulk upload with parallel multipart transfers, progress checkpointing, and the ability to resume from where it left off after interruption.

Cost Note: Direct-to-S3 upload is essentially free -- you pay only for storage ( $0.023/GB/month, ~INR 1.9/GB/month for S3 Standard). Proxy upload through an EC2 instance adds compute cost: a c5.xlarge instance running as an upload proxy costs ~$ 0.17/hour (~INR 14.3/hour), and handling just 10 concurrent 10 GB uploads would require a much larger instance. At scale, the cost difference is dramatic.

S3 Multipart Upload with Presigned URLs (Python Backend + Client)93 lines

import boto3
import hashlib
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

# === Backend: Generate presigned URLs for multipart upload ===

def initiate_multipart_upload(bucket: str, key: str) -> dict:
    """Initiate a multipart upload and return upload_id + presigned URLs."""
    s3 = boto3.client('s3')
    
    # Start the multipart upload
    response = s3.create_multipart_upload(
        Bucket=bucket,
        Key=key,
        ContentType='application/octet-stream',
        Metadata={'uploaded-by': 'ml-pipeline', 'status': 'uploading'}
    )
    upload_id = response['UploadId']
    return {'upload_id': upload_id, 'bucket': bucket, 'key': key}


def generate_part_urls(bucket: str, key: str, upload_id: str, num_parts: int,
                       expiry: int = 3600) -> list[dict]:
    """Generate presigned URLs for each part."""
    s3 = boto3.client('s3')
    part_urls = []
    for part_number in range(1, num_parts + 1):
        url = s3.generate_presigned_url(
            'upload_part',
            Params={
                'Bucket': bucket,
                'Key': key,
                'UploadId': upload_id,
                'PartNumber': part_number,
            },
            ExpiresIn=expiry,
        )
        part_urls.append({'part_number': part_number, 'url': url})
    return part_urls


def complete_multipart_upload(bucket: str, key: str, upload_id: str,
                              parts: list[dict]) -> dict:
    """Complete the multipart upload after all parts are uploaded."""
    s3 = boto3.client('s3')
    response = s3.complete_multipart_upload(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        MultipartUpload={'Parts': parts},  # [{'PartNumber': 1, 'ETag': '...'}]
    )
    return {'location': response['Location'], 'etag': response['ETag']}


# === Client: Upload file chunks in parallel ===

def upload_file_multipart(file_path: str, part_urls: list[dict],
                          chunk_size: int = 16 * 1024 * 1024,
                          max_workers: int = 4) -> list[dict]:
    """Upload a file in parallel chunks using presigned URLs."""
    parts = []
    
    def upload_part(part_info: dict, data: bytes) -> dict:
        md5 = hashlib.md5(data).digest()
        response = requests.put(
            part_info['url'],
            data=data,
            headers={'Content-MD5': __import__('base64').b64encode(md5).decode()},
        )
        response.raise_for_status()
        return {
            'PartNumber': part_info['part_number'],
            'ETag': response.headers['ETag'],
        }
    
    with open(file_path, 'rb') as f:
        chunks = []
        for part_info in part_urls:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunks.append((part_info, chunk))
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(upload_part, part_info, data): part_info
            for part_info, data in chunks
        }
        for future in as_completed(futures):
            parts.append(future.result())
    
    return sorted(parts, key=lambda p: p['PartNumber'])

This example shows the complete server-client interaction for S3 multipart upload. The backend initiates the upload and generates presigned URLs -- one per chunk. The client reads the file in chunks (16 MB default), computes an MD5 checksum for each chunk, and uploads them in parallel using a thread pool. Each chunk is independently retryable. After all parts are uploaded, the backend calls complete_multipart_upload to assemble the final object. This pattern keeps file bytes off your application server entirely.

Resumable Upload with tus Protocol (Python Client)49 lines

import os
import hashlib
from tusclient import client as tus_client

def resumable_upload(file_path: str, upload_url: str,
                     chunk_size: int = 32 * 1024 * 1024) -> str:
    """Upload a file using the tus resumable upload protocol.
    
    If interrupted, calling this again with the same file will
    resume from where it left off.
    """
    file_size = os.path.getsize(file_path)
    file_hash = hashlib.sha256(open(file_path, 'rb').read(8192)).hexdigest()
    
    # Create tus client
    my_client = tus_client.TusClient(upload_url)
    
    # Create uploader with fingerprint for resume capability
    uploader = my_client.uploader(
        file_path,
        chunk_size=chunk_size,
        metadata={
            'filename': os.path.basename(file_path),
            'filetype': _detect_mime_type(file_path),
            'sha256_prefix': file_hash,
        },
        # Store upload URL locally for resumption
        store_url=True,
        url_storage=tus_client.fingerprint.Fingerprint(),
    )
    
    # Upload with automatic resume
    uploader.upload()
    return uploader.url


def _detect_mime_type(file_path: str) -> str:
    """Detect MIME type from file magic bytes."""
    import magic
    mime = magic.Magic(mime=True)
    return mime.from_file(file_path)


# Usage
upload_location = resumable_upload(
    file_path='/data/training_images.tar.gz',
    upload_url='https://upload.example.com/files/',
    chunk_size=64 * 1024 * 1024,  # 64 MB chunks
)

The tus protocol is an open standard for resumable file uploads over HTTP. Unlike S3 multipart upload which is cloud-specific, tus is cloud-agnostic and works with any backend that implements the protocol. The key feature is automatic resume: if the upload is interrupted (network drop, process crash, laptop closing), calling the upload function again with the same file automatically detects the previously uploaded offset and continues from there. The tusd server can be configured to store files on S3, GCS, or Azure Blob behind the scenes.

Post-Upload Schema Inference for CSV and Parquet Files117 lines

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
import json
from pathlib import Path


def infer_schema(file_path: str, sample_rows: int = 10000) -> dict:
    """Infer schema from uploaded file (CSV, Parquet, or JSON).
    
    Returns a schema dict with column names, types, nullability,
    and basic statistics.
    """
    path = Path(file_path)
    ext = path.suffix.lower()
    
    if ext == '.parquet':
        return _infer_parquet_schema(file_path)
    elif ext == '.csv':
        return _infer_csv_schema(file_path, sample_rows)
    elif ext in ('.json', '.jsonl'):
        return _infer_json_schema(file_path, sample_rows)
    else:
        return {'format': ext, 'schema': None, 'error': 'Unsupported format'}


def _infer_parquet_schema(file_path: str) -> dict:
    """Parquet stores schema in footer -- O(1) inference."""
    pf = pq.ParquetFile(file_path)
    schema = pf.schema_arrow
    metadata = pf.metadata
    
    columns = []
    for i in range(len(schema)):
        field = schema.field(i)
        columns.append({
            'name': field.name,
            'type': str(field.type),
            'nullable': field.nullable,
        })
    
    return {
        'format': 'parquet',
        'num_rows': metadata.num_rows,
        'num_columns': len(schema),
        'num_row_groups': metadata.num_row_groups,
        'serialized_size_bytes': metadata.serialized_size,
        'columns': columns,
        'created_by': metadata.created_by,
    }


def _infer_csv_schema(file_path: str, sample_rows: int) -> dict:
    """CSV requires sampling rows for type inference."""
    read_options = csv.ReadOptions(block_size=1024 * 1024)  # 1 MB blocks
    convert_options = csv.ConvertOptions(
        auto_dict_encode=True,
        auto_dict_max_cardinality=50,
    )
    
    # Read a sample for inference
    table = csv.read_csv(
        file_path,
        read_options=read_options,
        convert_options=convert_options,
    )
    
    # Sample if the file is large
    if table.num_rows > sample_rows:
        table = table.slice(0, sample_rows)
    
    columns = []
    for i, name in enumerate(table.column_names):
        col = table.column(i)
        columns.append({
            'name': name,
            'type': str(col.type),
            'nullable': col.null_count > 0,
            'null_count': col.null_count,
            'null_fraction': round(col.null_count / len(col), 4),
        })
    
    return {
        'format': 'csv',
        'num_rows_sampled': table.num_rows,
        'num_columns': len(table.column_names),
        'columns': columns,
    }


def _infer_json_schema(file_path: str, sample_rows: int) -> dict:
    """Infer schema from JSON/JSONL by sampling records."""
    table = pa.json_.read_json(file_path)
    
    if table.num_rows > sample_rows:
        table = table.slice(0, sample_rows)
    
    columns = []
    for i, name in enumerate(table.column_names):
        col = table.column(i)
        columns.append({
            'name': name,
            'type': str(col.type),
            'nullable': col.null_count > 0,
        })
    
    return {
        'format': 'json',
        'num_rows_sampled': table.num_rows,
        'num_columns': len(table.column_names),
        'columns': columns,
    }


# Usage
schema_info = infer_schema('/uploads/training_data.parquet')
print(json.dumps(schema_info, indent=2))

This example demonstrates format-aware schema inference. Parquet files embed their schema in the file footer, so inference is instant -- just read the metadata without scanning any data rows. CSV files require sampling rows and inferring types using PyArrow's built-in type inference (which follows the type lattice: null < bool < int < float < string). JSON/JSONL files are parsed and the schema is inferred from the structure of sampled records. The output includes column names, data types, nullability, and basic statistics -- exactly what downstream data validation and feature engineering blocks need.

Virus Scanning with ClamAV for Uploaded Files97 lines

import clamd
import logging
from enum import Enum
from dataclasses import dataclass

logger = logging.getLogger(__name__)


class ScanResult(Enum):
    CLEAN = 'clean'
    INFECTED = 'infected'
    ERROR = 'error'


@dataclass
class FileScanReport:
    file_path: str
    result: ScanResult
    threat_name: str | None = None
    error_message: str | None = None


class FileScanner:
    """Scan uploaded files for malware using ClamAV daemon."""
    
    def __init__(self, host: str = 'localhost', port: int = 3310):
        self.scanner = clamd.ClamdNetworkSocket(host=host, port=port)
        # Verify ClamAV daemon is running
        try:
            self.scanner.ping()
        except clamd.ConnectionError:
            raise RuntimeError(
                'ClamAV daemon is not reachable. '
                'Ensure clamd is running on {host}:{port}'
            )
    
    def scan_file(self, file_path: str) -> FileScanReport:
        """Scan a single file and return the result."""
        try:
            result = self.scanner.scan(file_path)
            if result is None:
                return FileScanReport(
                    file_path=file_path,
                    result=ScanResult.CLEAN
                )
            
            # ClamAV returns: {'/path/to/file': ('FOUND', 'Malware.Name')}
            status, threat = result[file_path]
            if status == 'FOUND':
                logger.warning(f'Malware detected in {file_path}: {threat}')
                return FileScanReport(
                    file_path=file_path,
                    result=ScanResult.INFECTED,
                    threat_name=threat,
                )
            return FileScanReport(
                file_path=file_path,
                result=ScanResult.CLEAN
            )
        except Exception as e:
            logger.error(f'Scan error for {file_path}: {e}')
            return FileScanReport(
                file_path=file_path,
                result=ScanResult.ERROR,
                error_message=str(e),
            )
    
    def scan_stream(self, file_stream) -> FileScanReport:
        """Scan a file stream without writing to disk."""
        try:
            result = self.scanner.instream(file_stream)
            status, threat = result['stream']
            if status == 'FOUND':
                return FileScanReport(
                    file_path='<stream>',
                    result=ScanResult.INFECTED,
                    threat_name=threat,
                )
            return FileScanReport(
                file_path='<stream>',
                result=ScanResult.CLEAN
            )
        except Exception as e:
            return FileScanReport(
                file_path='<stream>',
                result=ScanResult.ERROR,
                error_message=str(e),
            )


# Usage
scanner = FileScanner(host='clamav-service', port=3310)
report = scanner.scan_file('/tmp/uploads/user_dataset.csv')
if report.result == ScanResult.INFECTED:
    # Quarantine the file
    move_to_quarantine(report.file_path)
    notify_security_team(report)

This example wraps the ClamAV daemon client (pyclamd) in a clean interface for scanning uploaded files. ClamAV runs as a separate daemon (clamd) -- typically in a Docker container alongside your post-upload processor. Files can be scanned either from disk or as a stream (useful when you want to scan before writing to permanent storage). The scan result is one of CLEAN, INFECTED, or ERROR. Infected files must be quarantined immediately -- moved to a separate storage bucket that is not accessible to the processing pipeline. The ClamAV virus database should be updated daily via freshclam.

Configuration Example49 lines

# File Upload Service Configuration (YAML)
upload:
  max_file_size_gb: 100
  max_files_per_upload: 1000
  allowed_formats:
    - csv
    - json
    - jsonl
    - parquet
    - avro
    - png
    - jpg
    - jpeg
    - tiff
    - wav
    - mp3
  chunk_size_mb: 16
  max_parallel_chunks: 8
  presigned_url_expiry_seconds: 3600
  checksum_algorithm: sha256

storage:
  provider: s3  # s3 | gcs | azure_blob
  bucket: ml-data-uploads
  region: ap-south-1  # Mumbai for India-based deployments
  quarantine_bucket: ml-data-quarantine
  lifecycle:
    incomplete_upload_cleanup_days: 7
    quarantine_retention_days: 30

scanning:
  enabled: true
  engine: clamav
  clamav_host: clamav-service
  clamav_port: 3310
  max_scan_size_gb: 10
  timeout_seconds: 300

schema_inference:
  enabled: true
  sample_rows: 10000
  max_columns: 5000
  type_coercion: true
  register_schema: true

quotas:
  per_user_daily_gb: 500
  per_upload_max_gb: 100
  concurrent_uploads_per_user: 5

Common Implementation Mistakes

●
Proxy upload at scale: Routing file bytes through the application server instead of using presigned URLs. This creates a bandwidth bottleneck, increases server costs, and adds a single point of failure. A single m5.xlarge handling proxy uploads can only sustain ~125 MB/s -- that's 8 concurrent 1 Gbps uploads. Use direct-to-storage upload with presigned URLs.
●
No checksum verification: Accepting uploaded files without verifying checksums. Network corruption, truncated transfers, and even man-in-the-middle attacks can silently alter file content. Always compute and verify Content-MD5 or SHA-256 at both the chunk and whole-file level.
●
Trusting file extensions: Using the file extension (.csv, .json) to determine format instead of inspecting magic bytes. A file named training_data.csv could actually be a ZIP archive, an executable, or a malformed binary. Always use content-type detection libraries like python-magic or Apache Tika.
●
Single-connection upload for large files: Uploading a 50 GB dataset as a single HTTP request. Any network interruption forces a complete restart. Use multipart/chunked upload with a minimum chunk size of 5 MB and parallel connections. The retry cost of restarting a 50 GB upload is enormous -- potentially hours of wasted time and bandwidth.
●
Skipping virus scanning: Assuming uploaded files from trusted users are safe. Even files from internal teams can be unintentionally infected. A malicious CSV with embedded macros or a poisoned pickle file can compromise your entire processing pipeline. Always scan before processing.
●
No upload size limits: Allowing unlimited file sizes without quotas. A single user uploading a 5 TB file can exhaust storage budgets or overwhelm post-processing workers. Implement per-user and per-upload size limits, and enforce them at the presigned URL level (S3 supports upload conditions).
●
Ignoring schema drift: Not comparing the inferred schema of a new upload against the expected schema for that dataset. If a training CSV suddenly has a new column or a changed data type, downstream feature engineering will break silently. Always validate schemas against a registry.

When Should You Use This?

Use When

Your ML pipeline accepts data from external users, data annotators, or partner organizations who provide files in standard formats (CSV, JSON, Parquet, images)
Training datasets are prepared offline and need to be uploaded in bulk -- for example, a data science team exporting 100 GB of labeled data from a labeling platform
You are building a data labeling or annotation platform where annotators upload images, audio, or text files for ML training
The data source is a one-time or periodic batch export from a business system (ERP, CRM, analytics platform) rather than a continuous stream
You need a human-in-the-loop step where data scientists review, upload, and approve training data before it enters the pipeline
Your system must support multiple file formats with automatic detection and schema inference, rather than requiring a single fixed format
Regulatory requirements mandate that data ingestion has an audit trail showing who uploaded what, when, with full provenance tracking

Avoid When

Data arrives as a continuous stream (event logs, sensor data, clickstream) -- use a streaming ingestion block (Kafka, Kinesis) instead
The data source is a database or API that can be queried directly -- use a batch data source connector or API ingestion block instead of exporting and re-uploading files
Files are already in cloud storage and don't need to cross a trust boundary -- use a direct cloud-to-cloud copy (S3 cross-region replication, gsutil cp) instead of re-uploading
The upload volume is extremely low (a few small files per week) and doesn't justify the infrastructure complexity -- a simple form upload to your application server may suffice
You need sub-second data freshness -- file upload is inherently batch-oriented and adds latency from post-processing (scanning, validation, schema inference)
The data is already in a structured, queryable format in a data warehouse -- extracting it as files and re-uploading is wasteful

Key Tradeoffs

Direct Upload vs. Proxy Upload

Factor	Direct-to-Storage (Presigned URL)	Proxy Upload (Through App Server)
Scalability	Virtually unlimited (cloud provider handles bandwidth)	Limited by app server capacity
Cost	Storage cost only (~$0.023/GB/month)	Storage + compute + bandwidth
Latency	Lower (one fewer hop)	Higher (extra hop through app server)
Complexity	Higher (presigned URL management, CORS, event-driven post-processing)	Lower (simple file handling in app code)
Observability	Harder (upload happens outside your stack)	Easier (you see every byte)
Virus Scanning	Post-upload (asynchronous)	Can scan inline (synchronous)

For any deployment handling more than a few uploads per day or files larger than 100 MB, direct-to-storage is the clear winner. The added architectural complexity pays for itself immediately in reduced server costs and improved reliability.

Resumable vs. Non-Resumable Upload

Resumable upload (via tus protocol or S3 multipart) adds approximately 15-20% implementation complexity but is essential for files larger than 100 MB or for users on unreliable networks. In India, where mobile network connectivity can be inconsistent, resumability is not optional -- it is a hard requirement for any user-facing upload feature. A data annotator in a tier-2 city uploading labeled images over a 4G connection will abandon your platform if their upload fails at 90% and they have to restart.

Schema Inference: Eager vs. Lazy

Eager inference (at upload time) adds latency to the upload flow but catches schema issues immediately. Lazy inference (deferred to processing time) is faster for the uploader but pushes errors downstream where they are more expensive to debug. The right choice depends on your pipeline's tolerance for late failure. For ML training pipelines, eager inference is almost always better -- catching a malformed CSV at upload time saves hours of wasted training compute.

Alternatives & Comparisons

Batch Data Source Connector

A batch data source connector pulls data directly from databases, APIs, or data warehouses on a schedule. Use it when the data already lives in a queryable system and exporting to files would be wasteful. File upload is better when data originates outside your infrastructure -- from annotators, partners, or manual exports -- and needs to cross a trust boundary.

Document Loader

A document loader parses and extracts content from files that are already in storage (PDFs, Word docs, HTML). It operates downstream of file upload -- first the file is uploaded and validated, then the document loader extracts its content. They are complementary, not alternatives. However, for unstructured documents, you might combine them into a single 'upload-and-parse' pipeline step.

Streaming Ingestion (Kafka/Kinesis)

Streaming ingestion handles continuous, real-time data flows -- event logs, sensor data, clickstream. File upload handles discrete, batch-oriented data drops. If your data arrives as a continuous stream, file upload is the wrong abstraction. But many production systems use both: streaming for live data and file upload for historical backfills and one-time dataset imports.

API Ingestion

API ingestion pulls data from REST or GraphQL APIs, typically record-by-record or in paginated batches. File upload is better for bulk data (thousands to millions of records at once) and for data that doesn't have a queryable API. API ingestion is better for real-time or near-real-time data from third-party services.

Pros, Cons & Tradeoffs

Advantages

Universal compatibility: Every data source can export files. CSV, JSON, Parquet, and images are the lingua franca of ML data -- no special connectors or API integrations needed. A data scientist in any organization can produce a file.
Human-in-the-loop friendly: File upload naturally supports manual review workflows. Data scientists can inspect, clean, and approve datasets before uploading, providing a critical quality gate that automated ingestion methods lack.
Bulk efficiency: Multipart upload with parallelism can transfer terabytes of data at near-wire-speed. A 100 GB dataset uploads to S3 in under 15 minutes on a 1 Gbps connection with 8 parallel streams.
Audit trail: Every upload creates a discrete, traceable event with metadata (who, when, what, size, checksum). This makes regulatory compliance (GDPR, India's DPDP Act) straightforward because you have a clear record of every data ingestion event.
Format flexibility: A single upload endpoint can accept CSV, JSON, Parquet, images, audio, and video. Downstream format-specific processing is triggered based on the detected file type, not hardcoded in the upload path.
Resumability: With multipart or tus-based upload, large transfers survive network interruptions. This is critical for ML workloads where training datasets can be hundreds of gigabytes.
Cost-effective: Direct-to-storage upload via presigned URLs has essentially zero compute cost. Storage in S3 Standard in Mumbai (ap-south-1) costs just INR 1.9/GB/month ($0.023/GB/month).

Disadvantages

Inherently batch-oriented: File upload introduces latency -- the entire file must be uploaded, scanned, validated, and processed before data is available. For use cases requiring sub-minute data freshness, this is too slow.
Schema rigidity risk: Without careful schema inference and validation, subtle format changes (a new column, a type change, a different date format) can silently break downstream processing. The file was uploaded successfully, but the data is wrong.
Storage cost at scale: Raw file storage accumulates quickly. A team uploading 10 GB of training data daily accumulates 3.6 TB per year -- roughly INR 6,600/year ($80/year) in S3 Standard. With multiple teams and no lifecycle policies, costs grow linearly and indefinitely.
Security surface area: Every file upload is a potential attack vector. Malware in uploaded files, path traversal attacks in filenames, zip bombs, and data exfiltration via upload endpoints all require active mitigation.
Duplication risk: Without deduplication, the same dataset can be uploaded multiple times by different users, wasting storage and creating confusion about which version is canonical. Content-addressed storage (hashing file content) helps but adds complexity.
Post-processing bottleneck: Virus scanning, schema inference, and validation happen after upload, creating a queue. During bulk upload periods (e.g., end-of-sprint data drops), the post-processing queue can back up, delaying data availability.

Set decompression limits: maximum decompressed size (e.g., 10x the compressed size), maximum nesting depth (e.g., 2 levels), and maximum number of files in an archive. ClamAV's MaxFileSize and MaxScanSize options provide protection. Monitor decompression ratios and reject files that exceed safe thresholds.

Placement in an ML System

The Gateway to Your ML Pipeline

File upload sits at the very beginning of the ML data pipeline -- it is literally the first block that data touches when entering your system from the outside world. Every downstream block depends on file upload having done its job correctly: the file is intact, the format is valid, the content is safe, and the metadata is recorded.

In a typical ML system architecture, the flow is: File Upload -> Data Validation -> Feature Engineering -> Model Training. For unstructured data (documents, images), the flow includes a document loader: File Upload -> Document Loader -> Chunking -> Embedding. For data labeling platforms, the flow loops: File Upload -> Annotation UI -> Export -> File Upload (labeled data).

The file upload block interacts with object storage as its primary persistence layer. Raw uploaded files are stored in object storage (S3, GCS, Azure Blob) and referenced by downstream blocks via storage paths. The upload block also writes metadata to a metadata store (typically PostgreSQL or a document database) that tracks upload lineage, schemas, and processing status.

Design Principle: The file upload block should be the only entry point for external file data into your ML system. If data can enter through other paths (direct S3 writes, shared drives, email attachments), you lose the validation, scanning, and audit guarantees that the upload block provides. Enforce this architecturally, not just by convention.

Pipeline Stage

Data Ingestion

Upstream

External data sources (human annotators, partner systems, data exports)
Data labeling platforms

Downstream

data-validation
document-loader
object-storage
feature-engineering

Scaling Bottlenecks

Where File Upload Gets Tight

The primary bottleneck is post-upload processing throughput. Virus scanning with ClamAV processes files at roughly 25-50 MB/s on a single core. Schema inference for large CSV files (100M+ rows) can take minutes. When multiple large uploads arrive simultaneously, the post-processing queue becomes the chokepoint.

Concrete numbers: a single ClamAV worker can scan approximately 2 TB/day. If your platform ingests 10 TB/day, you need at least 5 scanner workers running in parallel. Schema inference for a 10 GB CSV with 500 columns takes approximately 30-60 seconds with PyArrow.

The second bottleneck is cloud storage API rate limits. S3 supports 3,500 PUT requests per second per prefix. If many users upload small files (images) concurrently to the same prefix, you will hit throttling. The solution is to distribute uploads across multiple prefixes using hash-based or date-based partitioning.

Network bandwidth is rarely the bottleneck in cloud environments (inter-region transfers within AWS run at 5-25 Gbps) but can be significant for client-side uploads, especially in India where average upload speeds range from 10-50 Mbps on broadband and 5-20 Mbps on mobile networks.

Production Case Studies

DatabricksData & AI Platform

Databricks Auto Loader automatically ingests files as they arrive in cloud storage, with built-in schema inference and evolution. When new CSV, JSON, or Parquet files are uploaded to a cloud storage path, Auto Loader detects them, infers the schema from the first batch, and evolves it as new columns appear in subsequent uploads. This eliminates the manual schema management that traditionally accompanies file-based data ingestion.

Outcome:

Auto Loader processes millions of files per hour across customer deployments, with schema inference adding less than 5% overhead compared to fixed-schema ingestion. Schema evolution handles additive changes (new columns) automatically, reducing pipeline maintenance by an estimated 40% for teams with frequently changing data sources.

UberTransportation / ML Platform

Uber's Michelangelo ML platform includes a data management layer that ingests training data from multiple sources, including file uploads from data science teams. The platform's Palette feature store accepts bulk data imports from files (CSV, Parquet) and registers their schemas for reuse across ML projects. The ingestion pipeline validates uploaded datasets against expected schemas before making them available for feature engineering and model training.

Outcome:

Michelangelo's data ingestion layer supports hundreds of ML models in production. The schema validation at upload time catches approximately 15% of data quality issues before they reach the training pipeline, saving significant compute costs that would otherwise be wasted on training with malformed data.

RazorpayFintech (India)

Razorpay's internal data pipeline, codenamed Lumberjack, ingests data from multiple producers into their data lake. While primarily event-driven, the system also handles bulk file uploads for merchant onboarding data, KYC documents, and transaction reconciliation files. Files uploaded by merchants are scanned, validated against expected schemas, and routed to appropriate processing pipelines. The system handles sensitive financial data under RBI regulatory requirements, making audit trails and data integrity non-negotiable.

Outcome:

The data pipeline processes over 500 million events daily with zero data loss guarantees. File-based bulk imports for merchant reconciliation reduced manual processing time by 70% compared to the previous email-based submission workflow.

Google CloudCloud Infrastructure

Google Cloud Architecture Center's official guide for deploying automated malware scanning on file uploads using ClamAV engine in Cloud Run, with Eventarc triggers and quarantine workflows for infected files in Cloud Storage.

Outcome:

Production-ready architecture that scans files on upload, moves clean files to clean bucket and infected files to quarantine bucket; includes automated ClamAV database mirroring via Cloud Scheduler with comprehensive logging and monitoring through Cloud Operations.

Tooling & Ecosystem

Uppy

JavaScriptOpen Source

Open-source modular file uploader for the browser. Supports drag-and-drop, webcam, screen capture, and remote sources (Google Drive, Dropbox, Instagram). Integrates with tus protocol for resumable uploads and can upload directly to S3, GCS, or any tus-compatible server. The golden standard for building file upload UIs.

tus (tusd server)

GoOpen Source

Open protocol and reference server implementation for resumable file uploads. The tusd server (written in Go) can store files on local disk, S3, GCS, or Azure Blob. Clients include JavaScript (tus-js-client), Python (tusclient), Java, and more. Used by Cloudflare, Supabase, and Vimeo in production.

AWS S3 Multipart Upload

Multi-language SDKsCommercial

Native S3 API for uploading large objects in parts. Supports files up to 5 TB, parts from 5 MB to 5 GB, and up to 10,000 parts per upload. Available through all AWS SDKs (boto3, @aws-sdk/client-s3, etc.). The TransferManager utility in boto3 automatically handles multipart upload with configurable concurrency.

ClamAV

COpen Source

Open-source antivirus engine for detecting malware in uploaded files. Runs as a daemon (clamd) with client libraries in Python (pyclamd), Node.js, and more. Signature database updated daily via freshclam. The standard choice for file upload security in open-source and self-hosted environments.

Apache Arrow / PyArrow

C++ / PythonOpen Source

In-memory columnar data format with Python bindings. PyArrow provides fast schema inference for CSV, JSON, and Parquet files, cross-format conversion, and efficient reading of large files. The pyarrow.csv.read_csv() and pyarrow.parquet.read_schema() functions are the foundation of most schema inference implementations.

Great Expectations

PythonOpen Source

Data validation framework that lets you define expectations (tests) for your data and validate uploaded files against them. Supports CSV, Parquet, and database sources. Integrates with Airflow, Prefect, and Dagster for pipeline-level validation. Use it to enforce data contracts on every upload.

python-magic

PythonOpen Source

Python wrapper around libmagic for detecting file types from magic bytes (content-based detection, not extension-based). Essential for the post-upload processor to verify that uploaded files actually match their declared format. Prevents format spoofing attacks.

Filepond

JavaScriptOpen Source

JavaScript file upload library with a smooth, accessible UI. Supports chunked uploads, image preview, file validation, and server-side processing. Lighter-weight than Uppy for simpler upload scenarios. Available as React, Vue, Angular, and Svelte components.

Research & References

Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training

Zhao, Xie, Pan, Wu, et al. (2022)ISCA 2022

Meta's analysis of their end-to-end Data Storage and Ingestion (DSI) pipeline for deep learning recommendation models. Shows that data loading and preprocessing can account for up to 32% of total training time when not optimized, and presents the architecture for their warehouse-scale data ingestion system.

A Survey of Pipeline Tools for Data Engineering

Pereira, Araujo, Braga (2024)arXiv preprint

Comprehensive survey of data pipeline tools covering ETL/ELT, data integration, ingestion, transformation, orchestration, and ML pipelines. Compares file-based ingestion patterns across major frameworks including Apache Airflow, Prefect, and Dagster.

Data Pipeline Quality: Influencing Factors, Root Causes of Data-Related Issues, and Processing Antipatterns

Quaranta, Calefato, Lanubile (2023)arXiv preprint

Identifies root causes of data quality issues in ML pipelines, with file ingestion being a primary source of schema drift, format inconsistency, and data corruption. Proposes antipatterns to avoid during data ingestion design.

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Whang, Roh, Song, Lee (2023)The VLDB Journal

Surveys data collection and quality challenges from a data-centric AI perspective. Covers data acquisition (including file upload and crowdsourcing), labeling, integration, cleaning, and their impact on model performance. Argues that data quality at ingestion time is more impactful than model architecture improvements.

Data Management Challenges in Production Machine Learning

Polyzotis, Roy, Whang, Zinkevich (2018)ACM SIGMOD 2018

Google's seminal paper on data management in production ML. Identifies data ingestion validation as a critical challenge and introduces patterns for schema validation, data slicing analysis, and anomaly detection at the ingestion boundary.

cedar: Composable and Optimized Machine Learning Input Data Pipelines

Zhao, Harik, et al. (2024)arXiv preprint

Presents cedar, a composable framework for ML input data pipelines that unifies offline feature engineering with online data loading. Shows that data ingestion pipeline design significantly affects training throughput, with optimized pipelines achieving up to 2.3x improvement.

Interview & Evaluation Perspective

Common Interview Questions

●
How would you design a file upload system for an ML platform that handles datasets up to 100 GB?
●
What are the security considerations when accepting file uploads from external users for ML training data?
●
How do you handle schema inference and validation for uploaded CSV files in a production pipeline?
●
Explain the tradeoffs between proxy upload and direct-to-storage upload with presigned URLs.
●
How would you design resumable uploads for a data annotation platform used by annotators across India?
●
What happens if a user uploads a large file and the connection drops at 80%? Walk me through the recovery flow.
●
How do you prevent data poisoning through malicious file uploads in an ML pipeline?

Key Points to Mention

●
Always use direct-to-storage upload with presigned URLs for production systems. Your application server should never handle file bytes -- only authorization and metadata. This is the single most important architectural decision.
●
Multipart upload with parallel chunks is essential for files over 100 MB. Optimal chunk size is 16-64 MB with 4-8 parallel connections. Always compute and verify checksums at the chunk and whole-file level.
●
Virus scanning is not optional, even for internal uploads. ClamAV is the standard open-source choice. Files must be quarantined until cleared -- never allow unscanned files into the processing pipeline.
●
Schema inference at upload time catches data quality issues early. Parquet gives you schema for free (it is embedded in the file footer). CSV requires sampling. Always compare inferred schemas against expected schemas from a registry.
●
Resumable uploads (via tus protocol or S3 multipart) are essential for user-facing platforms, especially in markets like India where network reliability varies significantly across regions.
●
Consider the trust boundary: every uploaded file is untrusted until validated. Check magic bytes (not just extensions), scan for malware, validate schemas, and enforce size limits.

Pitfalls to Avoid

●
Saying you would proxy file uploads through the application server for a high-scale system. This immediately signals that you haven't built upload systems at scale.
●
Forgetting virus scanning and treating it as a 'nice to have'. In any system that accepts external files, malware scanning is a hard security requirement.
●
Not mentioning checksums or data integrity verification. A file upload system that can't detect corruption is fundamentally broken.
●
Ignoring the operational aspects: incomplete upload cleanup, storage lifecycle policies, upload quotas, and monitoring. These are what separate a prototype from a production system.
●
Assuming all files are small. ML datasets range from megabytes to terabytes. Your design must handle the full range without requiring different code paths for different sizes.

Senior-Level Expectation

A senior or staff-level candidate should discuss the complete upload lifecycle: client-side chunking and progress tracking, presigned URL generation with appropriate conditions (expiry, size limits, content type), direct-to-storage upload with parallel multipart transfers, event-driven post-processing (virus scanning, format detection, schema inference, metadata recording), and integration with downstream pipeline stages. They should address security in depth: magic-byte format detection, ClamAV integration, zip bomb protection, presigned URL abuse prevention, and the principle of quarantine-before-process. Cost analysis is expected: compare proxy vs. direct upload costs at scale, storage pricing across regions (especially ap-south-1 for India deployments), and the cost of post-processing compute. For Indian market context, discuss how network variability affects upload design -- resumability is not optional when your annotators are on 4G connections in tier-2 cities. The ability to reason about schema inference strategies (eager vs. lazy, type lattice, handling missing values) and their downstream impact on training data quality separates senior engineers from mid-level ones.

Summary

File upload is the front door of every ML data pipeline -- the mechanism by which raw data crosses the trust boundary from the outside world into your processing infrastructure. While it sounds simple, production-grade file upload involves a stack of engineering challenges: multipart chunking for reliability and parallelism, presigned URLs for scalable direct-to-storage transfer, virus scanning for security, magic-byte format detection for safety, schema inference for data quality, and metadata tracking for audit and lineage.

The key architectural pattern is direct-to-storage upload with presigned URLs (or SAS tokens on Azure). Your application server should never handle file bytes -- it acts solely as an authorization and coordination layer. File data flows directly from the client to cloud storage (S3, GCS, Azure Blob), and an event-driven post-upload processor handles validation, scanning, and schema inference asynchronously. This pattern scales horizontally by default, costs almost nothing in compute, and keeps your API server lightweight.

For ML-specific workloads, the most critical downstream consideration is schema management. A training CSV that silently changes its column types will produce a model trained on corrupted features -- and you won't know until the model underperforms in production. Eager schema inference at upload time, combined with schema registry validation and drift detection, is the strongest defense against this class of failure. Combined with mandatory virus scanning (ClamAV for self-hosted, cloud-native scanners for managed deployments), size quotas, and content-type verification, the file upload block provides the trust guarantees that every downstream pipeline stage depends on.

Concept Snapshot

Why This Concept Exists

The Problem: Data Doesn't Magically Appear in Your Pipeline

Historical Context: From FTP to Cloud-Native Upload

The Modern Pattern: Presigned URLs and Direct Upload

Core Intuition & Mental Model

The Mental Model: A Secure Mailroom

Why Chunking Changes Everything

The Trust Boundary Principle

Technical Foundations

Transfer Throughput Model

Optimal Chunk Size

Checksum Verification

Schema Inference Complexity

Internal Architecture

Key Components

Data Flow

How to Implement

Two Core Implementation Patterns

Common Implementation Mistakes

When Should You Use This?

Use When

Avoid When

Key Tradeoffs

Direct Upload vs. Proxy Upload

Resumable vs. Non-Resumable Upload

Schema Inference: Eager vs. Lazy

Alternatives & Comparisons

Pros, Cons & Tradeoffs

Advantages

Disadvantages

Failure Modes & Debugging

Incomplete multipart upload accumulation

Silent data corruption during transfer

File format spoofing / malicious uploads

Presigned URL abuse

Schema drift causing silent pipeline failure

Zip bomb / decompression bomb

Placement in an ML System

The Gateway to Your ML Pipeline

Pipeline Stage

Upstream

Downstream

Scaling Bottlenecks

Production Case Studies

Tooling & Ecosystem

Research & References

Interview & Evaluation Perspective

Common Interview Questions

Key Points to Mention

Pitfalls to Avoid

Senior-Level Expectation

Summary

Related Blocks & Further Reading

Related ML Blocks

Further Reading