Services — CurateLM

01 — JSONL Dataset Creation

Training data your model can actually learn from

We design, build and validate JSONL datasets for fine-tuning. Every dataset is schema-matched to your model architecture, with format, quality filters and annotation tailored to your use case.

instruction-tuning · DPO pairs · RLHF · chat format · completion

1. Schema design
We define the exact JSON structure your training framework expects (OpenAI, Hugging Face, Axolotl, etc.).

2. Data transformation
Raw inputs are converted, cleaned and formatted into the agreed schema, with deduplication and normalisation.

3. Quality validation
Every row is validated against the schema and assigned a quality score; flagged rows are reviewed manually.

4. Delivery with data card
Final JSONL file + data card + stats report + sample validation set.
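The validation step above can be sketched in a few lines of Python. This is a minimal illustration, not our production validator: it assumes a chat-format schema in which every row carries a "messages" list of role/content objects, and the names validate_row, validate_jsonl and ALLOWED_ROLES are hypothetical.

```python
import json

# Assumed chat-format schema (hypothetical): each row is an object with a
# "messages" list of {"role": ..., "content": ...} entries.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(line):
    """Return a list of problems found in one JSONL line (empty if valid)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return ["invalid JSON: " + exc.msg]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append("message %d is not an object" % i)
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append("message %d has unknown role %r" % (i, role))
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append("message %d has empty or non-string content" % i)
    return problems


def validate_jsonl(lines):
    """Map 1-based line numbers to their problems; valid rows are omitted."""
    flagged = {}
    for number, line in enumerate(lines, start=1):
        problems = validate_row(line)
        if problems:
            flagged[number] = problems
    return flagged
```

Rows that come back in the flagged mapping are the ones that would go to manual review; a different target framework would need a different set of required fields.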

02 — Parquet Conversion

Large-scale datasets, columnar and fast

We convert large raw datasets into schema-optimised Parquet files, ready for ingestion by Hugging Face Datasets, Apache Spark, BigQuery and other Parquet-aware tools, or for storage on S3.

pre-training · columnar · Hugging Face · Spark

Output formats we deliver

.parquet: columnar, compressed

.jsonl: newline-delimited JSON

.arrow: Apache Arrow IPC

.csv: on request

HF Dataset: push_to_hub ready

Custom: your schema
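Parquet stores one typed array per column, so JSONL rows whose fields change type from row to row have to be reconciled before conversion. The sketch below shows only that pre-check, using the Python standard library; the actual Parquet write would need a library such as pyarrow and is not shown. The function name infer_column_types and the report layout are assumptions made for illustration.

```python
import json


def infer_column_types(jsonl_lines):
    """Summarise the Python types seen for each top-level field.

    Assumes every non-blank line is a JSON object. A field is reported as
    nullable if it is missing or null in at least one row, and as a
    conflict if more than one non-null type was seen.
    """
    seen = {}
    total = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        total += 1
        for key, value in row.items():
            seen.setdefault(key, []).append(type(value).__name__)
    report = {}
    for key, names in seen.items():
        kinds = set(names) - {"NoneType"}
        report[key] = {
            "types": sorted(kinds),
            "nullable": len(names) < total or "NoneType" in names,
            "conflict": len(kinds) > 1,
        }
    return report
```

Fields flagged with a conflict need an explicit decision (cast, split or drop) before a columnar schema can be fixed.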

03 — Data Cleaning and QA

Clean data is the foundation of every good model

We run your raw corpus through a full cleaning pipeline: deduplication, language detection, normalisation, PII scrubbing, toxicity filtering, and quality scoring. You get a clean, audited dataset with a full report.

deduplication · PII removal · language filter · quality score · toxicity filter

Duplicate rows removed: 12.4%

PII instances scrubbed: 847

Low quality rows filtered: 3.1%

Final quality score: 0.97 / 1.00
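To show how two of these report figures could be produced, here is a minimal sketch covering exact-duplicate removal and PII scrubbing only. It rests on loud assumptions: duplicates are detected by hashing a normalised copy of each row, and "PII" is reduced to a deliberately simple email pattern. The names normalise, clean_corpus and EMAIL_RE are hypothetical, and a real pipeline would also cover language detection, toxicity filtering and far broader PII classes.

```python
import hashlib
import re
import unicodedata

# Illustrative email pattern only; real PII scrubbing needs much wider coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalise(text):
    """Unicode-normalise, lowercase and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def clean_corpus(rows):
    """Drop exact duplicates (after normalisation) and mask email addresses."""
    seen = set()
    kept = []
    duplicates = 0
    pii_hits = 0
    for text in rows:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
            continue
        seen.add(digest)
        scrubbed, count = EMAIL_RE.subn("[EMAIL]", text)
        pii_hits += count
        kept.append(scrubbed)
    stats = {
        "duplicate_pct": round(100 * duplicates / len(rows), 1) if rows else 0.0,
        "pii_scrubbed": pii_hits,
    }
    return kept, stats
```

The returned stats dictionary corresponds to the "duplicate rows removed" and "PII instances scrubbed" lines of a report like the one above.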

04 — Evaluation Datasets

Know if your model is actually getting better

We build held-out evaluation and benchmark sets with human-verified ground truth labels, designed to expose model weaknesses and track fine-tuning progress across training runs.

benchmarks · human labels · held-out sets · annotator agreement

1. Task definition
We work with you to define evaluation dimensions, scoring rubrics and edge-case coverage.

2. Human annotation
Labels are created and verified by human reviewers, with inter-annotator agreement scoring.

3. Benchmark delivery
Eval set in JSONL + Parquet + scoring script, ready to plug into your eval harness.
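Inter-annotator agreement can be measured in several ways. As one illustration, the sketch below computes Cohen's kappa for two annotators who labelled the same items; choosing this metric is an assumption made for the example, and the function name cohens_kappa is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A value of 1.0 means perfect agreement and 0.0 means agreement no better than chance; with more than two annotators a measure such as Fleiss' kappa or Krippendorff's alpha would be the usual alternative.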

05 — RAG Chunking

Retrieval data that actually retrieves

We segment, chunk and enrich documents for Retrieval-Augmented Generation pipelines. Chunking strategy, overlap settings and metadata enrichment are all tuned to your vector database and embedding model.

RAG · embeddings · chunking · metadata · vector DB ready

{
  "chunk_id": "doc_42_chunk_7",
  "text": "The transformer architecture...",
  "tokens": 256,
  "overlap": 32,
  "source_page": 14
}

Compatible with Pinecone, Weaviate, Qdrant, pgvector, Chroma and more.
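A chunk record like the sample above could come from a sliding window over the token stream. The sketch below is a simplified illustration under stated assumptions: whitespace splitting stands in for the embedding model's real tokenizer, each chunk shares its first overlap tokens with the previous one, and the helper names chunk_tokens and build_records are hypothetical.

```python
def chunk_tokens(tokens, size=256, overlap=32):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


def build_records(doc_id, text, size=256, overlap=32, source_page=None):
    """Turn one document into chunk records shaped like the sample above."""
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    records = []
    for index, window in enumerate(chunk_tokens(tokens, size, overlap)):
        records.append({
            "chunk_id": "%s_chunk_%d" % (doc_id, index),
            "text": " ".join(window),
            "tokens": len(window),
            "overlap": overlap if index > 0 else 0,  # first chunk has no predecessor
            "source_page": source_page,
        })
    return records
```

In practice the window size and overlap would be chosen against the embedding model's context length and the retrieval quality observed on real queries.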

06 — Custom Pipelines

Recurring pipelines built around your workflow

For teams with ongoing data needs, we build repeatable batch processing pipelines — weekly jobs, format converters, annotation tooling and automated QA reporting — that run without intervention.

recurring jobs · automation · annotation tooling · QA reports

Weekly batch · runs every Monday 02:00 UTC

Auto QA report delivered by email

2 rows flagged for manual review

48,291 rows delivered · avg score 0.97
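For a schedule like the Monday 02:00 UTC batch shown above, the next run time can be computed with the standard library alone. This is a hedged sketch rather than a description of our scheduler: the function name next_weekly_run is hypothetical, and in practice a cron entry or workflow orchestrator would normally own the schedule.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now, weekday=0, hour=2):
    """Return the next run strictly after `now`, as a UTC datetime.

    `now` must be timezone-aware. `weekday` follows datetime.weekday(),
    so 0 is Monday; `hour` is the UTC hour of the run.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate
```

Calling it with datetime.now(timezone.utc) gives the upcoming batch window.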

07 — On-Premise Data Curation

For organisations that can't send data off-site

Banks, healthcare providers, government agencies and other regulated organisations can have our full curation pipeline — PII scrubbing, deduplication, structuring and quality scoring — run entirely on-premise. No cloud storage, no US servers, AES-256 encryption throughout, delivered under a signed Art. 28 GDPR Data Processing Agreement.

on-premise · no cloud · AES-256 · Art. 28 DPA · banks & healthcare · government

Your servers or ours: Germany only, never cloud

AES-256 encryption at rest and throughout processing

Signed Art. 28 GDPR Data Processing Agreement (Auftragsverarbeitungsvertrag)

Zero cloud transfer: by design, not policy
