Non-THD Training Data

GenRender6 - 50:50 GPU Split
Updated: Apr 15, 2026
Overview
Total Non-THD Rows: ~4M+ (across ~40 parquets)
T2I Stock: 1.2B (disco-feature-store)
IE Single-Ref: 3.21M (18 task categories)
MultiRef: 721K (10 glow_worm parquets)
S3 Buckets: 5 (clio, disco, hz-os, myedit, genie)
GPU Allocation: 50% (28 GPUs THD : 28 GPUs non-THD)
IE Single-Ref Samples: 3.21M pairs across 18 tasks
me_attribute, osvt_extraction, depth, colorize, edge, harmonize, ocr, stylize...
MultiRef Samples: 721K triplets from glow_worm
research, pixel-aligned, tryon, character consistency, layered data...
Non-THD vs THD Data Volume
IE Single-Ref: Non-THD 3.21M (98.7%) vs THD 41K
MultiRef: Non-THD 721K (99.4%) vs THD 4K
THD data is under 1.5% of the total: the model learns general capabilities from non-THD data, and THD data then supplies domain-specific product knowledge.
Non-THD T2I Data: disco-feature-store + clio.corp.adobe.com
Stock (lean_and_long): 1.2B (disco-feature-store, ~44GB per shard)
Express Templates: 441.7K (template rendering images)
Synthetic Text: 38.7M (synthetic text overlays)
Augmented Design: 57K (design templates)
Multipage: 637K (multi-page documents)
Parquet Sources
Source | Rows | S3 Bucket | Config
Stock (lean_and_long gen6) | ~1.2B | disco-feature-store | /features-main/lean_and_long/gen6/train/img_1024px_20251204.yaml
Express Template Rendering | 441,700 | clio.corp.adobe.com | /njindal_parquet/scratch/img_1024px
Synthetic Text | 38,700,000 | clio.corp.adobe.com | /njindal_parquet/scratch/img_1024px
Augmented Design | 57,000 | clio.corp.adobe.com | /njindal_parquet/scratch/img_1024px
Multipage Combined | 637,000 | clio.corp.adobe.com | /njindal_parquet/scratch/img_1024px
Sample T2I Stock Images: from the mldp-image bucket (QOI format, decoded to JPEG)
Note: The 1.2B stock dataset is the backbone of T2I training. It lives in the disco-feature-store bucket as partitioned parquets (~44GB per shard); images are stored in QOI format in the mldp-image bucket. This is the same dataset used in production GR6 training, and THD's 162K stock images are a tiny filtered subset of it. All images have SCAP v2 JSON captions.
Non-THD IE Single-Ref Data: clio5/reference_captions - 18 task parquets
Total IE Rows: 3.21M (18 parquets, v3.0 + v3.1)
Largest: 1.0M (scs_filtered_batch1)
Second: 987K (me_stock_human)
Task Types: 18 (depth, edge, colorize, ocr...)
Caption Model: Qwen3 (32B, Nov 2025)
vs THD IE: 78x (3.21M vs 41K THD)
IE Dataset Size by Task (3.21M total)
Sample IE Pairs: from me_attribute and osvt_extraction
Schema (shared across all 18 parquets)
edit_instruction: Natural language editing instruction
reference_images: Array of source image S3 paths
target_image: Result image S3 path or hash
before_caption: SCAP v2 JSON of source
after_caption: SCAP v2 JSON of target
dataset_source: Task type identifier
ie_score / dists_score / edge_sim_score: Quality metrics
is_bad_caption / is_same_caption / same_ar: Filter flags
Training filters: is_same_caption=False, is_bad_caption=False, length(after_caption)>500, same_ar=True
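The four training filters translate directly to a boolean mask over the schema fields. A minimal pandas sketch over invented toy rows; only the filter logic mirrors this page, and reading length(after_caption) as a character count is an assumption:

```python
import pandas as pd

# Toy rows mimicking the shared IE schema (values invented for illustration).
df = pd.DataFrame({
    "edit_instruction": ["add depth map", "colorize photo", "no-op edit"],
    "after_caption": ["x" * 600, "x" * 100, "x" * 600],
    "is_same_caption": [False, False, True],
    "is_bad_caption": [False, False, False],
    "same_ar": [True, True, True],
})

# Training filters from above: is_same_caption=False, is_bad_caption=False,
# length(after_caption) > 500, same_ar=True.
mask = (
    ~df["is_same_caption"]
    & ~df["is_bad_caption"]
    & (df["after_caption"].str.len() > 500)
    & df["same_ar"]
)
kept = df[mask]
print(len(kept))  # 1 -- only "add depth map" passes all four filters
```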
Non-THD MultiRef Data: glow_worm parquets - 10 datasets
Total MultiRef: 721K (10 glow_worm parquets)
Largest: 250K (research)
Character: 169K (consistency shots)
Layered: 174K (layered + augmented)
Tryon: 11.5K (virtual try-on)
vs THD MR: 180x (721K vs 4K THD)
MultiRef Dataset Sizes (721K total)
Sample MultiRef Triplets: from the glow_worm research dataset
Schema
reference_images: Array of ref image S3 paths (2-4 refs)
target_image: Composite result S3 path
edit_instruction: Detailed composition instruction
before_caption / after_caption: SCAP v2 JSON
num_reference: Number of reference images
dataset_source: Source dataset identifier
S3 bucket: hz-os-data/glow-worm/editing/multi-reference/
Training filters: num_reference 2-4, edit_instruction IS NOT NULL, is_same_caption=False
THD vs Non-THD Side-by-Side Comparison
Training Strategy: The model uses a 50:50 GPU allocation (28 GPUs THD : 28 GPUs non-THD). A YAML comment mentions a "recommended 80-20" split, but the actual config allocates equal GPU groups per task. Non-THD data teaches the model general image understanding (depth, edges, colorization, object extraction, text editing), while THD data teaches Home Depot-specific product knowledge.

Key Insight: Despite the 50:50 GPU split, non-THD datasets are 78-180x larger by row count, so THD data is heavily cycled/repeated during training while non-THD data is sampled more sparsely. Source: 20260330_lite_thd.yaml line 1847 + GPU group allocation lines 1758-1770 (branch: joshuamunoz/multiref).
Metric | THD | Non-THD | Ratio
T2I Data | 224.7K (62K HD + 162K stock) | ~1.2B stock + 39M synthetic + 1M other | ~5,400x
IE Single-Ref | 40,964 | 3,208,657 | 78x
MultiRef | 4,017 | 721,129 | 180x
IE Task Types | 1 (THD video frames) | 18 (depth, edge, colorize, ocr, stylize...) | 18x variety
IE Sources | YT + Brightcove + Missions | Stock, MADA, Express, Abaka, OSVT | 5x+ sources
MultiRef Types | 1 (vendor scene+tool) | 6 (research, tryon, character, layered...) | 6x variety
S3 Buckets | foundry-thd, adobe-xingtail | clio, disco, hz-os, myedit, genie | More distributed
Caption Model | GPT-4 (March 2026) | Qwen3-32B (Nov 2025) | Different models
GPU Allocation | 28 GPUs (50%) | 28 GPUs (50%) | 1:1 equal
T2I GPUs | 12 GPUs (512+1024+2048) | 12 GPUs (512+1024+2048) | 1:1
IE GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1
MultiRef GPUs | 8 GPUs (1024+2048) | 8 GPUs (1024) | 1:1
Data Cycling | Heavy repeat (small data) | Sparse sampling (large data) | THD sees same data ~100x more
YAML Comment | "Use recommended 80-20 for THD vs non-THD" | Actual config is 50:50 | Mismatch
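The data-cycling row is back-of-envelope arithmetic from the row counts: with equal per-side sampling throughput, each THD row is revisited roughly the size ratio per non-THD epoch. The "~100x" is a rough midpoint of the per-task ratios; a quick check:

```python
# Back-of-envelope: equal per-side throughput means each THD row is
# revisited size_ratio times while non-THD completes one epoch.
# Row counts from the comparison table on this page.
sizes = {
    "IE":       {"thd": 40_964, "non_thd": 3_208_657},
    "MultiRef": {"thd": 4_017,  "non_thd": 721_129},
}
ratios = {task: s["non_thd"] / s["thd"] for task, s in sizes.items()}
for task, ratio in ratios.items():
    print(f"{task}: THD rows repeat ~{ratio:.0f}x per non-THD epoch")
# IE: ~78x, MultiRef: ~180x
```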
Implication for Data Curation
Why adding THD data has outsized impact:
1. MultiRef is most starved: 4K THD vs 721K non-THD = 180x gap. Even 1K new THD triplets = 25% increase in THD multiref data.
2. IE has moderate gap: 41K THD vs 3.2M non-THD = 78x. THD IE frames are all from HD-specific video sources.
3. T2I is least starved: 225K THD-adjacent vs 1.2B general stock. More T2I data helps but marginal returns are lower.
4. Non-THD teaches general skills: depth estimation, edge detection, colorization, text editing, style transfer - THD data doesn't need to cover these.
5. THD uniqueness: Product placement in home scenes, tool-in-use compositions, branded product swaps - these can ONLY come from THD data.
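Point 1's "25% increase" follows directly from the THD pool sizes above. A tiny illustrative helper (row counts from this page; the function itself is an assumption, not part of any training config):

```python
# Relative growth of each task's THD pool from adding new rows.
# Pool sizes taken from the THD column of the comparison table.
thd_rows = {"T2I": 224_700, "IE": 40_964, "MultiRef": 4_017}

def relative_gain(task: str, new_rows: int) -> float:
    """Fractional growth of the task's THD pool from new_rows additions."""
    return new_rows / thd_rows[task]

# Adding 1K MultiRef triplets vs 1K IE pairs:
print(f"{relative_gain('MultiRef', 1_000):.0%}")  # 25%
print(f"{relative_gain('IE', 1_000):.0%}")        # 2%
```

The same 1K rows buy a 25% bump for MultiRef but only ~2% for IE, which is why MultiRef is the highest-leverage target for new THD data.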