104,388 structured records. 24,771 unique compounds. 2,315 plant species. PubMed-enriched. Production-ready JSON + Parquet.
Every compound cross-referenced against PubMed (2026). Publication mention counts included — no manual literature search needed.
Built from Dr. Duke's Phytochemical and Ethnobotanical Databases (USDA). No LLM-generated or synthetic data. Every record traceable to peer-reviewed sources.
Delivered as JSON (16.4 MB) and Parquet with Snappy compression (761 KB). Load directly into pandas, Spark, or any data pipeline.
SHA-256 checksums for both files. Manifest with full schema, null counts, and export metadata. 3-pass verification protocol.
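A minimal sketch of verifying a downloaded file against its published checksum. The manifest layout is not described here, so the expected digest is passed in as a plain hex string; the function names `sha256_of_file` and `verify_checksum` are illustrative, not part of the delivery.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex digest.

    Assumption: the expected value is copied by hand from the manifest,
    so surrounding whitespace and letter case are normalised first.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Typical use would be `verify_checksum(Path("ethno_dataset_2026.parquet"), digest_from_manifest)`, where `digest_from_manifest` is whatever value the manifest lists for that file.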
| Column | Type | Nulls | Description |
|---|---|---|---|
| chemical | string | 0 | Compound name (natural key) |
| plant_species | string | 0 | Botanical species name (natural key) |
| application | string | 47,324 | Documented bioactivity / therapeutic use |
| dosage | string | 90,340 | Documented dosage from literature |
| pubmed_mentions_2026 | Int64 | 0 | PubMed publication mention count (title/abstract, 2026) |
Null values in application and dosage reflect genuine gaps
in the source USDA literature — not data engineering errors. These fields document
what has been published, not what has been assumed.
A full null-count breakdown and field completeness report is included in the
SHA-256 Manifest delivered with the dataset.
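With the counts in the table, roughly 45% of application values (47,324 of 104,388) and 87% of dosage values (90,340 of 104,388) are null. The sketch below shows one way to recompute such counts from loaded records. It assumes nulls appear as JSON `null` (Python `None`) and treats an absent key the same way; `null_counts` and `completeness` are illustrative names.

```python
from typing import Any, Iterable, Sequence

FIELDS = (
    "chemical",
    "plant_species",
    "application",
    "dosage",
    "pubmed_mentions_2026",
)


def null_counts(records: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Count null values per field (a missing key is counted as null)."""
    counts = {field: 0 for field in FIELDS}
    for record in records:
        for field in FIELDS:
            if record.get(field) is None:
                counts[field] += 1
    return counts


def completeness(records: Sequence[dict[str, Any]]) -> dict[str, float]:
    """Return the fraction of non-null values per field, between 0 and 1."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in FIELDS}
    nulls = null_counts(records)
    return {field: 1 - nulls[field] / total for field in FIELDS}
```

Comparing the result of `null_counts` with the Nulls column above is a quick sanity check after loading.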
{
  "chemical": "QUERCETIN",
  "plant_species": "Camellia sinensis",
  "application": "Antiinflammatory",
  "dosage": "500 mg/day",
  "pubmed_mentions_2026": 31310
}
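A record like the one above can be mapped onto a typed structure so that the nullable fields are explicit. This is a sketch under the schema table's assumptions (two non-null string keys, two nullable strings, one non-null integer); `CompoundRecord` and `parse_record` are hypothetical names, not shipped with the dataset.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class CompoundRecord:
    chemical: str
    plant_species: str
    application: Optional[str]  # nullable per the schema table
    dosage: Optional[str]  # nullable per the schema table
    pubmed_mentions_2026: int


def parse_record(raw: dict[str, Any]) -> CompoundRecord:
    """Validate one raw JSON object and convert it to a CompoundRecord."""
    for key in ("chemical", "plant_species"):
        value = raw.get(key)
        if not isinstance(value, str) or not value:
            raise ValueError(f"missing or empty natural-key field: {key}")
    mentions = raw.get("pubmed_mentions_2026")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(mentions, bool) or not isinstance(mentions, int) or mentions < 0:
        raise ValueError("pubmed_mentions_2026 must be a non-negative integer")
    return CompoundRecord(
        chemical=raw["chemical"],
        plant_species=raw["plant_species"],
        application=raw.get("application"),
        dosage=raw.get("dosage"),
        pubmed_mentions_2026=mentions,
    )
```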
ethno_dataset_2026.json
Array of 104,388 objects. UTF-8. Ready for any language or tool.
ethno_dataset_2026.parquet
Snappy-compressed columnar format. About 22× smaller than the JSON file (761 KB vs. 16.4 MB). Ideal for pandas, Spark, DuckDB.
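As a rough illustration of loading the JSON file with only the Python standard library, the sketch below reads the top-level array and checks that each object carries the three columns listed with zero nulls. The helper name `load_records` is hypothetical. The Parquet file would instead need a Parquet-capable reader such as pandas with pyarrow, which is not shown here.

```python
import json
from pathlib import Path
from typing import Any

# Columns the schema table lists with zero nulls.
REQUIRED_KEYS = {"chemical", "plant_species", "pubmed_mentions_2026"}


def load_records(path: Path) -> list[dict[str, Any]]:
    """Load the dataset's JSON array and check the always-present keys."""
    with path.open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a JSON object")
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")
    return records
```

After `records = load_records(Path("ethno_dataset_2026.json"))`, the length of `records` should match the 104,388 objects stated for the file.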
Ground your LLM in real phytochemical data. Embed compound-species pairs for retrieval-augmented generation, so answers draw on sourced records instead of model guesses.
Screen 24,771 compounds across 2,315 species. Filter by bioactivity, cross-reference PubMed mentions, identify research gaps.
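One possible screening pass, sketched under two assumptions: bioactivity is matched as a case-insensitive substring of `application`, and a "research gap" means a compound with a documented activity but few PubMed mentions. Both are illustrative choices, as is the name `screen_by_activity`.

```python
from typing import Any, Iterable, Optional


def screen_by_activity(
    records: Iterable[dict[str, Any]],
    activity: str,
    max_mentions: Optional[int] = None,
) -> list[tuple[str, int]]:
    """Return (compound, mentions) pairs for an activity, least-studied first.

    Records with a null application are skipped. If max_mentions is given,
    only compounds at or below that PubMed count are kept.
    """
    needle = activity.casefold()
    mentions_by_compound: dict[str, int] = {}
    for record in records:
        application = record.get("application")
        if application is None or needle not in application.casefold():
            continue
        mentions = record["pubmed_mentions_2026"]
        if max_mentions is not None and mentions > max_mentions:
            continue
        compound = record["chemical"]
        # A compound can appear once per species; keep a single entry.
        mentions_by_compound[compound] = max(
            mentions_by_compound.get(compound, 0), mentions
        )
    # Sort by mention count, then by name for a stable order.
    return sorted(mentions_by_compound.items(), key=lambda item: (item[1], item[0]))
```

Calling `screen_by_activity(records, "Antiinflammatory", max_mentions=10)` would list anti-inflammatory compounds with at most ten PubMed mentions; the threshold of ten is arbitrary.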
Map the nutraceutical landscape. Identify trending compounds by PubMed publication volume. Track emerging plant-based ingredients.
Ready-made training data for phytochemistry ML models. Paired compound-species-activity records with literature-backed enrichment.
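A sketch of turning records into labelled text pairs for a classifier or an embedding index. Only records with a non-null `application` can serve as labelled examples, which by the null counts above is at most 57,064 of the 104,388 records. The text template and the name `build_examples` are assumptions for illustration.

```python
from typing import Any, Iterable


def build_examples(records: Iterable[dict[str, Any]]) -> list[tuple[str, str]]:
    """Build (text, label) pairs from records that have an application.

    The text joins compound and species; the label is the documented
    application. Exact duplicate triples are emitted only once.
    """
    seen: set[tuple[str, str, str]] = set()
    examples: list[tuple[str, str]] = []
    for record in records:
        label = record.get("application")
        if label is None:
            continue  # unlabeled record, not usable for supervised training
        key = (record["chemical"], record["plant_species"], label)
        if key in seen:
            continue
        seen.add(key)
        text = f"{record['chemical']} found in {record['plant_species']}"
        examples.append((text, label))
    return examples
```

The same `text` strings could be passed to an embedding model of your choice; no particular model or vector store is assumed here.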
ethno_dataset_2026.json (16.4 MB), ethno_dataset_2026.parquet (761 KB, Snappy-compressed), and a SHA-256 manifest file.
⚡ Cost comparison: Build vs. Buy
You save ~$4,400 in data engineering costs.
Single: JSON + Parquet + SHA-256 Manifest. 1 legal entity, internal use, perpetual license.
Team: Everything in Single + duckdb_queries.sql (20 queries) + compound_priority_score.py + 4 pre-computed views. Unlimited internal users.
Enterprise: Everything in Team + snowflake_load.sql + chromadb_ingest.py + pinecone_ingest.py + embedding_guide.md + Opportunity Matrix. Multi-entity / group use.
In accordance with § 19 UStG (German VAT Act), no VAT is charged. All prices are net. One-time purchase, no subscription.
📦 You will receive a download link via email within 24 hours after payment.
Includes: ethno_dataset_2026.json (16.4 MB) +
ethno_dataset_2026.parquet (761 KB) + SHA-256 Manifest.
Team + Enterprise tiers include additional analytics artifacts.