Production-Grade · SHA-256 Verified · Perpetual License

The Complete Phytochemical Intelligence Database

104,388 structured records. 24,771 unique compounds. 2,315 plant species. PubMed-enriched. Production-ready JSON + Parquet.

Dataset at a Glance

104,388 Total Records

24,771 Unique Compounds

2,315 Plant Species

🔗 Source: USDA Dr. Duke's Database

📚 PubMed-enriched via NCBI E-utils (2026)

✅ SHA-256 verified · 3-pass export protocol

⭐ Sample on GitHub

Why This Dataset

🔬

PubMed-Enriched

Every compound cross-referenced against PubMed (2026). Publication mention counts included — no manual literature search needed.

🧬

USDA-Sourced

Built from Dr. Duke's Phytochemical and Ethnobotanical Databases (USDA). No LLM-generated or synthetic data. Every record traceable to peer-reviewed sources.

⚡

ML-Ready Formats

Delivered as JSON (16.4 MB) and Parquet with Snappy compression (761 KB). Load directly into pandas, Spark, or any data pipeline.

🔒

Integrity-Verified

SHA-256 checksums for both files. Manifest with full schema, null counts, and export metadata. 3-pass verification protocol.

Schema

Column	Type	Nulls	Description
`chemical`	string	0	Compound name (natural key)
`plant_species`	string	0	Botanical species name (natural key)
`application`	string	47,324	Documented bioactivity / therapeutic use
`dosage`	string	90,340	Documented dosage from literature
`pubmed_mentions_2026`	Int64	0	PubMed publication mention count (title/abstract, 2026)

ℹ️

Null values in application and dosage reflect genuine gaps in the source USDA literature — not data engineering errors. These fields document what has been published, not what has been assumed. A full null-count breakdown and field completeness report is included in the SHA-256 Manifest delivered with the dataset.
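
The null counts in the schema table translate directly into field completeness. A minimal sketch of that arithmetic, using only the figures published in the table (the function and variable names are illustrative and not part of the delivered manifest):

```python
# Completeness per field, derived from the published null counts.
TOTAL_RECORDS = 104_388

# Null counts copied from the schema table.
NULL_COUNTS = {
    "chemical": 0,
    "plant_species": 0,
    "application": 47_324,
    "dosage": 90_340,
    "pubmed_mentions_2026": 0,
}


def completeness(null_count: int, total: int = TOTAL_RECORDS) -> float:
    """Return the share of records with a non-null value, as a percentage."""
    return 100.0 * (total - null_count) / total


if __name__ == "__main__":
    for field, nulls in NULL_COUNTS.items():
        filled = TOTAL_RECORDS - nulls
        print(f"{field:<22} {filled:>7,} filled  {completeness(nulls):5.1f}%")
```

With these counts, application is filled for roughly 55% of records (57,064) and dosage for roughly 13% (14,048).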

Sample Record

ethno_dataset_2026.json

{
  "chemical": "QUERCETIN",
  "plant_species": "Camellia sinensis",
  "application": "Antiinflammatory",
  "dosage": "500 mg/day",
  "pubmed_mentions_2026": 31310
}
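
A record like the one above can be checked against the schema before it enters a pipeline. The sketch below is one possible validator, not a tool shipped with the dataset. It assumes that missing values are serialized as JSON null (None after parsing) rather than as absent keys or empty strings; `SCHEMA` and `validate_record` are hypothetical names.

```python
import json

# Expected Python type and nullability per field, following the schema table.
# Assumption: nulls arrive as JSON null, i.e. None after parsing.
SCHEMA = {
    "chemical": (str, False),
    "plant_species": (str, False),
    "application": (str, True),
    "dosage": (str, True),
    "pubmed_mentions_2026": (int, False),
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (expected_type, nullable) in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                problems.append(f"null not allowed: {field}")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


# The sample record from this page.
SAMPLE = """
{
  "chemical": "QUERCETIN",
  "plant_species": "Camellia sinensis",
  "application": "Antiinflammatory",
  "dosage": "500 mg/day",
  "pubmed_mentions_2026": 31310
}
"""

record = json.loads(SAMPLE)
print(validate_record(record))  # []
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue in a batch.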

What You Get

📦 JSON

16.4 MB

ethno_dataset_2026.json
Array of 104,388 objects. UTF-8. Ready for any language or tool.

🗜️ Parquet

761 KB

ethno_dataset_2026.parquet
Snappy-compressed columnar format. About 22× smaller than the JSON file. Ideal for pandas, Spark, DuckDB.

SHA-256 (JSON): af37e01920aca629ed29e0c2716ebea002754924f7f4f1c7f61826c1b6056c0f
SHA-256 (Parquet): e112071d43ca986ef7048532c46622e6e167f2199b592f7ed7eca91c38c82b9b
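
After download, the two files can be checked against the checksums listed above. The following is a minimal sketch using Python's standard `hashlib`. It assumes both files sit in the current directory under their published names; `sha256_of` and `verify` are illustrative helper names, not part of the delivery.

```python
import hashlib
from pathlib import Path

# Published checksums, copied from the listing above.
EXPECTED_SHA256 = {
    "ethno_dataset_2026.json":
        "af37e01920aca629ed29e0c2716ebea002754924f7f4f1c7f61826c1b6056c0f",
    "ethno_dataset_2026.parquet":
        "e112071d43ca986ef7048532c46622e6e167f2199b592f7ed7eca91c38c82b9b",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so it is never read into memory all at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Path) -> dict[str, bool]:
    """Map each expected file name to True if it exists and its hash matches."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = directory / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(Path(".")).items():
        print(f"{name}: {'OK' if ok else 'MISMATCH or missing'}")
```

A mismatch usually points to an incomplete download or a modified file, so re-downloading is the first thing to try.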

Built For

🤖

RAG Pipelines

Ground your LLM in real phytochemical data. Embed compound-species pairs for retrieval-augmented generation, so answers draw on sourced records rather than model guesses.

💊

Drug Discovery

Screen 24,771 compounds across 2,315 species. Filter by bioactivity, cross-reference PubMed mentions, identify research gaps.

📊

Market Intelligence

Map the nutraceutical landscape. Identify trending compounds by PubMed publication volume. Track emerging plant-based ingredients.

🎓

Academic Research

Ready-made training data for phytochemistry ML models. Paired compound-species-activity records with literature-backed enrichment.
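
The use cases above share a few basic operations: filtering by bioactivity, ranking compounds by PubMed mentions, and rendering records as text for embedding. The sketch below shows one way to do each with plain Python. Only the first record is the sample from this page. The `COMPOUND_A` and `COMPOUND_B` rows are made-up placeholders, and all function names are hypothetical. It also assumes the mention count is a per-compound value that repeats on every row for that compound.

```python
# Records in the dataset's schema. Only the first row is real (the sample
# record from this page); the other two are placeholders for illustration.
RECORDS = [
    {"chemical": "QUERCETIN", "plant_species": "Camellia sinensis",
     "application": "Antiinflammatory", "dosage": "500 mg/day",
     "pubmed_mentions_2026": 31310},
    {"chemical": "COMPOUND_A", "plant_species": "Species alpha",
     "application": None, "dosage": None, "pubmed_mentions_2026": 12},
    {"chemical": "COMPOUND_B", "plant_species": "Species beta",
     "application": "Antioxidant", "dosage": None,
     "pubmed_mentions_2026": 340},
]


def screen(records, activity):
    """Keep records whose application mentions the activity (case-insensitive)."""
    needle = activity.lower()
    return [r for r in records
            if r["application"] and needle in r["application"].lower()]


def rank_compounds(records):
    """Return (chemical, mentions) pairs, highest PubMed count first.

    Assumes the count is per compound, so any row for a compound carries it.
    """
    mentions = {r["chemical"]: r["pubmed_mentions_2026"] for r in records}
    return sorted(mentions.items(), key=lambda item: item[1], reverse=True)


def to_document(record):
    """Render one record as a short text passage, skipping null fields."""
    parts = [f"{record['chemical']} occurs in {record['plant_species']}."]
    if record["application"]:
        parts.append(f"Documented activity: {record['application']}.")
    if record["dosage"]:
        parts.append(f"Documented dosage: {record['dosage']}.")
    parts.append(f"PubMed mentions (2026): {record['pubmed_mentions_2026']}.")
    return " ".join(parts)


if __name__ == "__main__":
    print(screen(RECORDS, "antiinflammatory"))
    print(rank_compounds(RECORDS))
    print(to_document(RECORDS[0]))
```

The strings produced by `to_document` can be passed to whichever embedding model and vector store the pipeline already uses. Leaving null fields out of the text keeps the passage from asserting an activity or dosage that the source does not document.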

Frequently Asked Questions

Can I test the data before purchasing?

Yes. A free 400-row sample (the 400 most PubMed-cited compounds) is available on GitHub. It contains the full schema, real values, and the same JSON/Parquet format as the production dataset.

How and when will I receive the files?

After payment via Stripe, you will receive a secure download link by email within 24 hours. The ZIP archive contains: ethno_dataset_2026.json (16.4 MB), ethno_dataset_2026.parquet (761 KB, Snappy-compressed), and a SHA-256 manifest file.

What license do I get?

A perpetual, non-exclusive commercial license for internal use (MLOps pipelines, RAG systems, vector embeddings, research). Redistribution or resale of the raw dataset is not permitted. See full terms below.

Why are there null values in the application and dosage fields?

These reflect genuine gaps in the source USDA literature, not data engineering errors. Of the 104,388 records, approximately 55% have a documented application value and approximately 13% have a documented dosage value (47,324 and 90,340 nulls respectively). The null distribution is itself a scientifically meaningful signal about research coverage.

Is this compatible with my tech stack?

JSON (Array of Objects, UTF-8) loads directly into Python (pandas, json), Node.js, R, and any language with a JSON parser. Parquet (Snappy compression) is immediately readable by pandas, Apache Spark, DuckDB, Polars, and BigQuery.
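
As a minimal sketch of the JSON path, the file can be loaded with Python's standard library alone and summarized into the headline counts. `load_records` and `summarize` are illustrative names. The sketch assumes the file is in the working directory and that compound and species names are compared as exact strings.

```python
import json
from pathlib import Path


def load_records(path):
    """Load the dataset file, which is a single JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


def summarize(records):
    """Count records, distinct compound names, and distinct species names."""
    return {
        "records": len(records),
        "unique_compounds": len({r["chemical"] for r in records}),
        "plant_species": len({r["plant_species"] for r in records}),
    }


if __name__ == "__main__":
    path = Path("ethno_dataset_2026.json")
    if path.exists():
        print(summarize(load_records(path)))
    else:
        print(f"{path} not found in the current directory")
```

If pandas is installed, `pandas.read_json` and `pandas.read_parquet` cover the same loading step and return a DataFrame directly. Reading Parquet with pandas also needs a Parquet engine such as pyarrow.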

Is VAT included in the price?

The operator is a Kleinunternehmer per §19 UStG (German small business regulation). No VAT is charged. The listed price of €699 is the final price.

What are the team and enterprise options?

Team (€1,349): Everything in Single + 20 DuckDB queries, priority score calculator, 4 pre-computed views. Unlimited internal users. Enterprise (€1,699): Everything in Team + Snowflake, ChromaDB, and Pinecone integration scripts + embedding guide + opportunity matrix. Multi-entity use. See pricing →

⚡ Cost comparison: Build vs. Buy

Self-sourcing this dataset: ~60 hrs × $85/hr ≈ $5,100

Ethno-API Dataset: one-time license, €699

You save ~$4,400 in data engineering costs.

Single Entity

€699

JSON + Parquet + SHA-256 Manifest.
1 legal entity, internal use, perpetual license.

Buy Single Entity →

Team

€1,349

Everything in Single + duckdb_queries.sql (20 Queries) + compound_priority_score.py + 4 Pre-computed Views. Unlimited internal users.

Buy Team →

Enterprise

€1,699

Everything in Team + snowflake_load.sql + chromadb_ingest.py + pinecone_ingest.py + embedding_guide.md + Opportunity Matrix. Multi-entity / group use.

Buy Enterprise →

In accordance with § 19 UStG, no VAT is charged. All prices are net. One-time purchase, no subscription.

→ Download the free 400-row sample first (GitHub) · View on GitHub ↗

📦 You will receive a download link via email within 24 hours after payment. Includes: ethno_dataset_2026.json (16.4 MB) + ethno_dataset_2026.parquet (761 KB) + SHA-256 Manifest. Team + Enterprise tiers include additional analytics artifacts.
