Senior Speech Dataset — Licensing | Cephalgo

Bringing speech innovation into clinical practice.

A senior-focused speech corpus for the next generation of voice health AI.

A multilingual speech dataset dedicated to older adults, recorded under harmonized protocols and ethical oversight — purpose-built for AI and ML teams advancing voice models toward real-world clinical and well-being use.

Speaker Coverage

50+ to 80+

Speakers spanning four decades of older adulthood, with consistent representation in every age band.

Balanced cohorts

Speaker recruitment is balanced across self-reported gender categories within each language.

Consistent ratios

Speaker ratios held consistent across all languages, so cross-language modeling is meaningful from the start.

Validated recording

Consistent acoustic environments and recording standards, validated for downstream modeling.

Language Coverage

Seven languages, harmonized under a shared protocol to support cross-language modeling. Each language uses locally appropriate stimuli, with licensed materials where required, and metadata is aligned across the full corpus.

Recording Tasks

Picture description

Connected speech elicited from a visual stimulus, supporting analysis of fluency, content, and discourse structure.

Narrative recall

Short structured narratives capturing memory and language organization.

Verbal fluency

Semantic (category-based) and phonemic (letter-based) word generation tasks.

Read passages

Language-matched standardized passages for prosodic and acoustic analysis.

Phonation & paralinguistic

Sustained phonation and elicitations capturing voice quality, pitch, and articulation.

Free speech

Open-ended prompts that elicit naturalistic, spontaneous speech for modeling.

Speaker Context

Each speaker is documented with rich contextual metadata, giving model developers a meaningful clinical anchor without requiring access to identifiable medical records.

Cognitive status

Captured through widely recognized screening instruments, selected for cross-language comparability.

Physical status

Self-reported indicators relevant to respiratory, cardiovascular, and pain status.

Mood & well-being

Standardized mood and affective screening scores.

Demographics

Age, self-reported gender, language background, and education.

Recording context

Device class, acoustic environment, task code, and consent version.

Multi-dimensional context enables modeling not only of cognitive decline, but also of the broader signals that voice can carry.
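
For teams planning an ingestion pipeline, the metadata categories above could be represented roughly as follows. This is only a sketch under stated assumptions: the class name, field names, types, and sample values are hypothetical and do not reflect the dataset's published schema, which licensees would receive separately.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical record layout; every field name here is an assumption."""
    speaker_id: str                      # pseudonymous identifier, not a real identity
    age_band: str                        # e.g. "60-69" (label format assumed)
    gender: str                          # self-reported
    language: str                        # placeholder code, e.g. "lang-a"
    task_code: str                       # e.g. "picture_description" (assumed value)
    device_class: str
    acoustic_environment: str
    consent_version: str
    education_years: Optional[int] = None
    cognitive_screen_score: Optional[float] = None
    mood_screen_score: Optional[float] = None
    physical_indicators: Tuple[str, ...] = ()   # self-reported flags


def select(records: List[RecordingMetadata], language: str, task_code: str) -> List[RecordingMetadata]:
    """Keep only the records for one language and one recording task."""
    return [r for r in records if r.language == language and r.task_code == task_code]


# Made-up placeholder rows, purely to show the filter in use.
records = [
    RecordingMetadata("spk-0001", "60-69", "female", "lang-a", "picture_description",
                      "smartphone", "quiet_room", "v1"),
    RecordingMetadata("spk-0002", "70-79", "male", "lang-b", "read_passage",
                      "tablet", "quiet_room", "v1"),
]
picked = select(records, "lang-a", "picture_description")
```

Keeping the consent version and task code on every record, as in this sketch, would let a training pipeline filter by permitted use and by task before any audio is loaded.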

Ethics & Compliance

Ethics-cleared

Study protocol cleared across all study sites, under independent oversight.

GDPR-conformant

Consent, data handling, and subject rights aligned with European data protection standards.

Layered consent

Covering research use, commercial licensing, and model training, with subject rights honored throughout the data lifecycle.

Provenance documentation

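Available to licensees, including the Data Protection Impact Assessment (DPIA), Record of Processing Activities (ROPA), and Data Processing Agreement (DPA).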

Identity verification

Ensuring data integrity and preventing duplication across sources.

Full compliance documentation is shared with licensees as part of the onboarding package.

Let's talk about your project.

Tell us what you're building, the languages you need, and your downstream goals. We'll come back with a tailored proposal.