Speech Processing

Teaching Staff: Karydis Ioannis
Code: MO310
Course Type: Direction of BCI - Compulsory
Course Level: Undergraduate
Course Language: Greek
Semester: 8th
ECTS: 5
Teaching Units: 3
Lecture Hours: 2
Lab/Tutorial Hours: 2L
Total Hours: 4
E Class Page: https://opencourses.ionio.gr/courses/DDI119/
Curricula: Revamped Curriculum in Informatics from 2025

Short Description:

Speech processing is the field of science that deals with the analysis, processing, understanding and synthesis of the human voice by computer systems. It is a basic subject of speech technology and has applications in areas such as speech recognition and understanding, voice synthesis, dialogue systems, voice biometrics and assistive technologies for people with disabilities. In modern times, speech technologies are integrated into smart devices, virtual assistants and multimedia interaction environments.

The theoretical part of the course covers: Modeling the speech production mechanism: speech production mechanism, speech sounds. Digital preprocessing of the speech signal: selection of sampling frequency, digitization, short-term analysis of the speech signal, selection of frame length, pre-emphasis, selection of window function, frame shift rate. Acoustic parameters: parameter extraction, acoustic information for speaker discrimination, energy and zero crossings, fundamental frequency, pitch estimation methods, spectrogram, vocal tract resonances (formants), linear prediction coefficients (LPC), filter bank, reflection coefficients, cepstral coefficients. Basic speech processing techniques. Hidden Markov models: definition and fundamental algorithms. Speech recognition/understanding systems, speaker recognition systems. Speech synthesis. Digital noise removal techniques.

The laboratory part covers the practical application of the speech processing techniques presented in the theoretical part, using the open source software GNU Octave.

Objectives - Learning Outcomes:

The course aims to give students an understanding of the basic concepts of speech processing and to cultivate scientific thinking about speech processing technologies and their extensions.

Students will also have the opportunity to:

understand the stages of development of speech processing applications,
be able to design, develop and manage corresponding processes,
detect the opportunities for the development of the technology,
come into contact with related research issues.

Upon completion of the thematic modules, students are able to:

plan the development of speech processing,
implement speech processing usage scenarios,
execute the design and development of a complete software package furnishing speech processing capabilities, and also
detect business-professional opportunities.

Syllabus:

Introduction to Signal Processing

Introduction to Digital Signal Processing

What is a discrete signal and what is a discrete system
Fourier Transform
Signal Sampling and Digitization
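
As a small illustration of the sampling and Fourier transform topics above, the following sketch samples a sine wave and locates its dominant frequency with a naive discrete Fourier transform. The function names and the 8000 Hz sampling rate are assumptions chosen for the example and are not taken from the course material. The second call suggests why the sampling frequency matters: a 7000 Hz tone sampled at 8000 Hz lies above half the sampling rate and shows up as 1000 Hz (aliasing).

```python
import cmath
import math

def sample_sine(freq_hz, fs_hz, n_samples):
    """Sample a unit-amplitude sine of freq_hz at sampling rate fs_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n_total = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_total) for n in range(n_total))
        for k in range(n_total)
    ]

def dominant_frequency(x, fs_hz):
    """Frequency in Hz of the largest-magnitude DFT bin between 0 and fs/2."""
    spectrum = dft(x)
    half = len(x) // 2
    peak_bin = max(range(half + 1), key=lambda k: abs(spectrum[k]))
    return peak_bin * fs_hz / len(x)

fs = 8000  # assumed sampling rate in Hz (illustrative choice)
below_nyquist = dominant_frequency(sample_sine(1000, fs, 64), fs)
above_nyquist = dominant_frequency(sample_sine(7000, fs, 64), fs)
print(below_nyquist)  # 1000.0
print(above_nyquist)  # 1000.0, the 7000 Hz tone is aliased
```

With 64 samples at 8000 Hz the bins are spaced 125 Hz apart, so the 1000 Hz tone falls exactly on bin 8. In practice a fast Fourier transform routine would replace the naive double loop.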

Speech Production

Speech Production
Modeling the Speech Production System

Any acoustic phenomenon can be characterized by three parameters: intensity, frequency, and time. Human perception of sound intensity depends on the frequency of the sound.

Basic steps in speech production

Formation of the idea that we want to communicate
Conversion of the idea into a linguistic structure using related words and phrases
Classification of words based on grammatical rules determined by the language used
Addition of features such as pitch (frequency) and intensity
The brain produces a series of commands that move the vocal system which in turn produces the sound (acoustic) waves

Speech Preprocessing & Speech Parameterization

How do we convert speech from an acoustic signal to a digital one?
Introduction to speech parameterization: how can we retain only those parameters of a speech signal that characterize it?

Speech digitization includes the following steps:

conversion of the acoustic signal into an electrical signal
amplification of the electrical signal coming from the microphone
passage of the electrical signal through a low-pass filter to cut off high frequencies
conversion of the analog signal to digital
separation of the digital speech signal into short time frames (framing)
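
The last step of the list above, together with the pre-emphasis and windowing topics of the theoretical part, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the pre-emphasis coefficient (0.97), frame length (25 ms), frame shift (10 ms) and Hamming window are commonly used values, not values prescribed by the course.

```python
import math

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; emphasizes the higher frequencies."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, frame_len, hop_len):
    """Split x into overlapping frames; samples that do not fill a last frame are dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop_len)]

def hamming(n_points):
    """Hamming window of n_points samples (n_points must be at least 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def windowed_frames(x, fs_hz, frame_ms=25, hop_ms=10):
    """Pre-emphasize x, cut it into frames and apply a Hamming window to each frame."""
    frame_len = int(fs_hz * frame_ms / 1000)
    hop_len = int(fs_hz * hop_ms / 1000)
    window = hamming(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(pre_emphasis(x), frame_len, hop_len)]

fs = 8000  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]  # 1 s synthetic tone
frames = windowed_frames(signal, fs)
print(len(frames), len(frames[0]))  # 98 200
```

One second of signal at 8000 Hz yields 98 frames of 200 samples each, because consecutive frames overlap by 15 ms.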

Speech Parameterization

Speech parameterization: how can we retain only those parameters of a speech signal that characterize it?

Digitized signal data are highly redundant, which motivates:

extraction of appropriate parameters
retention of only the information necessary for a specific use
result: substantial compression of the data volume and easier use

Modeling parameter requirements:

High recognition reliability
Short computational time required for their determination
Small information flow
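
Two of the simplest acoustic parameters listed in the course description, energy and zero crossings, illustrate these requirements: they are cheap to compute and reduce a whole frame to two numbers. The sketch below is only an illustration. The function names are hypothetical, and the two synthetic sine frames merely stand in for a strong low-frequency (voiced-like) frame and a weak high-frequency (unvoiced-like) frame; they are not real speech.

```python
import math

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

fs = 8000        # assumed sampling rate in Hz
frame_len = 200  # 25 ms at 8000 Hz (illustrative choice)

# Strong 100 Hz tone: stand-in for a voiced frame.
voiced_like = [math.sin(2 * math.pi * 100 * (n + 0.5) / fs) for n in range(frame_len)]
# Weak 3000 Hz tone: stand-in for an unvoiced frame.
unvoiced_like = [0.1 * math.sin(2 * math.pi * 3000 * (n + 0.5) / fs) for n in range(frame_len)]

for name, frame in (("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)):
    print(name, short_time_energy(frame), zero_crossing_rate(frame))
```

The voiced-like frame shows high energy and a low zero-crossing rate, while the unvoiced-like frame shows the opposite pattern; here 200 samples per frame are compressed into 2 parameters.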

Linear Predictive Coding

Given the previous values of a function, can we calculate its value at position n?
Predicting a future value of a function introduces the concept of uncertainty

Predicting future values based on existing (known) values is widely used:

Meteorology
Stock market
Signal coding/compression (image/sound/data) for transmission
Biology (predicting population evolution)

Linear prediction is a simple method of predicting future values as a linear combination of existing (known) values.
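
In symbols, a predictor of order p estimates x̂[n] = a1·x[n-1] + a2·x[n-2] + ... + ap·x[n-p]. The sketch below shows one common way to obtain the coefficients, the autocorrelation method solved with the Levinson-Durbin recursion, which also yields the reflection coefficients mentioned in the course description. The function names are hypothetical, and the course may present a different formulation; treat this as an illustration under those assumptions.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag], for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (coeffs a[1..order], reflection coefficients, prediction error energy)."""
    if r[0] == 0:
        raise ValueError("zero-energy signal: predictor is undefined")
    a = [0.0] * (order + 1)   # a[0] is unused
    error = r[0]
    reflection = []
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        reflection.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1 - k * k
    return a[1:], reflection, error

def predict_next(history, coeffs):
    """x_hat[n] = sum over k of a[k] * x[n-k], using the most recent samples."""
    return sum(a_k * history[-k] for k, a_k in enumerate(coeffs, start=1))

# Synthetic test signal x[n] = 0.9^n, which satisfies x[n] = 0.9 * x[n-1] exactly.
x = [0.9 ** n for n in range(200)]
coeffs, reflection, error = levinson_durbin(autocorrelation(x, 2), 2)
print(coeffs)                              # first value close to 0.9, second close to 0
print(predict_next(x[:10], coeffs), x[10])  # the two values nearly coincide
```

For this synthetic signal a first-order predictor already suffices, so the second coefficient comes out close to zero. Real speech frames typically need a higher order, and the uncertainty mentioned above appears as a nonzero prediction error.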

Speech Recognition

Why speech recognition?

Speech is the dominant and most widespread way of human communication
The best human-computer interface!
Most computer users speak faster than they type.
Humans learn to speak before they learn to write, so even toddlers could use computers
People with limited motor skills (or even limited education) will be able to use computers
More natural communication with television, kitchen, coffee maker, front door (intelligent homes)
Virtual Reality systems with speech recognition
Computer/console games

A speech recognition system is a system that transcribes speech into text

It acts like a typist: it "listens" to what the user says and converts it into written text
Speech recognition does not imply speech understanding
Understanding falls within the field of artificial intelligence (AI)
Many systems can recognize speech, but none can truly "understand" it today

Speaker Recognition

A speech recognition system is a system that converts the speech signal into text

We are interested in what the speaker is saying
Speech recognition does not imply speech understanding

Speech analysis: a speech recognition system uses the speech parameterization methods covered earlier in the course

Speech recognition systems are categorized according to

the type of speech they can recognize
the type of speaker
the size of the vocabulary they support
the recognition unit
the recognition technique

Speech synthesis

Speech synthesis is the conversion of text into a speech signal

What is the difference between a speech synthesis system and a CD-player?
A speech synthesis system algorithmically generates a new speech signal, while a CD-player or an mp3 file reproduces a stored speech signal.

Speech synthesis applications

Human-computer interface
People with speech impairments
People with visual impairments
Telecommunications (reading messages, directory information, telephone information, news, etc.)
Entertainment (videogames)

Suggested Bibliography:

“Ψηφιακή Επεξεργασία Φωνής: Θεωρία και Εφαρμογές”, Rabiner L., 2011. Eudoxus code: 13256964
“ΓΛΩΣΣΕΣ ΚΑΙ ΔΙΕΠΑΦΕΣ ΣΤΗ ΜΟΥΣΙΚΗ ΠΛΗΡΟΦΟΡΙΚΗ”, ΔΙΟΝΥΣΙΟΣ ΠΟΛΙΤΗΣ, 2007. Eudoxus code: 13630. https://repository.kallipos.gr/handle/11419/2045
“ΤΕΧΝΟΛΟΓΙΑ ΟΜΙΛΙΑΣ”, ΝΙΚΟΣ ΦΑΚΩΤΑΚΗΣ, in the respective course in opencourses
Lecture slides (only for support)
Labs (code and explanations)
“Audio Processing and Speech Recognition”, accessible with your academic credentials

Teaching Methods: