Commit d16702e1 authored by Elias

Grouping of elements in the readme file

parent f7c909d2
# Scripts
# Corpus Analysis and Dataset Construction
## 1. Data Preparation
### Extracting elements from files
**TrainValTest1_csvAPartirDeTextGrid.py :** This script reads a TextGrid transcription file from the textGrid folder, retrieves the start time (xmin), end time (xmax), and corresponding text for speaker1 and speaker2, and saves them in two .csv files, one per speaker. Both files are written to the textGridEnCSV/ folder.
```bash
python3 TrainValTest1_csvAPartirDeTextGrid.py textGrid
```
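As an illustration of the extraction step, here is a minimal sketch, not the script's actual code. It assumes the long (verbose) TextGrid text format and tiers named after the speakers; the function names `intervals_by_tier` and `write_csv` are hypothetical, and escaped quotes inside a text field are not handled.

```python
import csv
import re

# One interval entry of a long-format TextGrid (assumed layout).
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[\d.]+)\s*'
    r'xmax = (?P<xmax>[\d.]+)\s*'
    r'text = "(?P<text>[^"]*)"'
)
TIER_NAME_RE = re.compile(r'name = "([^"]*)"')


def intervals_by_tier(textgrid_text):
    """Return {tier_name: [(xmin, xmax, text), ...]} (hypothetical helper)."""
    tiers = {}
    # Each tier starts with "item [n]:"; the first chunk is the file header.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        name_match = TIER_NAME_RE.search(chunk)
        if name_match is None:
            continue
        tiers[name_match.group(1)] = [
            (float(m['xmin']), float(m['xmax']), m['text'])
            for m in INTERVAL_RE.finditer(chunk)
        ]
    return tiers


def write_csv(rows, fileobj):
    """Write (xmin, xmax, text) rows with a header line to an open file."""
    writer = csv.writer(fileobj)
    writer.writerow(["xmin", "xmax", "text"])
    writer.writerows(rows)
```

Each speaker's rows could then be written to its own file in textGridEnCSV/ with `write_csv`.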
## 2. Partitioning the corpus
### Shuffling and Splitting
**TrainValTest2_Random.py :** This script reads each csv file from the textGridEnCSV folder, shuffles the lines whose text is non-empty, and splits them into train/val/test sets.
```bash
python3 TrainValTest2_Random.py > resTrainValTest2_Random.txt
```
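The shuffle-and-split step can be sketched as follows. This is only an illustration: the 80/10/10 ratios, the fixed seed, and the function name `shuffle_and_split` are assumptions, since the README does not state the proportions the script uses.

```python
import random


def shuffle_and_split(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Drop rows with empty text, shuffle, and cut into train/val/test.

    The fractions and the seed are assumed values, not the script's.
    """
    kept = [row for row in rows if row["text"].strip()]
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(kept)
    n_train = int(len(kept) * train_frac)
    n_val = int(len(kept) * val_frac)
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]  # whatever remains goes to test
    return train, val, test
```

Seeding a local `random.Random` instance rather than the global generator means two runs on the same input produce the same partition.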
## 3. Lexical Analysis
### Extraction of Linguistic Units
**TrainValTest3_Random.py :** This script extracts the words present in the Train set.
```bash
@@ -24,6 +30,8 @@ python3 TrainValTest3_Random.py
python3 TrainValTest3_bis_Random.py
```
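The word extraction described for TrainValTest3_Random.py could look roughly like the sketch below. Whitespace tokenization and lowercasing are assumptions made for illustration, and `train_vocabulary` is a hypothetical name.

```python
def train_vocabulary(train_texts):
    """Return the sorted set of distinct words found in the Train texts."""
    vocab = set()
    for text in train_texts:
        # Assumed tokenization: lowercase, then split on whitespace.
        vocab.update(text.lower().split())
    return sorted(vocab)
```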
## 4. Counting and Filtering
### Occurrence Statistics
**TrainValTest4_Random.py :** This script counts the occurrences of the Train words in each of Train/Val/Test.
```bash
@@ -42,12 +50,15 @@ python3 TrainValTest5_Random.py
python3 TrainValTest6_Random.py
```
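Counting the occurrences of Train words in a given split, as described for TrainValTest4_Random.py, can be sketched with `collections.Counter`. The whitespace tokenization and the function name `count_train_words` are assumptions.

```python
from collections import Counter


def count_train_words(train_texts, split_texts):
    """Count how often each Train word occurs in the texts of one split."""
    train_words = {word for text in train_texts for word in text.split()}
    counts = Counter(
        word
        for text in split_texts
        for word in text.split()
        if word in train_words  # words unseen in Train are ignored
    )
    # Counter returns 0 for missing keys, so absent Train words are kept.
    return {word: counts[word] for word in sorted(train_words)}
```

Calling it once per split (Train, Val, Test) with the same `train_texts` gives one count table per split over a shared vocabulary.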
## 5. Binning and Visualization
### Frequency Grouping
**TrainValTest7_Random.py :** This script groups all the words into 10 bins, each representing 10% of the total number of word occurrences in Train. This lets you choose words from every bin for the test phase: some bins hold the least frequent words in the corpus, others the most frequent.
```bash
python3 TrainValTest7_Random.py
```
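One possible way to build such bins is sketched below. It is an assumption about the approach, not the script's code: words are sorted from least to most frequent, and each word goes to the bin in which its last occurrence falls along the cumulative total. The name `bin_words` and the tie-breaking by alphabetical order are hypothetical.

```python
def bin_words(counts, n_bins=10):
    """Split words into n_bins groups of roughly equal total occurrences.

    counts maps each word to its number of occurrences in Train.
    """
    bins = [[] for _ in range(n_bins)]
    total = sum(counts.values())
    if total == 0:
        return bins
    cumulative = 0
    # Least frequent words first; ties broken alphabetically (assumed).
    for word, count in sorted(counts.items(), key=lambda item: (item[1], item[0])):
        if count <= 0:
            continue
        cumulative += count
        # Bin that contains this word's last occurrence.
        index = (cumulative - 1) * n_bins // total
        bins[index].append(word)
    return bins
```

With this policy a very frequent word can account for several bins' worth of occurrences on its own, which leaves some bins empty; a real implementation may handle that case differently.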
### Graphics
**TrainValTest8_Random.py :** This script draws a bar chart showing the number of words per bin.
```bash
......