OpenQuizz
An application for managing educational content
CharSetProber Class Reference

Public Member Functions

def __init__ (self, lang_filter=None)
 
def reset (self)
 
def charset_name (self)
 
def feed (self, buf)
 
def state (self)
 
def get_confidence (self)
 

Static Public Member Functions

def filter_high_byte_only (buf)
 
def filter_international_words (buf)
 
def filter_with_english_letters (buf)
 

Data Fields

 lang_filter
 
 logger
 

Static Public Attributes

float SHORTCUT_THRESHOLD = 0.95
 

Constructor & Destructor Documentation

◆ __init__()

def __init__ (   self,
  lang_filter = None 
)

Member Function Documentation

◆ charset_name()

◆ feed()

◆ filter_high_byte_only()

def filter_high_byte_only (   buf)
static

◆ filter_international_words()

def filter_international_words (   buf)
static
We define three types of bytes:
alphabet: English letters [a-zA-Z]
international: international characters [\x80-\xFF]
marker: everything else [^a-zA-Z\x80-\xFF]

The input buffer can be thought of as a series of words delimited
by markers. This function keeps only the words that contain at
least one international character and discards all other words. Each
contiguous sequence of markers is replaced by a single ASCII space
character.

This filter applies to all scripts that do not use English characters.
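
The word-filtering rule described above can be illustrated with a minimal stand-alone sketch. It is not the library's own code: the function name `keep_international_words` and the two regex constants are hypothetical, and details such as trailing spaces in the real method's output may differ.

```python
import re

# Hypothetical helpers for illustration only; not part of CharSetProber.
# A "marker" is any byte that is neither an English letter nor a high byte.
MARKER_RUN = re.compile(rb"[^a-zA-Z\x80-\xFF]+")
HIGH_BYTE = re.compile(rb"[\x80-\xFF]")


def keep_international_words(buf: bytes) -> bytes:
    """Keep only words that contain at least one byte in \\x80-\\xFF."""
    # Split the buffer into words at every run of marker bytes.
    words = MARKER_RUN.split(buf)
    # Discard words made of English letters only (and empty fragments).
    kept = [word for word in words if HIGH_BYTE.search(word)]
    # Each run of markers between kept words collapses to one space.
    return b" ".join(kept)


sample = b"hello caf\xe9, na\xefve 123 world"
print(keep_international_words(sample))
```

In this sketch the purely English words `hello` and `world` are dropped, while the two words containing a high byte survive, separated by a single space.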

◆ filter_with_english_letters()

def filter_with_english_letters (   buf)
static
Returns a copy of ``buf`` that retains only the sequences of English
letters and high-byte characters that are not between <> characters.
Also retains English letters and high-byte characters immediately
before occurrences of >.

This filter can be applied to all scripts that contain both English
characters and extended ASCII characters, but it is currently used only
by ``Latin1Prober``.
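
As a rough illustration of the first sentence of this description, the following stand-alone sketch keeps runs of English letters and high bytes that lie outside `<...>` and joins them with single spaces. The function name `keep_letters_outside_tags` is hypothetical, and the sketch deliberately does not reproduce the second sentence (characters immediately before `>`), so its output can differ from the real method.

```python
def keep_letters_outside_tags(buf: bytes) -> bytes:
    """Keep runs of ASCII letters and high bytes found outside <...>."""
    runs = []              # completed runs of kept bytes
    current = bytearray()  # run currently being collected
    in_tag = False         # True while between '<' and '>'
    for i in range(len(buf)):
        ch = buf[i:i + 1]  # slice so that ch is bytes, not int
        if not in_tag and (ch >= b"\x80" or ch.isalpha()):
            current += ch
            continue
        # Any other byte, or any byte inside a tag, ends the current run.
        if current:
            runs.append(bytes(current))
            current = bytearray()
        if ch == b"<":
            in_tag = True
        elif ch == b">":
            in_tag = False
    if current:
        runs.append(bytes(current))
    return b" ".join(runs)


print(keep_letters_outside_tags(b"<b>caf\xe9</b> ok 42"))
```

Tag names and digits are discarded here, leaving only the letter and high-byte runs from the visible text.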

◆ get_confidence()

◆ reset()

◆ state()

def state (   self)

Reimplemented in HebrewProber.

Field Documentation

◆ lang_filter

lang_filter

◆ logger

logger

◆ SHORTCUT_THRESHOLD

float SHORTCUT_THRESHOLD = 0.95
static

The documentation for this class was generated from the following file: