Hanja
Getting Hanja characters for Korean words
 
src.data_processing.DataProcessor Class Reference

A class to process the data of the JSON files.


Public Member Functions

 __init__ (self, folder_path)
 
 extract_data (self)
 Extracts data from JSON files in the specified folder.
 
 process_data (self, raw_data)
 Processes raw data to extract and transform necessary fields for further use.
 
 extract_word (self, lemma_datas)
 Extracts the 'val' from the lemma element.
 
 extract_hanja (self, hanja_datas)
 Extracts the 'origin' feat value for hanja from the entry.
 
 extract_equivalents (self, sense_datas)
 Extracts the Equivalent data from the Sense.
 
 extract_pronounciation (self, word_form)
 Extracts the pronunciation data from the wordForm.
 
 extract_language (self, equivalents)
 Extracts the 'language' from the Equivalent.
 
 extract_lemma (self, equivalents)
 Extracts the 'lemma' from the Equivalent.
 
 extract_definition (self, equivalents)
 Extracts the 'definition' from the Equivalent.
 
 extract_korean_definition (self, entry)
 Extracts the 'definition' feat value for the Korean definition from the entry.
 
 read_hanja_file (self, file_path)
 Reads a Hanja data file line by line.
 
 process_hanja_data (self, lines)
 Processes Hanja file lines to extract structured data.
 
 reorder_hanja_results (self, hanja_results, hanja_characters)
 Reorders the Hanja results to follow the order of the given Hanja characters.
 

Public Attributes

 folder_path
 

Detailed Description

A class to process the data of the JSON files.

Constructor & Destructor Documentation

◆ __init__()

src.data_processing.DataProcessor.__init__ (   self,
  folder_path 
)
Initializes the DataProcessor with the folder path containing the JSON files.

Parameters
folder_path: Path to the folder with JSON files.

Member Function Documentation

◆ extract_data()

src.data_processing.DataProcessor.extract_data (   self)

Extracts data from JSON files in the specified folder.

Reads and combines the contents of all JSON files in the folder.

Returns
A list of dictionaries containing the combined data from all JSON files.
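
As a rough illustration of the behaviour described above, the following sketch combines every JSON file in a folder into one list. The function name `combine_json_files`, the `*.json` glob pattern, the UTF-8 encoding, and the handling of files that hold a single object are all assumptions made for this example; the actual `extract_data` implementation may differ.

```python
import json
from pathlib import Path


def combine_json_files(folder_path):
    """Hypothetical helper: load every *.json file in a folder into one list."""
    combined = []
    # Sort the paths so the result order is deterministic across platforms.
    for json_file in sorted(Path(folder_path).glob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            content = json.load(handle)
        # Assumption: a file holds either a list of entries or a single object.
        if isinstance(content, list):
            combined.extend(content)
        else:
            combined.append(content)
    return combined
```

Files without a `.json` suffix are skipped in this sketch, so stray notes or archives in the same folder do not break the load.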

◆ extract_definition()

src.data_processing.DataProcessor.extract_definition (   self,
  equivalents 
)

Extracts the 'definition' from the Equivalent.

Parameters
equivalents: A dictionary containing equivalent data.
Returns
The definition value, or None if not found.
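
The lookup described here (and by `extract_language` and `extract_lemma`) amounts to finding the 'feat' whose 'att' matches a given name. A minimal sketch, assuming the Equivalent dictionary stores its features under a `feat` key as either a list or a single dictionary; the helper name `find_feat_value` and the sample values are made up for illustration:

```python
def find_feat_value(element, att_name):
    """Hypothetical helper: return the 'val' whose 'att' equals att_name, else None."""
    feats = element.get("feat", [])
    # Assumption: a single 'feat' may be stored as a dict instead of a list.
    if isinstance(feats, dict):
        feats = [feats]
    for feat in feats:
        if feat.get("att") == att_name:
            return feat.get("val")
    return None


# Made-up sample data shaped like an Equivalent element.
equivalent = {
    "feat": [
        {"att": "language", "val": "영어"},
        {"att": "lemma", "val": "chimpanzee"},
        {"att": "definition", "val": "A large ape native to Africa."},
    ]
}
print(find_feat_value(equivalent, "definition"))
```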

◆ extract_equivalents()

src.data_processing.DataProcessor.extract_equivalents (   self,
  sense_datas 
)

Extracts the Equivalent data from the Sense.

Parameters
sense_datas: The Sense data, which can be a list or a dictionary.
Returns
The extracted Equivalent data, or an empty dictionary if not found.
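
Because the Sense data can arrive as either a list or a dictionary, one way to handle both shapes is to normalize to a list first. The sketch below assumes the Equivalent data sits under an `Equivalent` key and that the first match is the one wanted; the name `equivalents_from_sense` is hypothetical and the real method may choose differently.

```python
def equivalents_from_sense(sense_datas):
    """Hypothetical helper: return the first 'Equivalent' found, else an empty dict."""
    # Normalize: wrap a single Sense dictionary (or any other value) in a list.
    senses = sense_datas if isinstance(sense_datas, list) else [sense_datas]
    for sense in senses:
        # Skip anything that is not a dictionary, such as None.
        if isinstance(sense, dict) and "Equivalent" in sense:
            return sense["Equivalent"]
    return {}
```

Returning an empty dictionary rather than None lets callers chain further lookups without an extra check.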

◆ extract_hanja()

src.data_processing.DataProcessor.extract_hanja (   self,
  hanja_datas 
)

Extracts the 'origin' feat value for hanja from the entry.

Parameters
hanja_datas: The hanja data, which can be a list or a dictionary.
Returns
The extracted hanja value, or None if not found.

◆ extract_korean_definition()

src.data_processing.DataProcessor.extract_korean_definition (   self,
  entry 
)

Extracts the 'definition' feat value for the Korean definition from the entry.


Parameters
entry: A dictionary representing a single lexical entry.
Returns
The Korean definition if present, otherwise None.

◆ extract_language()

src.data_processing.DataProcessor.extract_language (   self,
  equivalents 
)

Extracts the 'language' from the Equivalent.

Parameters
equivalents: A dictionary containing equivalent data.
Returns
The language value, or None if not found.

◆ extract_lemma()

src.data_processing.DataProcessor.extract_lemma (   self,
  equivalents 
)

Extracts the 'lemma' from the Equivalent.

Parameters
equivalents: A dictionary containing equivalent data.
Returns
The lemma value, or None if not found.

◆ extract_pronounciation()

src.data_processing.DataProcessor.extract_pronounciation (   self,
  word_form 
)

Extracts the pronunciation data from the wordForm.

Parameters
word_form: The wordForm data, which can be a list or a dictionary.
Returns
The extracted sound data, or None if not found.

◆ extract_word()

src.data_processing.DataProcessor.extract_word (   self,
  lemma_datas 
)

Extracts the 'val' from the lemma element.

Parameters
lemma_datas: The lemma data, which can be a list or a dictionary.
Returns
The extracted word value, or None if not found.

◆ process_data()

src.data_processing.DataProcessor.process_data (   self,
  raw_data 
)

Processes raw data to extract and transform necessary fields for further use.

Parameters
raw_data: A list of dictionaries representing raw data entries.
Returns
A list of dictionaries with processed and relevant data.

lexical_entries: Represents a single lexical entry in the dictionary, containing various features of the word.

Lemma:
  • A list of 'feat' elements that represent the main word form.
  • Each 'feat' contains:
    • 'att': The attribute name, typically 'writtenForm' for the main word.
    • 'val': The actual value, which is the word itself (e.g., '침팬지').

RelatedForm:
  • Represents any related forms of the word, such as variations or inflections.
  • Typically contains 'feat' elements with 'att' representing the type of form (e.g., 'variant') and 'val' representing the word form (e.g., '침팬치').

Sense:
  • Contains detailed information about the meaning(s) of the word.
  • Typically contains the following sub-elements:
    • Equivalent: A list of 'feat' elements representing the word's equivalent forms in other languages. Each 'feat' contains:
      • 'att': The attribute name, such as 'language', 'lemma', or 'definition'.
      • 'val': The corresponding value for the attribute, such as:
        • 'language' -> The language name (e.g., '영어' for English, '프랑스어' for French).
        • 'lemma' -> The word in the equivalent language.
        • 'definition' -> The definition of the word in the equivalent language.
    • SenseExample:
      • Contains example usage of the word, typically in context.
      • 'feat': A list containing:
        • 'att': 'type' represents the type of example (e.g., '문장' for sentence, '대화' for conversation).
        • 'val': The actual example text.
    • att: Could represent the identifier for the sense.
    • feat: Contains the different specifications:
      • 'att': 'type' represents different specifications of the word such as 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
      • 'val': The actual value of the specification.
    • val: Could represent the id of the word.

WordForm:
  • Contains pronunciation information or other related word forms.
  • Typically contains 'feat' elements with:
    • 'att': 'pronunciation' for pronunciation details.
    • 'val': The value for pronunciation or another related form.

att:
  • Can represent various attributes or features of the word, such as:
    • 'id': The unique identifier for the lexical entry.
    • 'partOfSpeech': The part of speech (e.g., 'noun', 'verb').
    • 'origin': The origin of the word (e.g., '한자' for Hanja-based words).

feat:
  • Contains various additional features of the word, such as:
    • 'att': Could include attributes like 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
    • 'val': The actual value for each feature (e.g., a specific homonym number, part of speech, or origin of the word).

val:
  • Can represent a variety of values such as the 'id' of the word, or specific properties like part of speech or lexical unit.
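
To make the entry layout above more concrete, here is a small sketch that flattens one entry into a plain dictionary. The exact nesting (for example, `Lemma` and `WordForm` as dictionaries with a `feat` key), the helper names `feat_value` and `summarize_entry`, the output field names, and all sample values are assumptions for illustration only; the fields actually produced by `process_data` are not specified on this page.

```python
def feat_value(element, att_name):
    """Hypothetical helper: return the 'val' whose 'att' equals att_name, else None."""
    feats = element.get("feat", []) if isinstance(element, dict) else []
    # Assumption: a single 'feat' may be stored as a dict instead of a list.
    if isinstance(feats, dict):
        feats = [feats]
    for feat in feats:
        if feat.get("att") == att_name:
            return feat.get("val")
    return None


def first_or_empty(value):
    """Return the first item of a list, the value itself if it is a dict, else {}."""
    if isinstance(value, list):
        return value[0] if value else {}
    return value if isinstance(value, dict) else {}


def summarize_entry(entry):
    """Hypothetical helper: flatten one lexical entry into a plain dictionary."""
    sense = first_or_empty(entry.get("Sense"))
    equivalent = first_or_empty(sense.get("Equivalent"))
    return {
        "word": feat_value(entry.get("Lemma"), "writtenForm"),
        "pronunciation": feat_value(entry.get("WordForm"), "pronunciation"),
        "part_of_speech": feat_value(entry, "partOfSpeech"),
        "language": feat_value(equivalent, "language"),
        "lemma": feat_value(equivalent, "lemma"),
    }


# Made-up sample entry following the layout described above.
sample_entry = {
    "att": "id",
    "val": "sample-id",
    "feat": [{"att": "partOfSpeech", "val": "명사"}],
    "Lemma": {"feat": {"att": "writtenForm", "val": "침팬지"}},
    "WordForm": {"feat": [{"att": "pronunciation", "val": "침팬지"}]},
    "Sense": {
        "Equivalent": [
            {"feat": [{"att": "language", "val": "영어"},
                      {"att": "lemma", "val": "chimpanzee"}]}
        ]
    },
}
print(summarize_entry(sample_entry))
```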

◆ process_hanja_data()

src.data_processing.DataProcessor.process_hanja_data (   self,
  lines 
)

Processes Hanja file lines to extract structured data.

Groups Hanja characters with their corresponding Korean readings and definitions. This step is slow because of the translation process.

Parameters
lines: A list of lines from the Hanja data file.
Returns
A dictionary where keys are Hanja characters and values are lists of corresponding Korean readings and definitions.

◆ read_hanja_file()

src.data_processing.DataProcessor.read_hanja_file (   self,
  file_path 
)

Reads a Hanja data file line by line.

Parameters
file_path: Path to the Hanja data file.
Returns
A list of lines read from the file.
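
Reading a file line by line into a list can be sketched as follows. The function name `read_lines`, the UTF-8 encoding, and the choice to strip the trailing newline from each line are assumptions; the actual method may keep the newlines or use a different encoding.

```python
def read_lines(file_path):
    """Hypothetical helper: return the lines of a text file as a list of strings."""
    with open(file_path, encoding="utf-8") as handle:
        # Strip only the trailing newline so other whitespace in the data survives.
        return [line.rstrip("\n") for line in handle]
```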

◆ reorder_hanja_results()

src.data_processing.DataProcessor.reorder_hanja_results (   self,
  hanja_results,
  hanja_characters 
)

Reorders the Hanja results to follow the order of the given Hanja characters.

Parameters
hanja_results: The unordered list of Hanja results.
hanja_characters: The Hanja characters in the desired order.
Returns
The Hanja results, reordered to follow hanja_characters.
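
One possible way to perform such a reordering is to index the results by character and then walk the ordered list. This sketch assumes each result is a dictionary with a `hanja` key holding a single character, which this page does not confirm; the function name `reorder_by_characters` and the sample records are hypothetical.

```python
def reorder_by_characters(hanja_results, hanja_characters):
    """Hypothetical helper: return results in the order given by hanja_characters."""
    # Index results by character, keeping the first result seen for each one.
    by_character = {}
    for result in hanja_results:
        by_character.setdefault(result["hanja"], result)
    # Walk the ordered characters and skip any that have no result.
    return [by_character[ch] for ch in hanja_characters if ch in by_character]


# Made-up sample data: results arrive out of order for the word 漢字.
unordered = [
    {"hanja": "字", "reading": "자"},
    {"hanja": "漢", "reading": "한"},
]
print(reorder_by_characters(unordered, ["漢", "字"]))
```

Building the index first keeps the reordering linear in the number of results, instead of scanning the unordered list once per character.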

Member Data Documentation

◆ folder_path

src.data_processing.DataProcessor.folder_path

The documentation for this class was generated from the following file: