A class to process the data of the json file.
| __init__ (self, folder_path) | Initializes the processor with the folder path containing the JSON files. |
| extract_data (self) | Extracts data from JSON files in the specified folder. |
| process_data (self, raw_data) | Processes raw data to extract and transform necessary fields for further use. |
| extract_word (self, lemma_datas) | Extracts the 'val' from the lemma element. |
| extract_hanja (self, hanja_datas) | Extracts the 'origin' feat value for hanja from the entry. |
| extract_equivalents (self, sense_datas) | Extracts the Equivalent data from the Sense. |
| extract_pronounciation (self, word_form) | Extracts the pronunciation data from the wordForm. |
| extract_language (self, equivalents) | Extracts the 'language' from the Equivalent. |
| extract_lemma (self, equivalents) | Extracts the 'lemma' from the Equivalent. |
| extract_definition (self, equivalents) | Extracts the 'definition' from the Equivalent. |
| extract_korean_definition (self, entry) | Extracts the 'definition' feat value for the Korean definition from the entry. |
| read_hanja_file (self, file_path) | Reads a Hanja data file line by line. |
| process_hanja_data (self, lines) | Processes Hanja file lines to extract structured data. |
| reorder_hanja_results (self, hanja_results, hanja_characters) | Reorders the Hanja results to follow the order of the reference character list. |
◆ __init__()
src.data_processing.DataProcessor.__init__(self, folder_path)
Initializes the DataProcessor with the path to the folder containing the JSON files.
- Parameters
-
| folder_path | Path to the folder with JSON files. |
◆ extract_data()
src.data_processing.DataProcessor.extract_data(self)
Extracts data from JSON files in the specified folder.
Reads and combines the contents of all JSON files in the folder.
- Returns
- A list of dictionaries containing the combined data from all JSON files.
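A minimal sketch of how such a folder-wide read could look, assuming every input file ends in .json and holds either a list of entries or a single entry object. The helper name load_json_folder and the handling of the top-level shape are assumptions for illustration, not the actual extract_data implementation.

```python
import json
from pathlib import Path


def load_json_folder(folder_path):
    """Read every *.json file in folder_path and combine the contents.

    Hypothetical helper: the real extract_data may differ in file
    discovery, ordering, and error handling.
    """
    combined = []
    # Sort the paths so the result order is deterministic across platforms.
    for json_file in sorted(Path(folder_path).glob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            content = json.load(handle)
        # Assumption: a file holds either a list of entries or one entry.
        if isinstance(content, list):
            combined.extend(content)
        else:
            combined.append(content)
    return combined
```

Sorting the file paths is a design choice made here so that repeated runs yield the same entry order.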
◆ extract_definition()
src.data_processing.DataProcessor.extract_definition(self, equivalents)
Extracts the 'definition' from the Equivalent.
- Parameters
-
| equivalents | A dictionary containing equivalent data. |
- Returns
- The definition value, or None if not found.
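The lookup by attribute name that extract_definition, extract_language, and extract_lemma describe can be sketched as below. This assumes the Equivalent dictionary stores its pairs under a 'feat' key as {'att': ..., 'val': ...} items; the helper name find_feat_value and the sample values are hypothetical.

```python
def find_feat_value(element, attribute):
    """Return the 'val' of the first feat whose 'att' equals attribute.

    Hypothetical helper; assumes element looks like
    {"feat": [{"att": ..., "val": ...}, ...]} or holds one feat dict.
    """
    if not isinstance(element, dict):
        return None
    feats = element.get("feat") or []
    # A single feat may appear as a bare dictionary instead of a list.
    if isinstance(feats, dict):
        feats = [feats]
    for feat in feats:
        if isinstance(feat, dict) and feat.get("att") == attribute:
            return feat.get("val")
    return None


# Illustrative data only, not taken from the real JSON files.
equivalent = {
    "feat": [
        {"att": "language", "val": "영어"},
        {"att": "lemma", "val": "chimpanzee"},
        {"att": "definition", "val": "sample definition text"},
    ]
}
definition = find_feat_value(equivalent, "definition")
```

Returning None for a missing attribute mirrors the "or None if not found" behaviour stated above.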
◆ extract_equivalents()
src.data_processing.DataProcessor.extract_equivalents(self, sense_datas)
Extracts the Equivalent data from the Sense.
- Parameters
-
| sense_datas | The Sense data, which can be a list or a dictionary. |
- Returns
- The extracted Equivalent data, or an empty dictionary if not found.
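Several methods on this page accept data that "can be a list or a dictionary". One way to handle both shapes is to normalize to a list first, as in this sketch. The helper name first_equivalent is hypothetical, and returning only the first Equivalent found is an assumption; the real method may collect or merge several.

```python
def first_equivalent(sense_datas):
    """Return the first 'Equivalent' found in the Sense data, else {}."""
    # Normalize: a lone Sense dictionary is treated as a one-item list.
    if isinstance(sense_datas, dict):
        senses = [sense_datas]
    elif isinstance(sense_datas, list):
        senses = sense_datas
    else:
        senses = []
    for sense in senses:
        # Skip malformed items and senses without Equivalent data.
        if isinstance(sense, dict) and sense.get("Equivalent"):
            return sense["Equivalent"]
    return {}
```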
◆ extract_hanja()
src.data_processing.DataProcessor.extract_hanja(self, hanja_datas)
Extracts the 'origin' feat value for hanja from the entry.
- Parameters
-
| hanja_datas | The hanja data, which can be a list or a dictionary. |
- Returns
- The extracted hanja value, or None if not found.
◆ extract_korean_definition()
src.data_processing.DataProcessor.extract_korean_definition(self, entry)
Extracts the 'definition' feat value for the Korean definition from the entry.
- Parameters
-
| entry | A dictionary representing a single lexical entry. |
- Returns
- The Korean definition if present, otherwise None.
◆ extract_language()
src.data_processing.DataProcessor.extract_language(self, equivalents)
Extracts the 'language' from the Equivalent.
- Parameters
-
| equivalents | A dictionary containing equivalent data. |
- Returns
- The language value, or None if not found.
◆ extract_lemma()
src.data_processing.DataProcessor.extract_lemma(self, equivalents)
Extracts the 'lemma' from the Equivalent.
- Parameters
-
| equivalents | A dictionary containing equivalent data. |
- Returns
- The lemma value, or None if not found.
◆ extract_pronounciation()
src.data_processing.DataProcessor.extract_pronounciation(self, word_form)
Extracts the pronunciation data from the wordForm.
- Parameters
-
| word_form | The wordForm data, which can be a list or a dictionary. |
- Returns
- The extracted sound data, or None if not found.
◆ extract_word()
src.data_processing.DataProcessor.extract_word(self, lemma_datas)
Extracts the 'val' from the lemma element.
- Parameters
-
| lemma_datas | The lemma data, which can be a list or a dictionary. |
- Returns
- The extracted word value, or None if not found.
◆ process_data()
src.data_processing.DataProcessor.process_data(self, raw_data)
Processes raw data to extract and transform necessary fields for further use.
- Parameters
-
| raw_data | A list of dictionaries representing raw data entries. |
- Returns
- A list of dictionaries with processed and relevant data.
lexical_entries: Represents a single lexical entry in the dictionary, containing various features of the word.
- Lemma:
  - A list of 'feat' elements that represent the main word form.
  - Each 'feat' contains:
    - 'att': The attribute name, typically 'writtenForm' for the main word.
    - 'val': The actual value, which is the word itself (e.g., '침팬지').
- RelatedForm:
  - Represents any related forms of the word, such as variations or inflections.
  - Typically contains 'feat' elements with 'att' representing the type of form (e.g., 'variant') and 'val' representing the word form (e.g., '침팬치').
- Sense:
  - Contains detailed information about the meaning(s) of the word.
  - Typically contains the following sub-elements:
    - Equivalent: A list of 'feat' elements representing the word's equivalent forms in other languages. Each 'feat' contains:
      - 'att': The attribute name, such as 'language', 'lemma', or 'definition'.
      - 'val': The corresponding value for the attribute, such as:
        - 'language' -> The language name (e.g., '영어' for English, '프랑스어' for French).
        - 'lemma' -> The word in the equivalent language.
        - 'definition' -> The definition of the word in the equivalent language.
    - SenseExample:
      - Contains example usage of the word, typically in context.
      - 'feat': A list containing:
        - 'att': 'type' represents the type of example (e.g., '문장' for sentence, '대화' for conversation).
        - 'val': The actual example text.
    - att: Could represent the identifier for the sense.
    - feat: Contains the different specifications.
      - 'att': 'type' represents different specifications of the word such as 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
      - 'val': The actual value of the specification.
    - val: Could represent the id of the word.
- WordForm:
  - Contains pronunciation information or other related word forms.
  - Typically contains 'feat' elements with:
    - 'att': 'pronunciation' for pronunciation details.
    - 'val': The value for pronunciation or another related form.
- att:
  - Can represent various attributes or features of the word, such as:
    - 'id': The unique identifier for the lexical entry.
    - 'partOfSpeech': The part of speech (e.g., 'noun', 'verb').
    - 'origin': The origin of the word (e.g., '한자' for Hanja-based words).
- feat:
  - Contains various additional features of the word, such as:
    - 'att': Could include attributes like 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
    - 'val': The actual value for each feature (e.g., a specific homonym number, part of speech, or origin of the word).
- val:
  - Can represent a variety of values such as the 'id' of the word, or specific properties like part of speech or lexical unit.
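To make the entry layout described above more concrete, here is a sketch of one entry written as a Python dictionary, together with a way to flatten its 'feat' pairs into a processed record. The key names come from the description; the exact nesting, the placeholder id, the sample values, and the helper feats_to_dict are assumptions for illustration.

```python
def feats_to_dict(element):
    """Collapse an element's 'feat' entries into a plain {att: val} dict."""
    feats = element.get("feat") or []
    # A single feat may appear as a bare dictionary instead of a list.
    if isinstance(feats, dict):
        feats = [feats]
    return {f["att"]: f["val"] for f in feats if "att" in f and "val" in f}


# Hypothetical entry: values are illustrative, not real file content.
lexical_entry = {
    "att": "id",
    "val": "00000",  # placeholder identifier
    "feat": [{"att": "partOfSpeech", "val": "noun"}],
    "Lemma": {"feat": [{"att": "writtenForm", "val": "침팬지"}]},
    "WordForm": {"feat": [{"att": "pronunciation", "val": "침팬지"}]},
    "Sense": {
        "Equivalent": {
            "feat": [
                {"att": "language", "val": "영어"},
                {"att": "lemma", "val": "chimpanzee"},
            ]
        }
    },
}

# Build a flat record similar in spirit to what process_data returns.
equivalent = feats_to_dict(lexical_entry["Sense"]["Equivalent"])
record = {
    "word": feats_to_dict(lexical_entry["Lemma"]).get("writtenForm"),
    "pronunciation": feats_to_dict(lexical_entry["WordForm"]).get("pronunciation"),
    "part_of_speech": feats_to_dict(lexical_entry).get("partOfSpeech"),
    "language": equivalent.get("language"),
    "lemma": equivalent.get("lemma"),
}
```

The field names in record are chosen for this example; the keys of the real processed dictionaries are not stated on this page.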
◆ process_hanja_data()
src.data_processing.DataProcessor.process_hanja_data(self, lines)
Processes Hanja file lines to extract structured data.
Groups Hanja characters with their corresponding Korean readings and definitions. This method is slow because of the translation process.
- Parameters
-
| lines | A list of lines from the Hanja data file. |
- Returns
- A dictionary where keys are Hanja characters and values are lists of corresponding Korean readings and definitions.
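The grouping step can be sketched as follows. The layout of the Hanja data file is not documented here, so this example assumes one tab-separated "character, reading, definition" triple per line, which is purely hypothetical. The translation step mentioned above is left out, and the helper name group_hanja_lines is also an assumption.

```python
from collections import defaultdict


def group_hanja_lines(lines, separator="\t"):
    """Group readings and definitions by Hanja character.

    Assumed (hypothetical) line format: character<TAB>reading<TAB>definition.
    """
    grouped = defaultdict(list)
    for line in lines:
        parts = line.strip().split(separator)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        character, reading, definition = parts
        grouped[character].append({"reading": reading, "definition": definition})
    # Convert to a plain dict so missing keys raise KeyError as usual.
    return dict(grouped)
```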
◆ read_hanja_file()
src.data_processing.DataProcessor.read_hanja_file(self, file_path)
Reads a Hanja data file line by line.
- Parameters
-
| file_path | Path to the Hanja data file. |
- Returns
- A list of lines read from the file.
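A line-by-line read might look like the sketch below. UTF-8 encoding and the removal of trailing newline characters are assumptions; the real method may keep the line endings or use a different encoding. The helper name read_lines is hypothetical.

```python
def read_lines(file_path, encoding="utf-8"):
    """Return the file's lines without their trailing newline characters."""
    with open(file_path, encoding=encoding) as handle:
        # Iterating over the handle yields one line at a time.
        return [line.rstrip("\n") for line in handle]
```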
◆ reorder_hanja_results()
src.data_processing.DataProcessor.reorder_hanja_results(self, hanja_results, hanja_characters)
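Reorders the Hanja results to follow the order of the reference character list.
- Parameters
-
| hanja_results | Unordered list of Hanja results. |
| hanja_characters | Ordered list of Hanja characters. |
- Returns
- The Hanja results in the order given by hanja_characters (ordered_results).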
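A sketch of the reordering, assuming each result is a dictionary that stores its character under a 'hanja' key. That key name, the helper name reorder_by_reference, and the choice to drop results missing from the reference list are all assumptions.

```python
def reorder_by_reference(results, reference, key="hanja"):
    """Return results arranged in the order given by reference.

    Assumes each result is a dict whose `key` field holds the character.
    Results whose character is not in reference are dropped.
    """
    # Index the results by character for constant-time lookup.
    lookup = {result[key]: result for result in results}
    return [lookup[character] for character in reference if character in lookup]


unordered = [
    {"hanja": "字", "reading": "자"},
    {"hanja": "漢", "reading": "한"},
]
ordered = reorder_by_reference(unordered, ["漢", "字"])
```

Building a lookup dictionary first avoids rescanning the result list for every character in the reference.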
◆ folder_path
src.data_processing.DataProcessor.folder_path