A class to process the data of the JSON files.

Public Member Functions
	__init__ (self, folder_path)

	extract_data (self)
	Extracts data from JSON files in the specified folder.

	process_data (self, raw_data)
	Processes raw data to extract and transform necessary fields for further use.

	extract_word (self, lemma_datas)
	Extracts the 'val' from the lemma element.

	extract_hanja (self, hanja_datas)
	Extracts the 'origin' feat value for hanja from the entry.

	extract_equivalents (self, sense_datas)
	Extracts the Equivalent data from the Sense.

	extract_pronounciation (self, word_form)
	Extracts the pronunciation data from the wordForm.

	extract_language (self, equivalents)
	Extracts the 'language' from the Equivalent.

	extract_lemma (self, equivalents)
	Extracts the 'lemma' from the Equivalent.

	extract_definition (self, equivalents)
	Extracts the 'definition' from the Equivalent.

	extract_korean_definition (self, entry)
	Extracts the 'definition' feat value for the Korean definition from the entry.

	read_hanja_file (self, file_path)
	Reads a Hanja data file line by line.

	process_hanja_data (self, lines)
	Processes Hanja file lines to extract structured data.

	reorder_hanja_results (self, hanja_results, hanja_characters)
	Reorders the Hanja results to match the order of the reference list.

Public Attributes
	folder_path

Detailed Description

A class to process the data of the JSON files.

Constructor & Destructor Documentation

◆ __init__()

src.data_processing.DataProcessor.__init__	(	self,
		folder_path
	)

Initializes the DataProcessor with the path to the folder containing the JSON files.

Parameters

folder_path Path to the folder with JSON files.

Member Function Documentation

◆ extract_data()

src.data_processing.DataProcessor.extract_data ( self )

Extracts data from JSON files in the specified folder.

Reads and combines the contents of all JSON files in the folder.

Returns: A list of dictionaries containing the combined data from all JSON files.
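
The read-and-combine step described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the function name `load_json_folder`, the `*.json` glob pattern, the UTF-8 encoding, and the handling of files that hold a single object instead of a list are all assumptions.

```python
import json
from pathlib import Path


def load_json_folder(folder_path):
    """Read every *.json file in folder_path and combine the contents into one list."""
    combined = []
    # sorted() keeps the file order deterministic across runs
    for json_file in sorted(Path(folder_path).glob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            content = json.load(handle)
        # Assumption: a file holds either a list of entries or a single entry
        if isinstance(content, list):
            combined.extend(content)
        else:
            combined.append(content)
    return combined
```

Files without a `.json` suffix are skipped by the glob pattern, so notes or data files of other types in the same folder do not affect the result.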

◆ extract_definition()

src.data_processing.DataProcessor.extract_definition	(	self,
		equivalents
	)

Extracts the 'definition' from the Equivalent.

Parameters

equivalents A dictionary containing equivalent data.

Returns: The definition value, or None if not found.
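
A lookup of this kind can be sketched with one small helper, which would serve extract_language and extract_lemma equally well. The helper name `find_feat_value` is hypothetical, and the sketch assumes the Equivalent dictionary keeps its 'att'/'val' pairs under a 'feat' key, as the process_data() description suggests.

```python
def find_feat_value(equivalent, attribute):
    """Return the 'val' of the first feat whose 'att' equals attribute, else None."""
    feats = equivalent.get("feat", [])
    # Assumption: a single feat may be stored as a bare dict instead of a list
    if isinstance(feats, dict):
        feats = [feats]
    for feat in feats:
        if feat.get("att") == attribute:
            return feat.get("val")
    return None


# Illustrative data only
equivalent = {
    "feat": [
        {"att": "language", "val": "영어"},
        {"att": "lemma", "val": "chimpanzee"},
    ]
}
print(find_feat_value(equivalent, "language"))    # 영어
print(find_feat_value(equivalent, "definition"))  # None
```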

◆ extract_equivalents()

src.data_processing.DataProcessor.extract_equivalents	(	self,
		sense_datas
	)

Extracts the Equivalent data from the Sense.

Parameters

sense_datas The Sense data, which can be a list or a dictionary.

Returns: The extracted Equivalent data, or an empty dictionary if not found.
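
Because the Sense data may arrive as a list or as a single dictionary, one way to handle both shapes is to normalize to a list first. The sketch below assumes the key is named 'Equivalent' and that the first Sense carrying one is the one wanted; the function name `find_equivalent` is hypothetical.

```python
def find_equivalent(sense_data):
    """Return the first 'Equivalent' found in the Sense data, or {} if there is none."""
    # Wrap a single Sense dictionary so both shapes go through the same loop
    senses = sense_data if isinstance(sense_data, list) else [sense_data]
    for sense in senses:
        if isinstance(sense, dict) and "Equivalent" in sense:
            return sense["Equivalent"]
    return {}
```

Returning an empty dictionary rather than None lets callers chain a `.get()` on the result without an extra check.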

◆ extract_hanja()

src.data_processing.DataProcessor.extract_hanja	(	self,
		hanja_datas
	)

Extracts the 'origin' feat value for hanja from the entry.

Parameters

hanja_datas The hanja data, which can be a list or a dictionary.

Returns: The extracted hanja value, or None if not found.


◆ extract_korean_definition()

src.data_processing.DataProcessor.extract_korean_definition	(	self,
		entry
	)

Extracts the 'definition' feat value for the Korean definition from the entry.

Parameters

entry A dictionary representing a single lexical entry.

Returns: The Korean definition if present, otherwise None.

◆ extract_language()

src.data_processing.DataProcessor.extract_language	(	self,
		equivalents
	)

Extracts the 'language' from the Equivalent.

Parameters

equivalents A dictionary containing equivalent data.

Returns: The language value, or None if not found.

◆ extract_lemma()

src.data_processing.DataProcessor.extract_lemma	(	self,
		equivalents
	)

Extracts the 'lemma' from the Equivalent.

Parameters

equivalents A dictionary containing equivalent data.

Returns: The lemma value, or None if not found.

◆ extract_pronounciation()

src.data_processing.DataProcessor.extract_pronounciation	(	self,
		word_form
	)

Extracts the pronunciation data from the wordForm.

Parameters

word_form The wordForm data, which can be a list or a dictionary.

Returns: The extracted sound data, or None if not found.

◆ extract_word()

src.data_processing.DataProcessor.extract_word	(	self,
		lemma_datas
	)

Extracts the 'val' from the lemma element.

Parameters

lemma_datas The lemma data, which can be a list or a dictionary.

Returns: The extracted word value, or None if not found.

◆ process_data()

src.data_processing.DataProcessor.process_data	(	self,
		raw_data
	)

Processes raw data to extract and transform necessary fields for further use.

Parameters

raw_data A list of dictionaries representing raw data entries.

Returns: A list of dictionaries with processed and relevant data.

lexical_entries
Represents a single lexical entry in the dictionary, containing various features of the word.

Lemma:
A list of 'feat' elements that represent the main word form.
Each 'feat' contains:
- 'att': The attribute name, typically 'writtenForm' for the main word.
- 'val': The actual value, which is the word itself (e.g., '침팬지').

RelatedForm:
Represents any related forms of the word, such as variations or inflections.
Typically contains 'feat' elements with 'att' representing the type of form (e.g., 'variant') and 'val' representing the word form (e.g., '침팬치').

Sense:
Contains detailed information about the meaning(s) of the word.
Typically contains the following sub-elements:
- Equivalent: A list of 'feat' elements representing the word's equivalent forms in other languages. Each 'feat' contains:
  - 'att': The attribute name, such as 'language', 'lemma', or 'definition'.
  - 'val': The corresponding value for the attribute, such as:
    - 'language' -> The language name (e.g., '영어' for English, '프랑스어' for French).
    - 'lemma' -> The word in the equivalent language.
    - 'definition' -> The definition of the word in the equivalent language.
- SenseExample:
  - Contains example usage of the word, typically in context.
  - 'feat': A list containing:
    - 'att': 'type' represents the type of example (e.g., '문장' for sentence, '대화' for conversation).
    - 'val': The actual example text.
- att: Could represent the identifier for the sense.
- feat: Contains the different specifications of the word.
  - 'att': 'type' represents different specifications of the word such as 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
  - 'val': The actual value of the specification.
- val: Could represent the id of the word.

WordForm:
Contains pronunciation information or other related word forms.
Typically contains 'feat' elements with:
- 'att': 'pronunciation' for pronunciation details.
- 'val': The value for pronunciation or another related form.

att:
Can represent various attributes or features of the word, such as:
- 'id': The unique identifier for the lexical entry.
- 'partOfSpeech': The part of speech (e.g., 'noun', 'verb').
- 'origin': The origin of the word (e.g., '한자' for Hanja-based words).

feat:
Contains various additional features of the word, such as:
- 'att': Could include attributes like 'homonym_number', 'partOfSpeech', 'origin', 'vocabularyLevel', 'lexicalUnit'.
- 'val': The actual value for each feature (e.g., a specific homonym number, part of speech, or origin of the word).

val:
Can represent a variety of values such as the 'id' of the word, or specific properties like part of speech or lexical unit.
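
To make the structure described above concrete, the sketch below builds one illustrative entry and flattens it into a record. The exact nesting (for example, whether 'Lemma' wraps its feats in a 'feat' key), all sample values, the helper `feats_to_dict`, and the output field names are assumptions made for illustration; they are not taken from the real data or from process_data() itself.

```python
# Illustrative entry only: keys follow the description above, values are made up.
entry = {
    "att": "id",
    "val": "12345",  # hypothetical identifier
    "feat": [
        {"att": "partOfSpeech", "val": "noun"},
        {"att": "origin", "val": "chimpanzee"},
    ],
    "Lemma": {"feat": [{"att": "writtenForm", "val": "침팬지"}]},
    "WordForm": {"feat": [{"att": "pronunciation", "val": "침팬지"}]},
    "Sense": {
        "Equivalent": {
            "feat": [
                {"att": "language", "val": "영어"},
                {"att": "lemma", "val": "chimpanzee"},
                {"att": "definition", "val": "example definition text"},
            ]
        }
    },
}


def feats_to_dict(node):
    """Collapse a node's 'feat' list into a plain {att: val} dictionary."""
    feats = node.get("feat", [])
    # Assumption: a single feat may be stored as a bare dict instead of a list
    if isinstance(feats, dict):
        feats = [feats]
    return {feat["att"]: feat["val"] for feat in feats}


# Flatten the nested entry into one record (field names are hypothetical)
record = {
    "word": feats_to_dict(entry["Lemma"]).get("writtenForm"),
    "pronunciation": feats_to_dict(entry["WordForm"]).get("pronunciation"),
    "part_of_speech": feats_to_dict(entry).get("partOfSpeech"),
    **feats_to_dict(entry["Sense"]["Equivalent"]),
}
```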

◆ process_hanja_data()

src.data_processing.DataProcessor.process_hanja_data	(	self,
		lines
	)

Processes Hanja file lines to extract structured data.

Groups Hanja characters, their corresponding Korean readings, and definitions. This method is slow because of the translation step.

Parameters

lines A list of lines from the Hanja data file.

Returns: A dictionary where keys are Hanja characters and values are lists of corresponding Korean readings and definitions.

◆ read_hanja_file()

src.data_processing.DataProcessor.read_hanja_file	(	self,
		file_path
	)

Reads a Hanja data file line by line.

Parameters

file_path Path to the Hanja data file.

Returns: A list of lines read from the file.
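
Reading a file into a list of lines can be sketched as below. The UTF-8 encoding and the stripping of trailing newlines are assumptions; the actual method may keep the newline characters. The function name `read_lines` is hypothetical.

```python
def read_lines(file_path):
    """Return the lines of a text file as a list, without trailing newlines."""
    # Assumption: the Hanja file is UTF-8 encoded
    with open(file_path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]
```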

◆ reorder_hanja_results()

src.data_processing.DataProcessor.reorder_hanja_results	(	self,
		hanja_results,
		hanja_characters
	)

Reorders the Hanja results to match the order of the reference list.

Parameters

hanja_results	Unordered hanja list
hanja_characters	Ordered hanja list

Returns: ordered_results
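
One way to reorder results against a reference list is to index them by character and then walk the reference order. The sketch assumes each result is a dictionary that stores its Hanja under a 'character' key; that key, the sample readings' field name, and the function name `reorder_by_reference` are hypothetical, and results missing from the reference list are simply dropped.

```python
def reorder_by_reference(results, reference_order, key="character"):
    """Return results arranged in the order given by reference_order."""
    # Index the unordered results by their Hanja character for O(1) lookup
    by_character = {item[key]: item for item in results}
    # Walk the reference list and skip characters that have no result
    return [by_character[char] for char in reference_order if char in by_character]


# Illustrative data only
unordered = [
    {"character": "國", "reading": "국"},
    {"character": "韓", "reading": "한"},
]
ordered = reorder_by_reference(unordered, ["韓", "國"])
print([item["character"] for item in ordered])  # ['韓', '國']
```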

Member Data Documentation

◆ folder_path

src.data_processing.DataProcessor.folder_path

The documentation for this class was generated from the following file:

src/data_processing.py
