Modules

This project is a Data Science application built with FastAPI, designed to facilitate model training, prediction, and data processing. The application uses Poetry for dependency management and follows modular design principles to ensure scalability and maintainability.

Modules:

  • configuration: Contains configuration settings for the application, including environment variables, paths, and system-specific configurations.
  • component: Includes reusable components and helper classes for various processes, such as data manipulation and interaction with external services.
  • constant: Defines constant values, enumerations, and other static data used throughout the application.
  • utils: Contains utility functions that perform common tasks such as reading and writing files, processing images, and other system operations.
  • pipeline: Implements the core stages of the data science pipeline, including data ingestion, validation, transformation, model training, and evaluation.
  • routes: Defines the FastAPI routes for triggering model training, making predictions, and providing health checks for the application.

Features
  • Model training and prediction workflows via FastAPI.
  • Modular and extensible design to add new features.
  • Dependency management with Poetry.
  • Built for data scientists and developers working with machine learning models in production environments.

This application is designed to help automate and streamline the workflow of training machine learning models and making predictions via an API interface, with easy-to-understand routes and clear separation of concerns between different modules.
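
As an illustrative sketch of how these pieces might fit together in a FastAPI entrypoint (the application title and the commented-out router names are assumptions, not taken from the project):

from fastapi import FastAPI

app = FastAPI(title="Sample DS Project")  # title is an assumed value


@app.get("/health")
async def health():
    """Basic health check endpoint, as mentioned in the routes description."""
    return {"status": "ok"}

# Training and prediction routes would be registered from the routes module, e.g.:
# app.include_router(train_router)
# app.include_router(predict_router)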

Configuration

This module centralizes all configuration settings for the project, ensuring consistent and maintainable configuration management. It provides access to environment variables, default settings, and other configurations required across different parts of the application.

Modules:

  • env_config: Handles environment-specific configurations, such as development, testing, and production settings.
  • app_config: Contains application-wide constants, such as API keys, logging settings, and feature flags.
  • db_config: Stores database connection settings, including connection strings and timeout durations.

Usage

Import the required configuration settings as needed:

Example:

from config.env_config import ENVIRONMENT, DEBUG_MODE
from config.db_config import DATABASE_URL

Features
  • Simplifies access to configuration values across the project.
  • Ensures separation of environment-specific settings from the codebase.
  • Supports overriding default settings via environment variables.
Purpose
  • Provides a centralized location for all project settings to improve maintainability.
  • Makes it easier to adapt the application to different environments or deployment scenarios.
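
A minimal sketch of what env_config and db_config might look like, assuming plain module-level values read from environment variables (the environment-variable names and defaults, other than the names imported in the example above, are assumptions):

# config/env_config.py (illustrative sketch)
import os

ENVIRONMENT: str = os.getenv("APP_ENV", "development")  # APP_ENV name is assumed
DEBUG_MODE: bool = os.getenv("DEBUG_MODE", "true").lower() in ("1", "true", "yes")

# config/db_config.py (illustrative sketch)
DATABASE_URL: str = os.getenv("DATABASE_URL", "sqlite:///./local.db")  # default is a placeholder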

Constants

This package defines global constants used throughout the project. Constants help in maintaining consistency and avoiding magic numbers or strings in the codebase.

Usage

Import the required constants as needed:

Example:

from constants import APP_NAME, ENVIRONMENT
from constants import STATUS_OK, STATUS_BAD_REQUEST

Purpose
  • Centralizes constant values for maintainability and reusability.
  • Reduces hard-coded values in the project.
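
A minimal sketch of such a constants module (only the names come from the example above; the values are illustrative assumptions):

# constants/__init__.py (illustrative sketch)
APP_NAME = "sample_ds_project"   # assumed value
ENVIRONMENT = "development"      # assumed value

# HTTP status codes reused by the API routes.
STATUS_OK = 200
STATUS_BAD_REQUEST = 400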

Components

This package contains modular and reusable components used across the project. Each component is designed to encapsulate specific functionality, promoting code reuse, scalability, and maintainability.

Modules:

  • api_component: Encapsulates functionality for interacting with external APIs.

Usage

Import and use the required components as needed:

Example:

from components.api_component import APIManager

Purpose
  • Organizes project functionality into self-contained, reusable components.
  • Promotes modular design principles, making the project easier to extend and maintain.
  • Ensures separation of concerns by isolating specific functionalities into dedicated modules.
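
A minimal sketch of what an APIManager might look like, assuming it wraps the requests library; the method names and parameters below are illustrative, not the component's actual interface:

# components/api_component.py (illustrative sketch)
import requests


class APIManager:
    """Thin wrapper around an external HTTP API."""

    def __init__(self, base_url: str, timeout: float = 10.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def get_json(self, endpoint: str, **params) -> dict:
        """Issue a GET request and return the decoded JSON payload."""
        response = requests.get(
            f"{self.base_url}/{endpoint.lstrip('/')}",
            params=params,
            timeout=self.timeout,
        )
        response.raise_for_status()
        return response.json()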

Logger

This module provides centralized logging utilities for the data science pipeline. It standardizes logging practices, ensures consistency across components, and facilitates easy debugging and monitoring of the pipeline's execution, including data preprocessing, model training, evaluation, and predictions.

Functions:

  • setup_logging: Configures the logging system, including log format, level, and output destinations.
  • get_logger: Returns a logger instance for a specific module or stage of the pipeline.

Features
  • Centralized logging configuration to maintain consistency.
  • Support for different log levels (INFO, DEBUG, WARNING, ERROR, CRITICAL).
  • Ability to write logs to files, console, or external monitoring systems.
  • Timestamped log entries for accurate tracking of events.
  • Integration with custom exception handling for detailed error reporting.
Usage

Use this module to log messages in a standardized manner across the project:

Example:

from sample_ds_project.logging import logger

logger.info("Starting the model training process...")
logger.error("An error occurred during data validation.")

Purpose
  • To provide a standardized mechanism for logging messages throughout the data science pipeline.
  • To assist in debugging by capturing detailed logs of each pipeline stage.
  • To enable seamless integration with monitoring and alerting systems.
Best Practices
  • Use appropriate log levels to categorize messages (e.g., DEBUG for detailed information, ERROR for issues).
  • Ensure logs include sufficient context, such as function names or input details, to aid debugging.
  • Regularly monitor log files for anomalies or errors in the pipeline.
Additional Notes
  • The setup_logging function can be configured to write logs to multiple destinations, such as files or cloud services.
  • The module can be extended to integrate with third-party monitoring tools like Elasticsearch, Splunk, or Datadog.
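
A minimal sketch of how setup_logging and get_logger could be built on the standard logging module (the log format, log file path, and import-time setup call are assumptions):

# sample_ds_project/logging/__init__.py (illustrative sketch)
import logging
import sys
from pathlib import Path

LOG_FORMAT = "[%(asctime)s] %(levelname)s %(name)s - %(message)s"  # assumed format


def setup_logging(level: int = logging.INFO, log_file: str = "logs/running.log") -> None:
    """Configure timestamped logging to both a file and the console."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=level,
        format=LOG_FORMAT,
        handlers=[logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)],
    )


def get_logger(name: str) -> logging.Logger:
    """Return a named logger for a specific module or pipeline stage."""
    return logging.getLogger(name)


setup_logging()  # assumed import-time configuration
logger = get_logger("sample_ds_project")  # the instance imported elsewhere in these docs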

Utils

The utils module provides various utility functions for file I/O, data encoding/decoding, and directory management.

Functions:

  • read_yaml: Reads a YAML file and returns its contents as a ConfigBox.
  • create_directories: Creates directories if they do not exist.
  • save_json: Saves data to a JSON file.
  • load_json: Loads JSON data from a file.
  • save_bin: Saves binary data to a file.
  • load_bin: Loads binary data from a file.
  • get_size: Returns the approximate size of a file in KB as a formatted string.
  • decode_image: Decodes a base64 string into an image and saves it to disk.
  • encode_image_into_base64: Encodes an image file into a base64 string.

create_directories(path_to_directories, verbose=True)

Creates a list of directories, skipping any that already exist.

Parameters:

  • path_to_directories (list, required): list of directory paths to create.
  • verbose (bool, default True): log each created directory when True.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def create_directories(path_to_directories: list, verbose: bool = True) -> None:
    """create list of directories

    Args:
        path_to_directories (list): list of path of directories
    """
    for path in path_to_directories:
        os.makedirs(path, exist_ok=True)
        if verbose:
            logger.info(f"created directory at: {path}")

decode_image(imgstring, file_name)

Decodes a base64 string into an image and saves it at the given path.

Parameters:

  • imgstring (str, required): base64 string of the image.
  • file_name (str, required): path at which to save the image.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def decode_image(imgstring: str, file_name: str) -> None:
    """
    Decodes a base64 string into an image and saves it at the given path

    Args:
        imgstring (str): base64 string of the image
        file_name (str): path at which to save the image
    """
    imgdata = base64.b64decode(imgstring)
    with open(file_name, "wb") as f:
        f.write(imgdata)  # the context manager closes the file on exit

encode_image_into_base64(cropped_image_path)

Encodes an image file into a base64 string.

Parameters:

  • cropped_image_path (str, required): Path to the image file to be encoded.

Returns:

  • bytes: Base64-encoded contents of the image file.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def encode_image_into_base64(cropped_image_path: str) -> Any:
    """
    Encodes an image file into a base64 string.

    Args:
        cropped_image_path (str): Path to the image file to be encoded.

    Returns:
        bytes: Base64 encoded string of the image.
    """
    with open(cropped_image_path, "rb") as f:
        return base64.b64encode(f.read())

get_size(path)

Returns the approximate size of a file in KB as a formatted string.

Parameters:

  • path (Path, required): path of the file.

Returns:

  • str: approximate size in KB (e.g. "~ 4 KB").

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def get_size(path: Path) -> str:
    """get size in KB

    Args:
        path (Path): path of the file

    Returns:
        str: size in KB
    """
    size_in_kb = round(os.path.getsize(path) / 1024)
    return f"~ {size_in_kb} KB"

load_bin(path)

Loads binary data from a file.

Parameters:

  • path (Path, required): path to the binary file.

Returns:

  • Any: object stored in the file.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def load_bin(path: Path) -> Any:
    """load binary data

    Args:
        path (Path): path to binary file

    Returns:
        Any: object stored in the file
    """
    data = joblib.load(path)
    logger.info(f"binary file loaded from: {path}")
    return data

load_json(path)

Loads JSON data from a file.

Parameters:

  • path (Path, required): path to the JSON file.

Returns:

  • ConfigBox: data accessible as class attributes instead of dict keys.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def load_json(path: Path) -> ConfigBox:
    """load json files data

    Args:
        path (Path): path to json file

    Returns:
        ConfigBox: data as class attributes instead of dict
    """
    with open(path) as f:
        content = json.load(f)

    logger.info(f"json file loaded succesfully from: {path}")
    return ConfigBox(content)

read_yaml(path_to_yaml)

Reads a YAML file and returns its contents as a ConfigBox.

Parameters:

  • path_to_yaml (Path, required): path to the YAML file.

Raises:

  • ValueError: if the YAML file is empty.

Returns:

  • ConfigBox: the parsed YAML content.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def read_yaml(path_to_yaml: Path) -> ConfigBox:
    """reads yaml file and returns

    Args:
        path_to_yaml (str): path like input

    Raises:
        ValueError: if yaml file is empty
        e: empty file

    Returns:
        ConfigBox: ConfigBox type
    """
    try:
        with open(path_to_yaml) as yaml_file:
            content = yaml.safe_load(yaml_file)
            logger.info(f"yaml file: {path_to_yaml} loaded successfully")
            return ConfigBox(content)
    except BoxValueError as box_err:
        raise ValueError(YAML_EMPTY_ERROR) from box_err

save_bin(data, path)

Saves data to a binary file.

Parameters:

  • data (Any, required): data to be saved as binary.
  • path (Path, required): path to the binary file.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def save_bin(data: Any, path: Path) -> None:
    """save binary file

    Args:
        data (Any): data to be saved as binary
        path (Path): path to binary file
    """
    joblib.dump(value=data, filename=path)
    logger.info(f"binary file saved at: {path}")

save_json(path, data)

Saves JSON data to a file.

Parameters:

  • path (Path, required): path to the JSON file.
  • data (dict, required): data to be saved in the JSON file.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def save_json(path: Path, data: dict) -> None:
    """save json data

    Args:
        path (Path): path to json file
        data (dict): data to be saved in json file
    """
    with open(path, "w") as f:
        json.dump(data, f, indent=4)

    logger.info(f"json file saved at: {path}")

Exceptions

This module defines custom exception classes and error-handling utilities tailored to the needs of a data science pipeline. It helps standardize error reporting, improve debugging, and provide meaningful feedback during model training, data preprocessing, and prediction processes.

Classes:

  • DataValidationError: Raised when input data fails validation checks.
  • ModelTrainingError: Raised during errors in the model training phase, such as convergence issues or invalid configurations.
  • PredictionError: Raised when the prediction pipeline encounters issues, such as missing features or incompatible input formats.
  • PipelineExecutionError: Raised for generic errors occurring during pipeline execution.

Usage

Import and use the exceptions in various stages of the data science pipeline:

Example:

from exception import DataValidationError, ModelTrainingError

try:
    validate_data(input_data)
except DataValidationError as e:
    logger.error(f"Data validation failed: {e}")
    raise

Features
  • Custom exceptions for specific pipeline stages, ensuring meaningful error reporting.
  • Enables targeted exception handling, reducing debugging time.
  • Provides a consistent structure for error messages across the project.
Purpose
  • To define project-specific exceptions for common error scenarios in the pipeline.
  • To improve the robustness and reliability of the pipeline by enabling clear error handling.
  • To make the debugging process more intuitive by raising descriptive errors.

Examples:

  • Data Validation: Raise a DataValidationError if the input data schema is incorrect or missing required fields.
  • Model Training: Raise a ModelTrainingError if the model fails to converge due to invalid hyperparameters.
  • Prediction: Raise a PredictionError when incompatible input data is passed to the model.
Additional Notes
  • Use these exceptions in conjunction with logging to provide detailed error information.
  • Ensure that custom exceptions are raised with meaningful messages to assist in debugging and error resolution.
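
The source for these classes is not reproduced here; the following is a minimal sketch of how such a hierarchy might be defined (the shared PipelineError base class is an assumption):

# illustrative sketch; the actual definitions live in the exception package
class PipelineError(Exception):
    """Assumed common base class for pipeline-specific errors."""


class DataValidationError(PipelineError):
    """Raised when input data fails validation checks."""


class ModelTrainingError(PipelineError):
    """Raised for errors in the model training phase, such as convergence issues."""


class PredictionError(PipelineError):
    """Raised when the prediction pipeline receives missing features or incompatible input."""


class PipelineExecutionError(PipelineError):
    """Raised for generic errors occurring during pipeline execution."""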

CustomException

Bases: HTTPException

Source code in sample_ds_project/exception/__init__.py
class CustomException(HTTPException):
    def __init__(self, status_code: int, detail: str):
        """
        Custom exception for handling API errors.

        :param status_code: The HTTP status code to return.
        :param detail: A string describing the error in detail.
        """
        super().__init__(status_code=status_code, detail=detail)

__init__(status_code, detail)

Custom exception for handling API errors.

Parameters:

  • status_code: The HTTP status code to return.
  • detail: A string describing the error in detail.

Source code in sample_ds_project/exception/__init__.py
def __init__(self, status_code: int, detail: str):
    """
    Custom exception for handling API errors.

    :param status_code: The HTTP status code to return.
    :param detail: A string describing the error in detail.
    """
    super().__init__(status_code=status_code, detail=detail)
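
A brief usage sketch showing how CustomException might be raised inside a route handler (the route path and validation check are illustrative):

from fastapi import FastAPI

from sample_ds_project.exception import CustomException

app = FastAPI()


@app.post("/predict")  # illustrative route; the real routes live in the routes module
async def predict(payload: dict):
    if not payload:
        # FastAPI turns the HTTPException subclass into a 400 JSON error response.
        raise CustomException(status_code=400, detail="Request body must not be empty.")
    return {"status": "ok"}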

Entities

This module defines the core entities and data structures used throughout the data science pipeline. Entities are designed to represent the inputs, outputs, and intermediate states of the model training and prediction processes, ensuring consistency and validation across the project.

Modules:

  • data_schema: Contains definitions for input data schemas, ensuring validation and compatibility with the pipeline.
  • model_params: Defines structures for storing model parameters, hyperparameters, and configuration settings.
  • prediction_result: Provides entities for representing and managing prediction outputs, including probabilities and metadata.

Usage

Import and use the required entities in your data science pipeline:

Example:

from entity.data_schema import InputSchema
from entity.model_params import ModelConfig
from entity.prediction_result import PredictionOutput

Features
  • Defines standardized data structures for inputs, outputs, and parameters.
  • Ensures validation and consistency in data passed through the pipeline.
  • Promotes maintainability and readability by using clear, reusable entities.
Purpose
  • Serves as a single source of truth for defining data structures in the pipeline.
  • Facilitates seamless integration between different stages of the pipeline, such as data ingestion, validation, model training, and prediction.
  • Improves error handling by validating data early in the process.

Examples:

  • Data Schema: Define the expected input structure for data preprocessing.
  • Model Parameters: Store configurations like learning rate, batch size, and optimizer type.
  • Prediction Results: Represent the model's outputs in a structured format, including predicted classes, probabilities, and confidence scores.
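
A minimal sketch of one such entity as a frozen dataclass (the field names follow the Model Parameters example above; the defaults are assumptions):

# entity/model_params.py (illustrative sketch)
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Training configuration passed between pipeline stages."""

    learning_rate: float = 1e-3   # assumed default
    batch_size: int = 32          # assumed default
    optimizer: str = "adam"       # assumed default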

Pipeline

The pipeline module orchestrates the end-to-end flow of the data science process, from raw data ingestion to final predictions. It consists of multiple submodules, each responsible for a specific stage in the pipeline. This modular structure ensures scalability, reusability, and ease of maintenance. The pipeline is designed to handle data preprocessing, model training, evaluation, and predictions in a systematic and automated manner.

Modules:

  • Data-Ingestion: Collects and ingests raw data from various sources, performs basic checks, and stores it in a structured format.
  • Data-Validation: Validates ingested data for correctness, completeness, and consistency, ensuring it meets predefined quality standards.
  • Data-Transformation: Transforms validated data into a format suitable for model training, including feature engineering, scaling, encoding, and preprocessing.
  • Model-Training: Trains machine learning models using transformed data, supports hyperparameter tuning, saving trained models, and logging metrics.
  • Model-Evaluation: Evaluates trained models on a validation or test dataset, providing detailed performance metrics and insights.
  • Prediction: Uses trained models to make predictions on new or unseen data, including batch or real-time inference and post-processing of predictions.

Features
  • Modular architecture for each pipeline stage, ensuring maintainability and reusability.
  • Support for extensive logging and error handling at each stage.
  • Flexibility to customize and extend pipeline stages as needed.
  • Compatibility with various data formats and storage systems.
Usage

Import and use specific pipeline stages or run the entire pipeline end-to-end:

Example:

from pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
from pipeline.stage_04_model_trainer import ModelTrainingPipeline

# Perform data ingestion
data_ingestion = DataIngestionTrainingPipeline(config)
raw_data = data_ingestion.run()

# Train the model
model_trainer = ModelTrainingPipeline(config, raw_data)
trained_model = model_trainer.run()

Purpose
  • To streamline the execution of a data science workflow, reducing manual intervention.
  • To ensure consistency and traceability of processes across multiple runs.
  • To provide reusable components for different machine learning projects.
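
A minimal sketch of the stage pattern implied by the example above, assuming each stage class exposes a run() method (the class body is illustrative, not the project's actual implementation):

# pipeline/stage_01_data_ingestion.py (illustrative sketch)
from sample_ds_project.logging import logger


class DataIngestionTrainingPipeline:
    """Wraps the data ingestion stage so it can run standalone or be orchestrated."""

    def __init__(self, config):
        self.config = config

    def run(self):
        logger.info("Starting data ingestion stage...")
        # Collect raw data, perform basic checks, and return it in a
        # structured form for the downstream stages (details assumed).
        raw_data = {"records": []}  # placeholder result
        logger.info("Data ingestion stage completed.")
        return raw_data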

Routes

The routes module defines the API routes that enable interaction with the machine learning model. This module contains endpoints for initiating model training and making predictions on new data.

Endpoints
  • POST /train-model: Triggers the model training process.
  • POST /predict: Generates predictions on new data using the trained model (a route sketch follows at the end of this section).
Features
  • API endpoints for model training and prediction.
  • Flexible and easy-to-extend with additional routes.
  • Integration with the model training pipeline and prediction modules.
  • Handles input validation and error responses for robustness.
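
A minimal sketch of how these endpoints might be declared with an APIRouter (the handler bodies and response shapes are assumptions; only the paths come from the list above):

# routes sketch (illustrative)
from fastapi import APIRouter

from sample_ds_project.exception import CustomException

router = APIRouter()


@router.post("/train-model")
async def train_model():
    """Trigger the model training pipeline."""
    # The real handler would kick off the training pipeline stages.
    return {"status": "training started"}


@router.post("/predict")
async def predict(payload: dict):
    """Generate predictions for the given input using the trained model."""
    if not payload:
        raise CustomException(status_code=400, detail="Input payload must not be empty.")
    # The real handler would call the prediction pipeline.
    return {"predictions": []}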