Modules

This project is a Data Science application built with FastAPI, designed to facilitate model training, prediction, and data processing. The application uses Poetry for dependency management and follows modular design principles to ensure scalability and maintainability.

Modules:

  • configuration: Contains configuration settings for the application, including environment variables, paths, and system-specific configurations.
  • component: Includes reusable components and helper classes for various processes, such as data manipulation and interaction with external services.
  • constant: Defines constant values, enumerations, and other static data used throughout the application.
  • utils: Contains utility functions that perform common tasks such as reading and writing files, processing images, and other system operations.
  • pipeline: Implements the core stages of the data science pipeline, including data ingestion, validation, transformation, model training, and evaluation.
  • routes: Defines the FastAPI routes for triggering model training, making predictions, and providing health checks for the application.

Features
  • Model training and prediction workflows via FastAPI.
  • Modular and extensible design to add new features.
  • Dependency management with Poetry.
  • Built for data scientists and developers working with machine learning models in production environments.

This application is designed to help automate and streamline the workflow of training machine learning models and making predictions via an API interface, with easy-to-understand routes and clear separation of concerns between different modules.
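
As an illustrative sketch of how these pieces might fit together in a FastAPI entrypoint (the application title and the commented-out router names are assumptions, not taken from the project):

from fastapi import FastAPI

app = FastAPI(title="Sample DS Project")  # title is an assumed value


@app.get("/health")
async def health():
    """Basic health check endpoint, as mentioned in the routes description."""
    return {"status": "ok"}

# Training and prediction routes would be registered from the routes module, e.g.:
# app.include_router(train_router)
# app.include_router(predict_router)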

Configuration

This module centralizes all configuration settings for the project, ensuring consistent and maintainable configuration management. It provides access to environment variables, default settings, and other configurations required across different parts of the application.

Modules:

  • env_config: Handles environment-specific configurations, such as development, testing, and production settings.
  • app_config: Contains application-wide constants, such as API keys, logging settings, and feature flags.
  • db_config: Stores database connection settings, including connection strings and timeout durations.

Usage

Import the required configuration settings as needed:

Example:

from config.env_config import ENVIRONMENT, DEBUG_MODE
from config.db_config import DATABASE_URL

Features
  • Simplifies access to configuration values across the project.
  • Ensures separation of environment-specific settings from the codebase.
  • Supports overriding default settings via environment variables.
Purpose
  • Provides a centralized location for all project settings to improve maintainability.
  • Makes it easier to adapt the application to different environments or deployment scenarios.
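
A minimal sketch of what env_config and db_config might look like, assuming plain module-level values read from environment variables (the environment-variable names and defaults, other than the names imported in the example above, are assumptions):

# config/env_config.py (illustrative sketch)
import os

ENVIRONMENT: str = os.getenv("APP_ENV", "development")  # APP_ENV name is assumed
DEBUG_MODE: bool = os.getenv("DEBUG_MODE", "true").lower() in ("1", "true", "yes")

# config/db_config.py (illustrative sketch)
DATABASE_URL: str = os.getenv("DATABASE_URL", "sqlite:///./local.db")  # default is a placeholder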

Constants

This package defines global constants used throughout the project. Constants help in maintaining consistency and avoiding magic numbers or strings in the codebase.

Usage

Import the required constants as needed:

Example:

from constants import APP_NAME, ENVIRONMENT
from constants import STATUS_OK, STATUS_BAD_REQUEST

Purpose
  • Centralizes constant values for maintainability and reusability.
  • Reduces hard-coded values in the project.
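
A minimal sketch of such a constants module (only the names come from the example above; the values are illustrative assumptions):

# constants/__init__.py (illustrative sketch)
APP_NAME = "sample_ds_project"   # assumed value
ENVIRONMENT = "development"      # assumed value

# HTTP status codes reused by the API routes.
STATUS_OK = 200
STATUS_BAD_REQUEST = 400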

Components

This package contains modular and reusable components used across the project. Each component is designed to encapsulate specific functionality, promoting code reuse, scalability, and maintainability.

Modules:

  • api_component: Encapsulates functionality for interacting with external APIs.

Usage

Import and use the required components as needed:

Example:

from components.api_component import APIManager

Purpose
  • Organizes project functionality into self-contained, reusable components.
  • Promotes modular design principles, making the project easier to extend and maintain.
  • Ensures separation of concerns by isolating specific functionalities into dedicated modules.
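
A minimal sketch of what an APIManager might look like, assuming it wraps the requests library; the method names and parameters below are illustrative, not the component's actual interface:

# components/api_component.py (illustrative sketch)
import requests


class APIManager:
    """Thin wrapper around an external HTTP API."""

    def __init__(self, base_url: str, timeout: float = 10.0):
        self.base_url = base_url.rstrip("/")
        self.timeout = timeout

    def get_json(self, endpoint: str, **params) -> dict:
        """Issue a GET request and return the decoded JSON payload."""
        response = requests.get(
            f"{self.base_url}/{endpoint.lstrip('/')}",
            params=params,
            timeout=self.timeout,
        )
        response.raise_for_status()
        return response.json()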

Logger

This module provides centralized logging utilities for the data science pipeline. It standardizes logging practices, ensures consistency across components, and facilitates easy debugging and monitoring of the pipeline's execution, including data preprocessing, model training, evaluation, and predictions.

Functions:

  • setup_logging: Configures the logging system, including log format, level, and output destinations.
  • get_logger: Returns a logger instance for a specific module or stage of the pipeline.

Features
  • Centralized logging configuration to maintain consistency.
  • Support for different log levels (INFO, DEBUG, WARNING, ERROR, CRITICAL).
  • Ability to write logs to files, console, or external monitoring systems.
  • Timestamped log entries for accurate tracking of events.
  • Integration with custom exception handling for detailed error reporting.
Usage

Use this module to log messages in a standardized manner across the project:

Example:

from sample_ds_project.logging import logger

logger.info("Starting the model training process...")
logger.error("An error occurred during data validation.")

Purpose
  • To provide a standardized mechanism for logging messages throughout the data science pipeline.
  • To assist in debugging by capturing detailed logs of each pipeline stage.
  • To enable seamless integration with monitoring and alerting systems.
Best Practices
  • Use appropriate log levels to categorize messages (e.g., DEBUG for detailed information, ERROR for issues).
  • Ensure logs include sufficient context, such as function names or input details, to aid debugging.
  • Regularly monitor log files for anomalies or errors in the pipeline.
Additional Notes
  • The setup_logging function can be configured to write logs to multiple destinations, such as files or cloud services.
  • The module can be extended to integrate with third-party monitoring tools like Elasticsearch, Splunk, or Datadog.
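
A minimal sketch of how setup_logging and get_logger could be built on the standard logging module (the log format, log file path, and import-time setup call are assumptions):

# sample_ds_project/logging/__init__.py (illustrative sketch)
import logging
import sys
from pathlib import Path

LOG_FORMAT = "[%(asctime)s] %(levelname)s %(name)s - %(message)s"  # assumed format


def setup_logging(level: int = logging.INFO, log_file: str = "logs/running.log") -> None:
    """Configure timestamped logging to both a file and the console."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=level,
        format=LOG_FORMAT,
        handlers=[logging.FileHandler(log_file), logging.StreamHandler(sys.stdout)],
    )


def get_logger(name: str) -> logging.Logger:
    """Return a named logger for a specific module or pipeline stage."""
    return logging.getLogger(name)


setup_logging()  # assumed import-time configuration
logger = get_logger("sample_ds_project")  # the instance imported elsewhere in these docs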

Utils

The utils module provides various utility functions for file I/O, data encoding/decoding, and directory management.

Functions:

  • read_yaml: Reads a YAML file and returns its contents as a ConfigBox.
  • create_directories: Creates directories if they do not exist.
  • save_json: Saves data to a JSON file.
  • load_json: Loads JSON data from a file.
  • save_bin: Saves binary data to a file.
  • load_bin: Loads binary data from a file.
  • get_size: Returns the approximate size of a file in KB as a formatted string.
  • decode_image: Decodes a base64 string into an image and saves it to disk.
  • encode_image_into_base64: Encodes an image file into a base64 string.

create_directories(path_to_directories, verbose=True)

Creates a list of directories, skipping any that already exist.

Parameters:

  • path_to_directories (list, required): list of directory paths to create.
  • verbose (bool, default True): log each created directory when True.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def create_directories(path_to_directories: list, verbose: bool = True) -> None:
    """create list of directories

    Args:
        path_to_directories (list): list of path of directories
    """
    for path in path_to_directories:
        os.makedirs(path, exist_ok=True)
        if verbose:
            logger.info(f"created directory at: {path}")

decode_image(imgstring, file_name)

Decodes a base64 string into an image and saves it at the given path.

Parameters:

  • imgstring (str, required): base64 string of the image.
  • file_name (str, required): path at which to save the image.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def decode_image(imgstring: str, file_name: str) -> None:
    """
    Decodes a base64 string into an image and saves it at the given path

    Args:
        imgstring (str): base64 string of the image
        file_name (str): path at which to save the image
    """
    imgdata = base64.b64decode(imgstring)
    with open(file_name, "wb") as f:
        f.write(imgdata)  # the context manager closes the file on exit

encode_image_into_base64(cropped_image_path)

Encodes an image file into a base64 string.

Parameters:

  • cropped_image_path (str, required): Path to the image file to be encoded.

Returns:

  • bytes: Base64-encoded contents of the image file.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def encode_image_into_base64(cropped_image_path: str) -> Any:
    """
    Encodes an image file into a base64 string.

    Args:
        cropped_image_path (str): Path to the image file to be encoded.

    Returns:
        bytes: Base64 encoded string of the image.
    """
    with open(cropped_image_path, "rb") as f:
        return base64.b64encode(f.read())

get_size(path)

Returns the approximate size of a file in KB as a formatted string.

Parameters:

  • path (Path, required): path of the file.

Returns:

  • str: approximate size in KB (e.g. "~ 4 KB").

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def get_size(path: Path) -> str:
    """get size in KB

    Args:
        path (Path): path of the file

    Returns:
        str: size in KB
    """
    size_in_kb = round(os.path.getsize(path) / 1024)
    return f"~ {size_in_kb} KB"

load_bin(path)

Loads binary data from a file.

Parameters:

  • path (Path, required): path to the binary file.

Returns:

  • Any: object stored in the file.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def load_bin(path: Path) -> Any:
    """load binary data

    Args:
        path (Path): path to binary file

    Returns:
        Any: object stored in the file
    """
    data = joblib.load(path)
    logger.info(f"binary file loaded from: {path}")
    return data

load_json(path)

Loads JSON data from a file.

Parameters:

  • path (Path, required): path to the JSON file.

Returns:

  • ConfigBox: data accessible as class attributes instead of dict keys.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def load_json(path: Path) -> ConfigBox:
    """load json files data

    Args:
        path (Path): path to json file

    Returns:
        ConfigBox: data as class attributes instead of dict
    """
    with open(path) as f:
        content = json.load(f)

    logger.info(f"json file loaded succesfully from: {path}")
    return ConfigBox(content)

read_yaml(path_to_yaml)

Reads a YAML file and returns its contents as a ConfigBox.

Parameters:

  • path_to_yaml (Path, required): path to the YAML file.

Raises:

  • ValueError: if the YAML file is empty.

Returns:

  • ConfigBox: the parsed YAML content.

Source code in sample_ds_project/utils/common.py
@ensure_annotations
def read_yaml(path_to_yaml: Path) -> ConfigBox:
    """reads yaml file and returns

    Args:
        path_to_yaml (str): path like input

    Raises:
        ValueError: if yaml file is empty
        e: empty file

    Returns:
        ConfigBox: ConfigBox type
    """
    try:
        with open(path_to_yaml) as yaml_file:
            content = yaml.safe_load(yaml_file)
            logger.info(f"yaml file: {path_to_yaml} loaded successfully")
            return ConfigBox(content)
    except BoxValueError as box_err:
        raise ValueError(YAML_EMPTY_ERROR) from box_err

save_bin(data, path)

Saves data to a binary file.

Parameters:

  • data (Any, required): data to be saved as binary.
  • path (Path, required): path to the binary file.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def save_bin(data: Any, path: Path) -> None:
    """save binary file

    Args:
        data (Any): data to be saved as binary
        path (Path): path to binary file
    """
    joblib.dump(value=data, filename=path)
    logger.info(f"binary file saved at: {path}")

save_json(path, data)

Saves JSON data to a file.

Parameters:

  • path (Path, required): path to the JSON file.
  • data (dict, required): data to be saved in the JSON file.
Source code in sample_ds_project/utils/common.py
@ensure_annotations
def save_json(path: Path, data: dict) -> None:
    """save json data

    Args:
        path (Path): path to json file
        data (dict): data to be saved in json file
    """
    with open(path, "w") as f:
        json.dump(data, f, indent=4)

    logger.info(f"json file saved at: {path}")

Exceptions

This module defines custom exception classes and error-handling utilities tailored to the needs of a data science pipeline. It helps standardize error reporting, improve debugging, and provide meaningful feedback during model training, data preprocessing, and prediction processes.

Classes:

  • DataValidationError: Raised when input data fails validation checks.
  • ModelTrainingError: Raised during errors in the model training phase, such as convergence issues or invalid configurations.
  • PredictionError: Raised when the prediction pipeline encounters issues, such as missing features or incompatible input formats.
  • PipelineExecutionError: Raised for generic errors occurring during pipeline execution.

Usage

Import and use the exceptions in various stages of the data science pipeline:

Example:

from exception import DataValidationError, ModelTrainingError

try:
    validate_data(input_data)
except DataValidationError as e:
    logger.error(f"Data validation failed: {e}")
    raise

Features
  • Custom exceptions for specific pipeline stages, ensuring meaningful error reporting.
  • Enables targeted exception handling, reducing debugging time.
  • Provides a consistent structure for error messages across the project.
Purpose
  • To define project-specific exceptions for common error scenarios in the pipeline.
  • To improve the robustness and reliability of the pipeline by enabling clear error handling.
  • To make the debugging process more intuitive by raising descriptive errors.

Examples:

  • Data Validation: Raise a DataValidationError if the input data schema is incorrect or missing required fields.
  • Model Training: Raise a ModelTrainingError if the model fails to converge due to invalid hyperparameters.
  • Prediction: Raise a PredictionError when incompatible input data is passed to the model.
Additional Notes
  • Use these exceptions in conjunction with logging to provide detailed error information.
  • Ensure that custom exceptions are raised with meaningful messages to assist in debugging and error resolution.
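
The source for these classes is not reproduced here; the following is a minimal sketch of how such a hierarchy might be defined (the shared PipelineError base class is an assumption):

# illustrative sketch; the actual definitions live in the exception package
class PipelineError(Exception):
    """Assumed common base class for pipeline-specific errors."""


class DataValidationError(PipelineError):
    """Raised when input data fails validation checks."""


class ModelTrainingError(PipelineError):
    """Raised for errors in the model training phase, such as convergence issues."""


class PredictionError(PipelineError):
    """Raised when the prediction pipeline receives missing features or incompatible input."""


class PipelineExecutionError(PipelineError):
    """Raised for generic errors occurring during pipeline execution."""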

CustomException

Bases: HTTPException

Source code in sample_ds_project/exception/__init__.py
class CustomException(HTTPException):
    def __init__(self, status_code: int, detail: str):
        """
        Custom exception for handling API errors.

        :param status_code: The HTTP status code to return.
        :param detail: A string describing the error in detail.
        """
        super().__init__(status_code=status_code, detail=detail)

__init__(status_code, detail)

Custom exception for handling API errors.

Parameters:

  • status_code: The HTTP status code to return.
  • detail: A string describing the error in detail.

Source code in sample_ds_project/exception/__init__.py
def __init__(self, status_code: int, detail: str):
    """
    Custom exception for handling API errors.

    :param status_code: The HTTP status code to return.
    :param detail: A string describing the error in detail.
    """
    super().__init__(status_code=status_code, detail=detail)
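
A brief usage sketch showing how CustomException might be raised inside a route handler (the route path and validation check are illustrative):

from fastapi import FastAPI

from sample_ds_project.exception import CustomException

app = FastAPI()


@app.post("/predict")  # illustrative route; the real routes live in the routes module
async def predict(payload: dict):
    if not payload:
        # FastAPI turns the HTTPException subclass into a 400 JSON error response.
        raise CustomException(status_code=400, detail="Request body must not be empty.")
    return {"status": "ok"}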

Entities

This module defines the core entities and data structures used throughout the data science pipeline. Entities are designed to represent the inputs, outputs, and intermediate states of the model training and prediction processes, ensuring consistency and validation across the project.

Modules:

  • data_schema: Contains definitions for input data schemas, ensuring validation and compatibility with the pipeline.
  • model_params: Defines structures for storing model parameters, hyperparameters, and configuration settings.
  • prediction_result: Provides entities for representing and managing prediction outputs, including probabilities and metadata.

Usage

Import and use the required entities in your data science pipeline:

Example:

from entity.data_schema import InputSchema
from entity.model_params import ModelConfig
from entity.prediction_result import PredictionOutput

Features
  • Defines standardized data structures for inputs, outputs, and parameters.
  • Ensures validation and consistency in data passed through the pipeline.
  • Promotes maintainability and readability by using clear, reusable entities.
Purpose
  • Serves as a single source of truth for defining data structures in the pipeline.
  • Facilitates seamless integration between different stages of the pipeline, such as data ingestion, validation, model training, and prediction.
  • Improves error handling by validating data early in the process.

Examples:

  • Data Schema: Define the expected input structure for data preprocessing.
  • Model Parameters: Store configurations like learning rate, batch size, and optimizer type.
  • Prediction Results: Represent the model's outputs in a structured format, including predicted classes, probabilities, and confidence scores.
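
A minimal sketch of one such entity as a frozen dataclass (the field names follow the Model Parameters example above; the defaults are assumptions):

# entity/model_params.py (illustrative sketch)
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Training configuration passed between pipeline stages."""

    learning_rate: float = 1e-3   # assumed default
    batch_size: int = 32          # assumed default
    optimizer: str = "adam"       # assumed default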

Pipeline

The pipeline module orchestrates the end-to-end flow of the data science process, from raw data ingestion to final predictions. It consists of multiple submodules, each responsible for a specific stage in the pipeline. This modular structure ensures scalability, reusability, and ease of maintenance. The pipeline is designed to handle data preprocessing, model training, evaluation, and predictions in a systematic and automated manner.

Modules:

  • Data-Ingestion: Collects and ingests raw data from various sources, performs basic checks, and stores it in a structured format.
  • Data-Validation: Validates ingested data for correctness, completeness, and consistency, ensuring it meets predefined quality standards.
  • Data-Transformation: Transforms validated data into a format suitable for model training, including feature engineering, scaling, encoding, and preprocessing.
  • Model-Training: Trains machine learning models using transformed data, supports hyperparameter tuning, saving trained models, and logging metrics.
  • Model-Evaluation: Evaluates trained models on a validation or test dataset, providing detailed performance metrics and insights.
  • Prediction: Uses trained models to make predictions on new or unseen data, including batch or real-time inference and post-processing of predictions.

Features
  • Modular architecture for each pipeline stage, ensuring maintainability and reusability.
  • Support for extensive logging and error handling at each stage.
  • Flexibility to customize and extend pipeline stages as needed.
  • Compatibility with various data formats and storage systems.
Usage

Import and use specific pipeline stages or run the entire pipeline end-to-end:

Example:

from pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
from pipeline.stage_04_model_trainer import ModelTrainingPipeline

# Perform data ingestion
data_ingestion = DataIngestionTrainingPipeline(config)
raw_data = data_ingestion.run()

# Train the model
model_trainer = ModelTrainingPipeline(config, raw_data)
trained_model = model_trainer.run()

Purpose
  • To streamline the execution of a data science workflow, reducing manual intervention.
  • To ensure consistency and traceability of processes across multiple runs.
  • To provide reusable components for different machine learning projects.
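
A minimal sketch of the stage pattern implied by the example above, assuming each stage class exposes a run() method (the class body is illustrative, not the project's actual implementation):

# pipeline/stage_01_data_ingestion.py (illustrative sketch)
from sample_ds_project.logging import logger


class DataIngestionTrainingPipeline:
    """Wraps the data ingestion stage so it can run standalone or be orchestrated."""

    def __init__(self, config):
        self.config = config

    def run(self):
        logger.info("Starting data ingestion stage...")
        # Collect raw data, perform basic checks, and return it in a
        # structured form for the downstream stages (details assumed).
        raw_data = {"records": []}  # placeholder result
        logger.info("Data ingestion stage completed.")
        return raw_data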

Routes

The routes module defines the API routes that enable interaction with the machine learning model. This module contains endpoints for initiating model training and making predictions on new data.

Endpoints
  • POST /train-model: Triggers the model training process.
  • POST /predict: Generates predictions on new data using the trained model (a route sketch follows at the end of this section).
Features
  • API endpoints for model training and prediction.
  • Flexible and easy-to-extend with additional routes.
  • Integration with the model training pipeline and prediction modules.
  • Handles input validation and error responses for robustness.
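
A minimal sketch of how these endpoints might be declared with an APIRouter (the handler bodies and response shapes are assumptions; only the paths come from the list above):

# routes sketch (illustrative)
from fastapi import APIRouter

from sample_ds_project.exception import CustomException

router = APIRouter()


@router.post("/train-model")
async def train_model():
    """Trigger the model training pipeline."""
    # The real handler would kick off the training pipeline stages.
    return {"status": "training started"}


@router.post("/predict")
async def predict(payload: dict):
    """Generate predictions for the given input using the trained model."""
    if not payload:
        raise CustomException(status_code=400, detail="Input payload must not be empty.")
    # The real handler would call the prediction pipeline.
    return {"predictions": []}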