Modules
This project is a Data Science application built with FastAPI, designed to facilitate model training, prediction, and data processing. The application uses Poetry for dependency management and follows modular design principles to ensure scalability and maintainability.
Modules:
Name | Description |
---|---|
configuration | Contains configuration settings for the application, including environment variables, paths, and system-specific configurations. |
component | Includes reusable components and helper classes for various processes, such as data manipulation and interaction with external services. |
constant | Defines constant values, enumerations, and other static data used throughout the application. |
utils | Contains utility functions that perform common tasks such as reading and writing files, processing images, and other system operations. |
pipeline | Implements the core stages of the data science pipeline, including data ingestion, validation, transformation, model training, and evaluation. |
routes | Defines the FastAPI routes for triggering model training, making predictions, and providing health checks for the application. |
Features
- Model training and prediction workflows via FastAPI.
- Modular and extensible design to add new features.
- Dependency management with Poetry.
- Built for data scientists and developers working with machine learning models in production environments.
This application is designed to help automate and streamline the workflow of training machine learning models and making predictions via an API interface, with easy-to-understand routes and clear separation of concerns between different modules.
Configuration¶
This module centralizes all configuration settings for the project, ensuring consistent and maintainable configuration management. It provides access to environment variables, default settings, and other configurations required across different parts of the application.
Modules:
Name | Description |
---|---|
env_config | Handles environment-specific configurations, such as development, testing, and production settings. |
app_config | Contains application-wide constants, such as API keys, logging settings, and feature flags. |
db_config | Stores database connection settings, including connection strings and timeout durations. |
Usage
Import the required configuration settings as needed:
Example:
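A minimal sketch, assuming these modules are importable from the project's configuration package; the import path and attribute names are hypothetical placeholders, not the project's actual settings:

```python
# Import path and attribute names are assumptions for illustration only.
from sample_ds_project.configuration import app_config, db_config, env_config

environment = env_config.ENVIRONMENT             # e.g. "development" or "production"
log_level = app_config.LOG_LEVEL                 # application-wide logging level
connection_string = db_config.CONNECTION_STRING  # database connection settings
```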
Features
- Simplifies access to configuration values across the project.
- Ensures separation of environment-specific settings from the codebase.
- Supports overriding default settings via environment variables.
Purpose
- Provides a centralized location for all project settings to improve maintainability.
- Makes it easier to adapt the application to different environments or deployment scenarios.
Constants¶
This package defines global constants used throughout the project. Constants help in maintaining consistency and avoiding magic numbers or strings in the codebase.
Usage
Import the required constants as needed:
Example:
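A minimal sketch of importing values from the constant package; the constant names shown are hypothetical examples of the kind of static values it holds:

```python
# Hypothetical constant names, shown only to illustrate the import pattern.
from sample_ds_project.constant import CONFIG_FILE_PATH, PARAMS_FILE_PATH

config_path = CONFIG_FILE_PATH   # path to the main configuration file
params_path = PARAMS_FILE_PATH   # path to the model parameters file
```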
Purpose
- Centralizes constant values for maintainability and reusability.
- Reduces hard-coded values in the project.
Components¶
This package contains modular and reusable components used across the project. Each component is designed to encapsulate specific functionality, promoting code reuse, scalability, and maintainability.
Modules:
Name | Description |
---|---|
api_component | Encapsulates functionality for interacting with external APIs. |
Usage
Import and use the required components as needed:
Example:
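A hedged sketch of using the api_component module; only the module name is documented, so the class, constructor, and method below are hypothetical:

```python
# Class and method names are hypothetical; only the api_component module is documented.
from sample_ds_project.component.api_component import ApiComponent

api_client = ApiComponent(base_url="https://api.example.com")  # assumed constructor
items = api_client.fetch_data(endpoint="/items")               # assumed method
```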
Purpose
- Organizes project functionality into self-contained, reusable components.
- Promotes modular design principles, making the project easier to extend and maintain.
- Ensures separation of concerns by isolating specific functionalities into dedicated modules.
Logger¶
This module provides centralized logging utilities for the data science pipeline. It standardizes logging practices, ensures consistency across components, and facilitates easy debugging and monitoring of the pipeline's execution, including data preprocessing, model training, evaluation, and predictions.
Functions:
Name | Description |
---|---|
setup_logging | Configures the logging system, including log format, level, and output destinations. |
get_logger | Returns a logger instance for a specific module or stage of the pipeline. |
Features
- Centralized logging configuration to maintain consistency.
- Support for different log levels (INFO, DEBUG, WARNING, ERROR, CRITICAL).
- Ability to write logs to files, console, or external monitoring systems.
- Timestamped log entries for accurate tracking of events.
- Integration with custom exception handling for detailed error reporting.
Usage
Use this module to log messages in a standardized manner across the project:
Example:
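A minimal sketch using the two functions documented above; the import path is an assumption based on the project layout:

```python
# The module path below is an assumption based on the project structure.
from sample_ds_project.logger import get_logger, setup_logging

# Configure log format, level, and output destinations once at application start-up.
setup_logging()

# Obtain a logger scoped to the current module or pipeline stage.
logger = get_logger(__name__)

logger.info("Starting data ingestion stage")
logger.error("Model training failed", exc_info=True)
```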
Purpose
- To provide a standardized mechanism for logging messages throughout the data science pipeline.
- To assist in debugging by capturing detailed logs of each pipeline stage.
- To enable seamless integration with monitoring and alerting systems.
Best Practices
- Use appropriate log levels to categorize messages (e.g., DEBUG for detailed information, ERROR for issues).
- Ensure logs include sufficient context, such as function names or input details, to aid debugging.
- Regularly monitor log files for anomalies or errors in the pipeline.
Additional Notes
- The `setup_logging` function can be configured to write logs to multiple destinations, such as files or cloud services.
- The module can be extended to integrate with third-party monitoring tools like Elasticsearch, Splunk, or Datadog.
Utils¶
The `utils` module provides various utility functions for file I/O, data encoding/decoding, and directory management.
Functions:
Name | Description |
---|---|
read_yaml | Reads a YAML file and returns its contents as a dictionary. |
create_directories | Creates directories if they do not exist. |
save_json | Saves data to a JSON file. |
load_json | Loads JSON data from a file. |
save_bin | Saves binary data to a file. |
load_bin | Loads binary data from a file. |
get_size | Returns the size of a file in KB. |
decode_image | Decodes an image from a base64 string. |
encode_image_into_base64 | Encodes an image into a base64 string. |
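Usage
A short sketch combining several of the utilities above; the function names come from sample_ds_project/utils/common.py, while the file paths and data values are illustrative:

```python
from pathlib import Path

from sample_ds_project.utils.common import (
    create_directories,
    get_size,
    read_yaml,
    save_json,
)

# Read pipeline configuration (returns a ConfigBox); the path is illustrative.
config = read_yaml("config/config.yaml")

# Ensure the artifact directories exist.
create_directories(["artifacts", "artifacts/metrics"])

# Persist metrics and report the resulting file size in KB.
save_json(Path("artifacts/metrics/metrics.json"), {"accuracy": 0.93})
print(get_size(Path("artifacts/metrics/metrics.json")))
```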
create_directories(path_to_directories, verbose=True)¶
Creates the given list of directories if they do not already exist.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_to_directories | list | List of directory paths to create. | required |
Source code in sample_ds_project/utils/common.py
decode_image(imgstring, file_name)¶
Decodes a base64 string into an image and saves it at the given path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
imgstring | str | base64 string of the image | required |
file_name | str | path at which to save the image | required |
Source code in sample_ds_project/utils/common.py
encode_image_into_base64(cropped_image_path)¶
Encodes an image file into a base64 string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cropped_image_path | str | Path to the image file to be encoded. | required |
Returns:
Name | Type | Description |
---|---|---|
bytes | Any | Base64 encoded string of the image. |
Source code in sample_ds_project/utils/common.py
get_size(path)¶
Gets the size of a file in KB.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | path of the file | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | size in KB |
Source code in sample_ds_project/utils/common.py
load_bin(path)¶
Loads binary data from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | path to binary file | required |
Returns:
Name | Type | Description |
---|---|---|
Any | Any | object stored in the file |
Source code in sample_ds_project/utils/common.py
load_json(path)¶
Loads data from a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | path to json file | required |
Returns:
Name | Type | Description |
---|---|---|
ConfigBox | ConfigBox | Data accessible as class attributes instead of dict keys. |
Source code in sample_ds_project/utils/common.py
read_yaml(path_to_yaml)¶
Reads a YAML file and returns its contents as a ConfigBox.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_to_yaml | str | Path to the YAML file. | required |
Raises:
Type | Description |
---|---|
ValueError | Raised if the YAML file is empty. |
Returns:
Name | Type | Description |
---|---|---|
ConfigBox | ConfigBox | ConfigBox type |
Source code in sample_ds_project/utils/common.py
save_bin(data, path)¶
Saves data to a binary file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Any | data to be saved as binary | required |
path | Path | path to binary file | required |
Source code in sample_ds_project/utils/common.py
save_json(path, data)¶
Saves data to a JSON file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | path to json file | required |
data | dict | data to be saved in json file | required |
Source code in sample_ds_project/utils/common.py
Exceptions¶
This module defines custom exception classes and error-handling utilities tailored to the needs of a data science pipeline. It helps standardize error reporting, improve debugging, and provide meaningful feedback during model training, data preprocessing, and prediction processes.
Classes:
Name | Description |
---|---|
DataValidationError | Raised when input data fails validation checks. |
ModelTrainingError | Raised during errors in the model training phase, such as convergence issues or invalid configurations. |
PredictionError | Raised when the prediction pipeline encounters issues, such as missing features or incompatible input formats. |
PipelineExecutionError | Raised for generic errors occurring during pipeline execution. |
Usage
Import and use the exceptions in various stages of the data science pipeline:
Example:
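A hedged sketch of raising these exceptions in pipeline code; the class names come from the table above, while the import path, validation logic, and field names are illustrative:

```python
# The import path is an assumption; the class names are documented above.
from sample_ds_project.exception import DataValidationError, PredictionError


def validate_input(data: dict) -> None:
    # Hypothetical required field, used only to illustrate a validation check.
    if "features" not in data:
        raise DataValidationError("Input data is missing the required 'features' field")


def predict(model, data: dict):
    try:
        return model.predict(data["features"])
    except KeyError as err:
        raise PredictionError("Incompatible input format passed to the model") from err
```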
Features
- Custom exceptions for specific pipeline stages, ensuring meaningful error reporting.
- Enables targeted exception handling, reducing debugging time.
- Provides a consistent structure for error messages across the project.
Purpose
- To define project-specific exceptions for common error scenarios in the pipeline.
- To improve the robustness and reliability of the pipeline by enabling clear error handling.
- To make the debugging process more intuitive by raising descriptive errors.
Examples:
- Data Validation: Raise a `DataValidationError` if the input data schema is incorrect or missing required fields.
- Model Training: Raise a `ModelTrainingError` if the model fails to converge due to invalid hyperparameters.
- Prediction: Raise a `PredictionError` when incompatible input data is passed to the model.
Additional Notes
- Use these exceptions in conjunction with logging to provide detailed error information.
- Ensure that custom exceptions are raised with meaningful messages to assist in debugging and error resolution.
CustomException¶
Bases: HTTPException
Source code in sample_ds_project/exception/__init__.py
__init__(status_code, detail)¶
Custom exception for handling API errors.
:param status_code: The HTTP status code to return.
:param detail: A string describing the error in detail.
Source code in sample_ds_project/exception/__init__.py
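Because CustomException subclasses FastAPI's HTTPException, it can be raised directly from a route handler. The handler below is an illustrative sketch, not the project's actual route:

```python
from sample_ds_project.exception import CustomException


def predict_handler(payload: dict):
    # Illustrative guard clause; FastAPI converts the raised exception
    # into an HTTP error response with the given status code and detail.
    if not payload:
        raise CustomException(status_code=400, detail="Request payload is empty")
```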
Entities¶
This module defines the core entities and data structures used throughout the data science pipeline. Entities are designed to represent the inputs, outputs, and intermediate states of the model training and prediction processes, ensuring consistency and validation across the project.
Modules:
Name | Description |
---|---|
data_schema | Contains definitions for input data schemas, ensuring validation and compatibility with the pipeline. |
model_params | Defines structures for storing model parameters, hyperparameters, and configuration settings. |
prediction_result | Provides entities for representing and managing prediction outputs, including probabilities and metadata. |
Usage
Import and use the required entities in your data science pipeline:
Example:
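A hedged sketch of what such entities might look like; the class and field names below are illustrative, not the project's actual definitions:

```python
from dataclasses import dataclass, field

# Illustrative entity shapes only; the real definitions live in the
# data_schema, model_params, and prediction_result modules.
@dataclass
class ModelParams:
    learning_rate: float = 0.01
    batch_size: int = 32
    optimizer: str = "adam"


@dataclass
class PredictionResult:
    predicted_class: str
    probability: float
    metadata: dict = field(default_factory=dict)
```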
Features
- Defines standardized data structures for inputs, outputs, and parameters.
- Ensures validation and consistency in data passed through the pipeline.
- Promotes maintainability and readability by using clear, reusable entities.
Purpose
- Serves as a single source of truth for defining data structures in the pipeline.
- Facilitates seamless integration between different stages of the pipeline, such as data ingestion, validation, model training, and prediction.
- Improves error handling by validating data early in the process.
Examples:
- Data Schema: Define the expected input structure for data preprocessing.
- Model Parameters: Store configurations like learning rate, batch size, and optimizer type.
- Prediction Results: Represent the model's outputs in a structured format, including predicted classes, probabilities, and confidence scores.
Pipeline¶
The `pipeline` module orchestrates the end-to-end flow of the data science process, from raw data ingestion to final predictions. It consists of multiple submodules, each responsible for a specific stage in the pipeline. This modular structure ensures scalability, reusability, and ease of maintenance. The pipeline is designed to handle data preprocessing, model training, evaluation, and predictions in a systematic and automated manner.
Modules:
Name | Description |
---|---|
Data-Ingestion | Collects and ingests raw data from various sources, performs basic checks, and stores it in a structured format. |
Data-Validation | Validates ingested data for correctness, completeness, and consistency, ensuring it meets predefined quality standards. |
Data-Transformation | Transforms validated data into a format suitable for model training, including feature engineering, scaling, encoding, and preprocessing. |
Model-Training | Trains machine learning models using transformed data, supports hyperparameter tuning, saving trained models, and logging metrics. |
Model-Evaluation | Evaluates trained models on a validation or test dataset, providing detailed performance metrics and insights. |
Prediction | Uses trained models to make predictions on new or unseen data, including batch or real-time inference and post-processing of predictions. |
Features
- Modular architecture for each pipeline stage, ensuring maintainability and reusability.
- Support for extensive logging and error handling at each stage.
- Flexibility to customize and extend pipeline stages as needed.
- Compatibility with various data formats and storage systems.
Usage
Import and use specific pipeline stages or run the entire pipeline end-to-end:
Example:
from pipeline.stage_01_data_ingestion import DataIngestionTrainingPipeline
from pipeline.stage_04_model_trainer import ModelTrainingPipeline
# Perform data ingestion
data_ingestion = DataIngestionTrainingPipeline(config)
raw_data = data_ingestion.run()
# Train the model
model_trainer = ModelTrainingPipeline(config, raw_data)
trained_model = model_trainer.run()
Purpose
- To streamline the execution of a data science workflow, reducing manual intervention.
- To ensure consistency and traceability of processes across multiple runs.
- To provide reusable components for different machine learning projects.
Routes¶
The `routes` module defines the API routes that enable interaction with the machine learning model. This module contains endpoints for initiating model training and making predictions on new data.
Endpoints
- POST /train-model: Triggers the model training process.
- POST /predict: Generates predictions on new data using the trained model.
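A hedged sketch of calling these endpoints from a client; the host, port, and payload fields are placeholders:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder host and port

# Trigger model training via the /train-model endpoint.
train_response = requests.post(f"{BASE_URL}/train-model")
print(train_response.status_code)

# Request a prediction; the payload structure is illustrative.
predict_response = requests.post(
    f"{BASE_URL}/predict",
    json={"features": [5.1, 3.5, 1.4, 0.2]},
)
print(predict_response.json())
```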
Features
- API endpoints for model training and prediction.
- Flexible and easy-to-extend with additional routes.
- Integration with the model training pipeline and prediction modules.
- Handles input validation and error responses for robustness.