Built-in modules
Built-in modules in ECiDA are reusable components designed to handle common data ingestion and processing tasks. They provide ready-to-use solutions that simplify pipeline development, especially when working with external data sources or services. This guide introduces each built-in module, outlining its purpose, configuration, inputs and outputs, and links to the implementation.
MinIO Data Loader
The MinIO Data Loader is a built-in module that simplifies loading data from a MinIO bucket. Instead of uploading files directly through Git, this module allows users to provide a reference path, enabling more flexible and scalable data ingestion. It can also be configured to periodically push the same reference URL, supporting continuous or trigger-based downstream processing.
Use Cases
- Loading large datasets or files without committing them to Git.
- Streamlining pipeline development by referencing external data stores.
- Improving integration with MinIO or similar S3-compatible services.
Endpoints
Inputs:
- None: This module does not consume any inputs.
Outputs:
minio-url: obj
: An ecida-s3://... string that represents the file location in MinIO. This reference can be used by downstream modules to pull the actual data.
Configurations:
local-minio-path: string
: The full path to the file within the MinIO bucket.
The path specified above must exist and be accessible from the MinIO instance configured in the ECiDA environment. Files can be uploaded to a MinIO bucket via the MinIO Console. This interface allows you to create buckets and upload files to existing buckets. Access the console in the app by clicking “Files” in the top bar, or navigate to it directly via a URL with the same structure as the app URL. For example, when the app is at app.dev.ecida.io, files can be explored via files.dev.ecida.io.
Credentials to access the MinIO Console can be provided to you; these should grant you permission to upload files. If you have not yet received your credentials, or are unable to upload files, please contact us at [email protected].
After logging in, you will first see the Object Browser, where all available buckets are listed. Clicking a bucket opens it, showing all files it contains. You can upload your files from this page by clicking “Upload” in the top right of the page.
If you have permission to create buckets, you can do so via the Buckets page in the side bar. Here, all buckets are listed again, and you can create new buckets by clicking “Create Bucket” in the top right of the page.
periodic-push: string
: Either “true” to emit the same ecida-s3:// URL every 10 seconds, or any other value (or unset) to emit the URL once and then sleep indefinitely.
Example Configuration
A typical configuration for the MinIO Data Loader would look like:
local-minio-path: my-bucket/data.csv
periodic-push: true
This configuration will generate a corresponding ecida-s3:// reference URL that is emitted every 10 seconds. Downstream modules can repeatedly pull this reference to fetch the latest version of the file.
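To illustrate how a downstream module might consume the emitted reference, here is a minimal parsing sketch. Only the ecida-s3:// scheme comes from this document; the helper name and parsing details are assumptions, not part of the ECiDA API.

```python
# Hypothetical sketch: resolving an ecida-s3:// reference emitted by the
# MinIO Data Loader into a (bucket, object key) pair. The function name
# and behaviour are assumptions for illustration only.
from urllib.parse import urlparse

def parse_minio_reference(url: str) -> tuple[str, str]:
    """Split an ecida-s3:// URL into (bucket, object_key)."""
    parsed = urlparse(url)
    if parsed.scheme != "ecida-s3":
        raise ValueError(f"expected an ecida-s3:// URL, got: {url}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")  # drop the leading slash from the path
    return bucket, key

bucket, key = parse_minio_reference("ecida-s3://my-bucket/data.csv")
# bucket == "my-bucket", key == "data.csv"
```

A downstream module could then use the bucket and key with any S3-compatible client to fetch the actual file.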
API Caller Module
The API Caller Module is a built-in component that periodically fetches data from public APIs using configurable settings. It enables dynamic data ingestion by integrating external APIs into ECiDA pipelines. The module supports both authenticated and unauthenticated API access and includes basic error handling and logging.
Use Cases
- Fetching real-time data from public APIs (e.g., weather, finance, or public datasets).
- Integrating dynamic external data sources into analytics pipelines.
- Periodically refreshing datasets from various APIs without manual intervention.
Endpoints
Inputs:
- None: The module triggers itself periodically based on the configured interval.
Outputs:
api-response: json
: The fetched API response (typically in JSON format).
Configurations:
The following configuration fields define how the API call should be made:
base_url: url
: The base URL of the target API.
endpoint: url
: The specific endpoint to be appended to the base URL.
params: json
: Optional query parameters.
headers: json
: Optional headers (e.g., for authentication).
method: http method
: HTTP method to use (e.g., GET).
timeout: float
: Request timeout duration (in seconds).
interval: int
: Interval between API calls (in milliseconds).
Example Configuration
A configuration for fetching weather data would look like:
base_url: https://api.open-meteo.com
endpoint: /v1/forecast
params: {
"latitude": 52.098,
"longitude": 5.128,
"current_weather": true,
"hourly": "temperature_2m",
"timezone": "Europe/Amsterdam"
}
headers: {}
method: GET
timeout: 10
interval: 60000
This configuration fetches current weather data from the Open-Meteo API every 60 seconds.
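The fetch loop described above can be sketched with the standard library as follows. The function names and wiring are assumptions; only the configuration fields (base_url, endpoint, params, headers, method, timeout, interval) mirror the documented settings.

```python
# A sketch of the API Caller's periodic fetch behaviour, assuming the
# configuration arrives as a plain dict. Uses only the standard library.
import json
import time
import urllib.parse
import urllib.request

def build_url(base_url: str, endpoint: str, params: dict) -> str:
    """Join base_url and endpoint, appending params as a query string."""
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return url

def fetch_once(config: dict) -> dict:
    """Perform a single API call and decode the JSON response."""
    url = build_url(config["base_url"], config["endpoint"], config.get("params", {}))
    req = urllib.request.Request(url, headers=config.get("headers", {}),
                                 method=config.get("method", "GET"))
    with urllib.request.urlopen(req, timeout=config.get("timeout", 10)) as resp:
        return json.load(resp)

def run(config: dict, emit) -> None:
    """Fetch periodically; the interval is in milliseconds, per the configuration."""
    while True:
        emit(fetch_once(config))
        time.sleep(config["interval"] / 1000)
```

With the example configuration above, `build_url` would produce `https://api.open-meteo.com/v1/forecast?latitude=52.098&...`, and `run` would call the API once per minute.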
Column Selector Module
The Column Selector Module is a built-in module that filters a tabular dataset by retaining or excluding user‑specified columns. It takes a pandas DataFrame as input, checks whether the requested columns exist, applies the selected strategy, and returns a new DataFrame with the original column order preserved.
Use Cases
- Reducing dataset width to only relevant features before modeling.
- Removing personally identifiable columns for privacy compliance.
- Improving performance by dropping unused fields in large tables.
Endpoints
Inputs:
dataframe: obj
: Expects a pandas DataFrame provided by an upstream module.
Outputs:
filtered-dataframe: obj
: A pandas DataFrame containing only the selected or remaining columns.
Configurations:
columns: string
: A comma-separated list of column names to apply the strategy to (e.g., id, age, salary).
mode: string
: Either keep to retain only the listed columns, or remove to exclude them from the dataset.
Example Configuration
columns: id,age,salary
mode: keep
With this configuration, an input DataFrame containing columns:
[userId, id, name, age, salary, department]
will produce an output DataFrame with columns:
[id, age, salary]
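The keep/remove behaviour above can be sketched in pandas as follows; the function signature and error handling are assumptions about the implementation.

```python
# A sketch of the Column Selector's filtering logic, assuming the
# configuration values arrive as plain strings. Original column order
# is preserved by iterating over df.columns rather than the request list.
import pandas as pd

def select_columns(df: pd.DataFrame, columns: str, mode: str) -> pd.DataFrame:
    requested = [c.strip() for c in columns.split(",") if c.strip()]
    missing = [c for c in requested if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in DataFrame: {missing}")
    if mode == "keep":
        selected = [c for c in df.columns if c in requested]
    elif mode == "remove":
        selected = [c for c in df.columns if c not in requested]
    else:
        raise ValueError(f"mode must be 'keep' or 'remove', got: {mode}")
    return df[selected]

df = pd.DataFrame(columns=["userId", "id", "name", "age", "salary", "department"])
print(list(select_columns(df, "id,age,salary", "keep").columns))
# ['id', 'age', 'salary']
```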
Missing Data Handler Module
The Missing Data Handler module addresses common data‑quality issues by detecting and handling null or placeholder values within specified columns of a pandas DataFrame. It uses the same strategies available in pandas, such as dropping rows, filling with mean, median, or a constant, and forward/backward filling, to ensure clean, reliable input for downstream processing.
Use Cases
- Removing incomplete records before training a machine‑learning model.
- Imputing missing numeric features with summary statistics (mean or median).
- Filling categorical or mixed‑type columns with a constant placeholder.
- Carrying forward or backward the last valid observation in time‑series data.
Endpoints
Inputs:
dataframe: obj
: Expects a pandas DataFrame provided by an upstream module.
Outputs:
cleaned-dataframe: obj
: A pandas DataFrame with missing values handled according to the chosen strategy.
Configurations:
columns: string
: A comma-separated list of column names to apply the strategy to (e.g. age,salary). If omitted or empty, the strategy applies to all DataFrame columns.
strategy: string
: Specifies how missing values should be handled. Choose one of the following:
  - drop: Removes any row with missing values in the target columns.
  - fill-mean: Replaces missing numeric values with the column mean.
  - fill-median: Replaces missing numeric values with the column median.
  - fill-constant: Replaces missing values (any data type) with a user-provided constant.
  - forward-fill: Fills missing values by propagating the last valid value forward.
  - backward-fill: Fills missing values by propagating the next valid value backward.
constant-value: string
: Required when strategy is fill-constant; the value to use for imputation.
- Column names are matched exactly (case-sensitive).
- Placeholder strings (e.g., nan, null, empty, -, missing, unknown) are normalized to actual NaN before applying the strategy.
- Non-numeric columns are ignored by 'fill-mean' and 'fill-median' strategies.
- When using the 'fill-constant' strategy, a missing or empty 'constant-value' will result in an error.
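The strategies above map closely onto standard pandas operations. The following is a condensed sketch under that assumption; the function signature and internals are illustrative, not the module's actual code.

```python
# A sketch of the Missing Data Handler: placeholder strings are first
# normalized to real NaN, then the chosen strategy is applied per column.
import numpy as np
import pandas as pd

# Placeholder strings treated as missing, per the notes above.
PLACEHOLDERS = ["nan", "null", "empty", "-", "missing", "unknown"]

def handle_missing(df, strategy, columns=None, constant=None):
    df = df.copy()
    targets = columns or list(df.columns)  # empty columns config => all columns
    df[targets] = df[targets].replace(PLACEHOLDERS, np.nan)
    if strategy == "drop":
        return df.dropna(subset=targets)
    for col in targets:
        if strategy == "fill-mean" and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        elif strategy == "fill-median" and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        elif strategy == "fill-constant":
            if constant in (None, ""):
                raise ValueError("fill-constant requires constant-value")
            df[col] = df[col].fillna(constant)
        elif strategy == "forward-fill":
            df[col] = df[col].ffill()
        elif strategy == "backward-fill":
            df[col] = df[col].bfill()
    return df
```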
Example Configuration
strategy: fill-mean
columns: Numeric_Col, String_Col
With this configuration, given an input DataFrame containing columns:
[Numeric_Col, String_Col, Mixed_Col]
any NaN or recognized placeholder in the Numeric_Col column is replaced with the column’s mean, calculated from the current DataFrame. The String_Col column is ignored because mean imputation is only applied to numeric data. The cleaned DataFrame is then emitted on the cleaned-dataframe output.
Normalizer Module
The Normalizer module ensures consistent scaling of numeric features in a pandas DataFrame by applying scikit-learn’s Standard, Min–Max, or Robust scaler. Mapping all features onto a shared scale prevents high-magnitude variables from dominating downstream algorithms, delivering cleaner and more reliable inputs for model training.
Use Cases
- Scaling features to zero mean and unit variance for variance-sensitive algorithms (e.g., SVM, k-means).
- Rescaling features to a fixed range (e.g., 0,1) for neural networks or any model requiring bounded inputs.
- Reducing the impact of outliers by applying quantile-based (robust) scaling.
- Improving convergence speed and overall performance of machine learning models by enforcing a consistent feature scale.
Endpoints
Inputs:
dataframe: obj
: Expects a pd.DataFrame provided by an upstream module.
Outputs:
normalized-dataframe: obj
: A pd.DataFrame in which the selected numeric columns have been replaced by their normalized versions.
Configurations:
numeric-columns: string
: A comma-separated list of numeric column names to normalize. If left empty, the strategy is applied to all numeric columns in the DataFrame.
strategy: string
: Selects the normalization technique. Options:
  - standard: Apply zero-mean, unit-variance scaling via scikit-learn’s StandardScaler.
  - minmax: Rescale features into a specified range via scikit-learn’s MinMaxScaler.
  - robust: Scale features based on a percentile range via scikit-learn’s RobustScaler.
range: string
: Optional comma-separated pair of numbers defining the target scaling window:
  - For minmax: <min>,<max> (e.g. 0,1), passed as feature_range.
  - For robust: <q_low>,<q_high> percentiles (e.g. 25,75), passed as quantile_range.
- Column names are matched exactly (case-sensitive).
- Original columns are replaced by their scaled versions.
- Non-numeric columns are ignored by the module.
- If the range configuration is omitted or left blank, 'minmax' scaling defaults to 0,1 and robust scaling defaults to 25,75.
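How the strategy and range configurations map onto scikit-learn's scalers can be sketched as follows. The parsing helper and wiring are assumptions; the scaler classes and their feature_range/quantile_range parameters are scikit-learn's documented API.

```python
# A sketch of the Normalizer's strategy dispatch. Defaults match the
# notes above: minmax falls back to 0,1 and robust to 25,75.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

def make_scaler(strategy, range_str=""):
    if range_str:
        lo, hi = (float(x) for x in range_str.split(","))
    if strategy == "standard":
        return StandardScaler()
    if strategy == "minmax":
        return MinMaxScaler(feature_range=(lo, hi) if range_str else (0, 1))
    if strategy == "robust":
        return RobustScaler(quantile_range=(lo, hi) if range_str else (25.0, 75.0))
    raise ValueError(f"unknown strategy: {strategy}")

def normalize(df, strategy, numeric_columns=None, range_str=""):
    df = df.copy()
    # Empty numeric-columns config => all numeric columns in the DataFrame.
    cols = numeric_columns or df.select_dtypes("number").columns.tolist()
    scaler = make_scaler(strategy, range_str)
    # Original columns are replaced in place by their scaled versions.
    df[cols] = scaler.fit_transform(df[cols])
    return df
```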
Example Configuration
strategy: robust
numeric-columns: Numeric1, Category
range: 5,95
Given an input DataFrame with columns:
[Numeric1, Numeric2, Numeric3, Category]
This configuration causes the module to apply Robust scaling only to the Numeric1 column, mapping its values into the 5th–95th percentile range, while leaving Numeric2 and Numeric3 untouched and ignoring the non-numeric Category column. The resulting DataFrame is emitted on the normalized-dataframe output, with Numeric1 replaced by its scaled version and all other columns preserved.
Feature Encoder Module
The Feature Encoder module converts categorical columns in a pandas DataFrame into numeric representations, supporting both integer label encoding and sparse one‑hot encoding. This ensures downstream machine learning algorithms can directly consume categorical data.
Use Cases
- Transforming string or categorical features into integer labels for tree-based models.
- Generating one-hot vectors for categorical variables in linear or neural network models.
- Preprocessing mixed-type datasets by encoding only selected categorical columns.
- Maintaining original features alongside encoded versions for analysis or debugging.
Endpoints
Inputs:
dataframe: obj
: Expects a pd.DataFrame provided by an upstream module.
Outputs:
encoded-dataframe: obj
: A pd.DataFrame in which the specified categorical columns are replaced or extended by their encoded versions.
Configurations:
categorical-columns: string
: A comma-separated list of categorical column names to encode. If omitted, all categorical columns are encoded.
strategy: string
: Chooses the encoding method:
  - label: Use scikit-learn’s LabelEncoder to map categories to integer labels.
  - onehot: Use scikit-learn’s OneHotEncoder to create binary indicator columns.
replace-original: string
: If true, the original categorical columns are dropped after encoding; if false, they are retained. Defaults to true.
- Column names are matched exactly (case-sensitive).
- Non-categorical columns are ignored by the module.
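The encoding behaviour can be sketched using pandas equivalents (pd.factorize for label encoding, pd.get_dummies for one-hot) in place of the scikit-learn encoders the module actually uses; the function signature and wiring are likewise assumptions.

```python
# A sketch of the Feature Encoder's label/one-hot behaviour. Non-categorical
# (numeric) columns are skipped, mirroring the note above.
import pandas as pd

def encode(df, strategy, categorical_columns=None, replace_original=True):
    df = df.copy()
    # Empty config => all non-numeric columns are treated as categorical.
    cols = categorical_columns or df.select_dtypes(exclude="number").columns.tolist()
    for col in cols:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # non-categorical columns are ignored
        if strategy == "label":
            # Integer labels, analogous to LabelEncoder's sorted mapping.
            df[f"{col}_label"] = pd.factorize(df[col], sort=True)[0]
        elif strategy == "onehot":
            # Binary indicator columns, analogous to OneHotEncoder output.
            dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
            df = pd.concat([df, dummies], axis=1)
        if replace_original:
            df = df.drop(columns=[col])
    return df
```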
Example Configuration
strategy: onehot
categorical-columns: Category1, Numeric1
replace-original: false
Given an input DataFrame with columns:
[Category1, Category2, Numeric1, Numeric2]
With this configuration, only the Category1 column is encoded (the non-categorical Numeric1 is ignored), all original columns are kept, and the resulting DataFrame is emitted on the encoded-dataframe output.