combiners package#
Submodules#
combiners.combiner module#
- class combiners.combiner. CSVCombineTask ( * args , ** kwargs )#
-
Bases:
CatalystTask- _abc_impl = <_abc._abc_data object> #
- _namespace_at_class_time = '' #
- _read () DataFrame #
- _read_and_combine ( files_to_combine : list [ pathlib.Path ] ) DataFrame #
-
Read in and combine all files in the list.
- Parameters:
-
files_to_combine (list[Path]) – The list of files to combine.
- Returns:
-
The combined dataframe.
- Return type:
-
pd.DataFrame
- _write ( df : DataFrame )#
- ingestor_name = <luigi.parameter.Parameter object> #
- output ()#
-
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single
Targetor a list ofTargetinstances.- Implementation note
-
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.
See Task.output
- requires ()#
-
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
- combiners.combiner. log_assertion ( assertion : bool , message : str , log_obj : Logger )#
-
Utility to log an assertion and raise an AssertionError if the assertion is False.
- Parameters:
-
-
assertion (bool) – The assertion to check.
-
message (str) – The message to log.
-
log_obj (logging.Logger) – The logger to use.
-
- Raises:
-
err – AssertionError
combiners.generic module#
- class combiners.generic. GenericMultiCSVIngestionTask ( * args , ** kwargs )#
-
Bases:
GenericCSVIngestionTask,ABCGeneric CSV ingestion task that reads in multiple CSV files and writes them to a single CSV file.
- _abc_impl = <_abc._abc_data object> #
- _namespace_at_class_time = '' #
- _read () DataFrame #
- requires ()#
-
The Tasks that this Task depends on.
A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.
See Task.requires
combiners.load_file module#
- class combiners.load_file. LoadFileTask ( * args , ** kwargs )#
-
Bases:
CatalystTask- _abc_impl = <_abc._abc_data object> #
- _handle_bad_line ( line : list [ str ] ) list [ str ] | None #
-
Handle bad lines in the CSV file.
- Parameters:
-
line (list[str]) – The line to handle, split by delimiter.
- Returns:
-
The corrected line, or None if the line should be dropped.
- Return type:
-
list[str] | None
- _namespace_at_class_time = '' #
- _read () DataFrame #
- _write ( df : DataFrame )#
- csv_path = <luigi.parameter.Parameter object> #
- ingestor_name = <luigi.parameter.Parameter object> #
- output ()#
-
The output that this Task produces.
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single
Targetor a list ofTargetinstances.- Implementation note
-
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.
See Task.output