API

Generally, each module has a submodule that shares its name and defines generic components, plus an image submodule that defines components for working with images. Items in the former are imported into the module namespace, so you can write, e.g., from wildebeest.path_funcs import join_outdir_filename_extension rather than from wildebeest.path_funcs.path_funcs import join_outdir_filename_extension.

pipelines

pipelines

Pipeline class definitions

class wildebeest.pipelines.pipelines.Pipeline(load_func, write_func, ops=None)

Class for defining file processing pipelines.

load_func

Callable that takes a string or Path object as single positional argument, reads from the corresponding location, and returns some representation of its contents.

write_func

Callable that takes the output of the last element of ops (or the output of load_func if ops is None or empty) and a string or Path object and writes the former to the location specified by the latter.

ops

Iterable of callables each of which takes a single positional argument. The first element of ops must accept the output of load_func, and each subsequent element must accept the output of the immediately preceding element. It is recommended that every element of ops take and return one common data structure (e.g. NumPy arrays for image data) so that those elements can be recombined easily.

property run_report_

Pandas DataFrame of information about the most recent run.

Stores input path in the index, output path as “outpath”, Boolean indicating whether the file was skipped as “skipped”, the repr of an exception object that was handled during processing if any as “error” (np.nan if no exception was handled), and a timestamp indicating when processing completed as “time_finished”.

May include additional custom fields in a CustomReportingPipeline.

Raises

AttributeError – If no run report is available because pipeline has not been run.

__call__(inpaths, path_func, n_jobs, skip_func=None, exceptions_to_catch=Exception)

Run the pipeline.

Across n_jobs threads, for each path in inpaths, if skip_func(path, path_func(path)) is True, skip that path. Otherwise, use load_func to get the resource from that path, pipe its output through ops, and write out the result with write_func.

Parameters
  • inpaths (Iterable[Union[Path, str]]) – Iterable of string or Path objects pointing to resources to be processed and written out.

  • path_func (Callable[[Union[Path, str]], Union[Path, str]]) – Function that takes each input path and returns the desired corresponding output path.

  • n_jobs (int) – Number of threads to use.

  • skip_func (Optional[Callable[[Union[Path, str], Any], bool]]) – Callable that takes an input path and a prospective output path and returns a Boolean indicating whether or not that file should be skipped; for instance, use lambda inpath, outpath: Path(outpath).is_file() to avoid overwriting existing files.

  • exceptions_to_catch (Union[Exception, Tuple[Exception], None]) – Tuple of exception types to catch. An exception of one of these types will be added to the run report, but the pipeline will continue to execute. All exceptions will be logged whether they are caught or not.

Note

Stores a run report in self.run_report_.

Return type

DataFrame
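
The run semantics described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not Wildebeest's implementation: run_pipeline_sketch and the toy load, op, and write functions are hypothetical, and the report is a plain dict rather than a DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_pipeline_sketch(inpaths, path_func, load_func, ops, write_func,
                        n_jobs=1, skip_func=None, exceptions_to_catch=Exception):
    """Hypothetical stand-in that mirrors the documented control flow."""

    def process(inpath):
        outpath = path_func(inpath)
        record = {"outpath": outpath, "skipped": False, "error": None}
        if skip_func is not None and skip_func(inpath, outpath):
            record["skipped"] = True
            return inpath, record
        try:
            data = load_func(inpath)
            # Pipe the loaded data through each op in order.
            data = reduce(lambda value, op: op(value), ops or [], data)
            write_func(data, outpath)
        except exceptions_to_catch as exc:
            # Record the exception and keep processing the other paths.
            record["error"] = repr(exc)
        return inpath, record

    with ThreadPoolExecutor(max_workers=n_jobs) as executor:
        return dict(executor.map(process, inpaths))


written = {}
report = run_pipeline_sketch(
    inpaths=["a", "b", "skip-me"],
    path_func=lambda inpath: inpath + ".out",
    load_func=lambda inpath: inpath.upper(),
    ops=[lambda text: text * 2],
    write_func=lambda data, outpath: written.__setitem__(outpath, data),
    n_jobs=2,
    skip_func=lambda inpath, outpath: inpath.startswith("skip"),
)
```

Here the skipped path never reaches load_func, and each processed path ends up in the report keyed by its input path, which matches the layout of run_report_ described earlier.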

class wildebeest.pipelines.pipelines.CustomReportingPipeline(load_func, write_func, ops=None)

Class for defining file processing pipelines with custom run recording.

Differences from Pipeline parent class:

load_func, each element of ops, and write_func must each accept the string or Path object indicating the input item’s location as an additional positional argument.

Each element of ops and write_func must each accept a defaultdict(dict) object as an additional positional argument. Functions defined in Wildebeest call this item log_dict.

Inside those functions, adding items to log_dict[inpath] causes them to be added to the “run record” DataFrame that the pipeline returns.
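
As an illustration of these signatures, the sketch below defines functions in the shape described above and calls them by hand. The function names and the manual call sequence are hypothetical, and it assumes the extra arguments are supplied under the names inpath and log_dict (as the signature of report_output suggests); in real use the pipeline supplies them.

```python
from collections import defaultdict


def load_text(inpath, **kwargs):
    # Stand-in loader: pretend the "file contents" are the path itself.
    return str(inpath)


def record_length(text, inpath, log_dict):
    # Anything stored under log_dict[inpath] becomes a field in the run report.
    log_dict[inpath]["n_chars"] = len(text)
    return text


def write_text(text, outpath, inpath, log_dict):
    # Stand-in writer: record where the output would have gone.
    log_dict[inpath]["written_to"] = str(outpath)


log_dict = defaultdict(dict)
inpath = "example.txt"
text = load_text(inpath)
text = record_length(text, inpath=inpath, log_dict=log_dict)
write_text(text, "out/example.txt", inpath=inpath, log_dict=log_dict)
```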

image

Image-processing pipelines

class wildebeest.pipelines.image.DownloadImagePipeline(ops=None)

Class for defining a pipeline that downloads images.

ops

See wildebeest.pipelines.Pipeline.

load_funcs

load_funcs

Functions for loading generic data

wildebeest.load_funcs.load_funcs.get_response(url, timeout=5, **kwargs)

Make a GET request to the specified URL and return the response.

Maintain a common session within each thread to reduce request overhead.

Note

Retry up to ten times on requests.exceptions.ConnectionError instances and 5xx status codes, waiting 2^x seconds between each retry, with a max of 10 seconds.

Log an error without retrying or raising for 403 and 404 status codes.

Raise requests.exceptions.HTTPError for other non-200 status codes.

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters
  • url (str) – URL of file to download

  • timeout (int) – Number of seconds to wait before timing out if server has not issued a response.

Return type

Response
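
The retry schedule in the note above can be made concrete with a little arithmetic. This sketch assumes that x counts retries starting from 1 and that the 10-second maximum applies to each individual wait; neither detail is confirmed by the text, and backoff_waits is a hypothetical helper, not part of Wildebeest.

```python
def backoff_waits(max_retries=10, cap_seconds=10):
    """Return the wait in seconds before each retry: 2**x, capped."""
    # Assumption: x is the retry number, starting at 1.
    return [min(2 ** x, cap_seconds) for x in range(1, max_retries + 1)]


waits = backoff_waits()
total_wait = sum(waits)
```

Under these assumptions the waits grow as 2, 4, 8 and then stay at the 10-second cap for the remaining retries.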

image

Functions for loading images

wildebeest.load_funcs.image.load_image_from_url(inpath, **kwargs)

Download an image

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters

inpath (str) – Image URL

Return type

array

wildebeest.load_funcs.image.load_image_from_disk(inpath, **kwargs)

Load image from disk

Assumes that image is RGB(A) if it has at least three channels.

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters

inpath (Union[Path, str]) – Path to local image file

Raises

ValueError – If image fails to load

Return type

array

ops

helpers

report

Code for reporting information calculated during processing.

wildebeest.ops.helpers.report.report_output(func_input, func, inpath, log_dict, key)

Add the output of a function to log_dict[inpath][key].

Return the input to that function in order to pass it along within a pipeline.

Intended to be used within a CustomReportingPipeline to add the output of the function to the run report.

Examples

>>> from functools import partial
>>>
>>> from wildebeest import CustomReportingPipeline
>>> from wildebeest.load_funcs.image import load_image_from_url
>>> from wildebeest.ops import report_output
>>> from wildebeest.ops.image import calculate_mean_brightness
>>> from wildebeest.path_funcs import join_outdir_filename_extension
>>> from wildebeest.write_funcs.image import write_image
>>>
>>> report_mean_brightness = partial(
...     report_output, func=calculate_mean_brightness, key='mean_brightness'
... )
>>>
>>> report_brightness_pipeline = CustomReportingPipeline(
...     load_func=load_image_from_url, ops=[report_mean_brightness], write_func=write_image
... )
>>>
>>> image_filenames = ['2RsJ8EQ', '2TqoToT', '2VocS58', '2scKPIp', '2TsO6Pc', '2SCv0q7']
>>> image_urls = [f'https://bit.ly/{filename}' for filename in image_filenames]
>>>
>>> keep_filename_png_in_cwd = partial(
...     join_outdir_filename_extension, outdir='.', extension='.png'
... )
>>> report_brightness_pipeline(
...     inpaths=image_urls,
...     path_func=keep_filename_png_in_cwd,
...     n_jobs=1,
... )
>>> print(report_brightness_pipeline.run_report_)
                            mean_brightness  ... time_finished
https://bit.ly/2RsJ8EQ        78.570605  ...  1.571842e+09
https://bit.ly/2SCv0q7       130.348113  ...  1.571842e+09
https://bit.ly/2TqoToT        82.677745  ...  1.571842e+09
https://bit.ly/2TsO6Pc       151.596546  ...  1.571842e+09
https://bit.ly/2VocS58        72.072578  ...  1.571842e+09
https://bit.ly/2scKPIp       117.491313  ...  1.571842e+09
Parameters
  • func_input (Any) – Input to func

  • func (Callable) – Function whose output is to be reported

  • inpath (Union[Path, str]) – Input path associated with func_input

  • log_dict (DefaultDict[str, dict]) – Dictionary for storing function output (in log_dict[inpath][key])

  • key (Hashable) – Dictionary key in which to store function output for each inpath

Returns

func_input

Return type

Any

Note

Assigns func(func_input) to log_dict[inpath][key]
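
The behavior stated in the note can be sketched in a few lines. This is a minimal stand-in written from the description above, not the library's source; report_output_sketch and the sample values are hypothetical.

```python
from collections import defaultdict


def report_output_sketch(func_input, func, inpath, log_dict, key):
    # Store the function's output for this input path ...
    log_dict[inpath][key] = func(func_input)
    # ... and pass the original input along unchanged.
    return func_input


log_dict = defaultdict(dict)
values = [1, 2, 3, 4]
passed_along = report_output_sketch(
    values, func=sum, inpath="a.txt", log_dict=log_dict, key="total"
)
```

Because the input is returned untouched, a reporting step of this kind can sit between two ordinary ops without changing what the second one receives.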

wildebeest.ops.helpers.report.get_report_output_decorator(key)

Get a decorator that modifies a function to add its output to log_dict[inpath][key] and return its input.

Intended to be used to adapt a function for use within a CustomReportingPipeline.

Examples

>>> from functools import partial
>>>
>>> from wildebeest import CustomReportingPipeline
>>> from wildebeest.load_funcs.image import load_image_from_url
>>> from wildebeest.ops import get_report_output_decorator
>>> from wildebeest.ops.image import calculate_mean_brightness
>>> from wildebeest.path_funcs import join_outdir_filename_extension
>>> from wildebeest.write_funcs.image import write_image
>>>
>>>
>>> @get_report_output_decorator(key='mean_brightness')
... def report_mean_brightness(image):
...     return calculate_mean_brightness(image)
>>>
>>>
>>> report_brightness_pipeline = CustomReportingPipeline(
...     load_func=load_image_from_url, ops=[report_mean_brightness], write_func=write_image
... )
>>>
>>> image_filenames = ['2RsJ8EQ', '2TqoToT', '2VocS58', '2scKPIp', '2TsO6Pc', '2SCv0q7']
>>> image_urls = [f'https://bit.ly/{filename}' for filename in image_filenames]
>>>
>>> keep_filename_png_in_cwd = partial(
...     join_outdir_filename_extension, outdir='.', extension='.png'
... )
>>> report_brightness_pipeline(
...     inpaths=image_urls,
...     path_func=keep_filename_png_in_cwd,
...     n_jobs=1,
... )
>>> print(report_brightness_pipeline.run_report_)
                            mean_brightness  ... time_finished
https://bit.ly/2RsJ8EQ        78.570605  ...  1.571843e+09
https://bit.ly/2SCv0q7       130.348113  ...  1.571843e+09
https://bit.ly/2TqoToT        82.677745  ...  1.571843e+09
https://bit.ly/2TsO6Pc       151.596546  ...  1.571843e+09
https://bit.ly/2VocS58        72.072578  ...  1.571843e+09
https://bit.ly/2scKPIp       117.491313  ...  1.571843e+09
Parameters

key (Hashable) – Dictionary key in which to store function output for each inpath

Return type

Callable
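
A decorator with the behavior described above might look roughly like the following. This is a sketch written from the description, not the library's source; the _sketch suffix marks the hypothetical name, and the wrapped function's parameter names are assumptions.

```python
from collections import defaultdict
from functools import wraps


def get_report_output_decorator_sketch(key):
    def decorator(func):
        @wraps(func)
        def wrapped(func_input, inpath, log_dict):
            # Record the output under log_dict[inpath][key], return the input.
            log_dict[inpath][key] = func(func_input)
            return func_input

        return wrapped

    return decorator


@get_report_output_decorator_sketch(key="length")
def report_length(items):
    return len(items)


log_dict = defaultdict(dict)
result = report_length(["x", "y"], inpath="a.txt", log_dict=log_dict)
```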

image

stats

Functions that record information about an image

wildebeest.ops.image.stats.calculate_mean_brightness(image)

Calculate mean image brightness

Brightness is calculated by converting to grayscale if necessary and then taking the mean pixel value. Assumes image is grayscale, RGB, or RGBA.

Return type

float

wildebeest.ops.image.stats.calculate_dhash(image, sqrt_hash_size=8)

Calculate difference hash of image.

As a rule of thumb, with sqrt_hash_size=8, hashes from two images should typically have a Hamming distance less than 10 if and only if those images are “duplicates”, with some robustness to sources of noise such as resizing and JPEG artifacts. The Hamming distance between two hashes a and b is computed as bin(a ^ b).count("1").

Assumes image is grayscale, RGB, or RGBA.

Note

Based on Adrian Rosebrock, “Building an Image Hashing Search Engine with VP-Trees and OpenCV”, PyImageSearch, https://www.pyimagesearch.com/2019/08/26/building-an-image-hashing-search-engine-with-vp-trees-and-opencv/, accessed on 18 October 2019.

Parameters
  • image (array) – Image as a NumPy array

  • sqrt_hash_size (int) – Side length of 2D array used to compute hash, so that hash will be up to sqrt_hash_size^2 bits long.

Return type

array
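
The duplicate check from the rule of thumb above can be sketched as follows, assuming the two hashes are available as Python integers (which the bin(a ^ b) formula requires). The helper name and the sample hash values are made up for illustration.

```python
def hamming_distance(hash_a, hash_b):
    # XOR leaves a 1 bit wherever the two hashes differ; count those bits.
    return bin(hash_a ^ hash_b).count("1")


# Two made-up 8-bit hashes that differ in exactly two bit positions.
distance = hamming_distance(0b10110010, 0b10011010)
likely_duplicates = distance < 10
```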

transforms

Functions that take an image and return a transformed image

wildebeest.ops.image.transforms.resize(image, shape=None, min_dim=None, **kwargs)

Resize input image

shape or min_dim needs to be specified with partial before this function can be used in a Wildebeest pipeline.

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters
  • image (array) – NumPy array with two spatial dimensions and optionally an additional channel dimension

  • shape (Optional[Tuple[int, int]]) – Desired output shape in pixels in the form (height, width)

  • min_dim (Optional[int]) – Desired minimum spatial dimension in pixels; image will be resized so that it has this length along its smaller spatial dimension while preserving aspect ratio as closely as possible. Exactly one of shape and min_dim must be None.

Return type

array
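
To make the min_dim behavior concrete, the arithmetic below computes an output shape that preserves aspect ratio. It is a sketch of the idea only: shape_for_min_dim is a hypothetical helper, and rounding to the nearest pixel is an assumption about how "as closely as possible" is resolved.

```python
def shape_for_min_dim(height, width, min_dim):
    # Scale both sides by the same factor so the smaller side equals min_dim.
    scale = min_dim / min(height, width)
    # Assumption: round each side to the nearest whole pixel.
    return round(height * scale), round(width * scale)


new_shape = shape_for_min_dim(height=480, width=640, min_dim=240)
```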

wildebeest.ops.image.transforms.centercrop(image, reduction_factor, **kwargs)

Crop the center out of an image

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters
  • image (array) – NumPy array of an image. Handles 2D grayscale, RGB, and RGBA image arrays.

  • reduction_factor (float) – Scale of the center-cropped box. A value of 1.0 keeps the full image; a value of 0.4 yields a box of 0.4 * width by 0.4 * height.

Return type

array
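
The geometry implied by reduction_factor can be worked through directly. The helper below is hypothetical, and both the rounding and the way an odd leftover margin is split are assumptions rather than documented behavior.

```python
def center_crop_box(height, width, reduction_factor):
    """Return (top, left, crop_height, crop_width) of a centered crop box."""
    crop_height = round(height * reduction_factor)
    crop_width = round(width * reduction_factor)
    # Split the leftover margin evenly between the two opposite edges.
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, crop_height, crop_width


box = center_crop_box(height=100, width=200, reduction_factor=0.4)
```

For a 100 x 200 image and a factor of 0.4, the box is 40 pixels high and 80 wide, starting 30 pixels from the top and 60 from the left.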

wildebeest.ops.image.transforms.trim_padding(image, comparison_op, thresh, **kwargs)

Remove padding from an image

Remove rows and columns on the edges of the input image where the brightness on a scale of 0 to 1 satisfies comparison_op with respect to thresh. Brightness is evaluated by converting to grayscale and normalizing if necessary. For instance, using thresh=.95 and comparison_op=operator.gt will remove near-white padding, while using thresh=.05 and comparison_op=operator.lt will remove near-black padding.

kwargs is included only for compatibility with the CustomReportingPipeline class.

Assumes:

Image is grayscale, RGB, or RGBA.

Pixel values are scaled between either 0 and 1 or 0 and 255. If image is scaled between 0 and 255, then some pixel has a value greater than 1.

Parameters
  • image (array) – NumPy array of an image.

  • comparison_op (Callable) – How to compare pixel values to thresh

  • thresh (int) – Value to compare pixel values against

Return type

array
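
The trimming rule can be illustrated on a small grid of brightness values using plain lists. This sketch assumes that an edge row or column counts as padding only when every pixel in it satisfies comparison_op(pixel, thresh); the text does not spell that out, and trim_padding_sketch is a hypothetical stand-in rather than the library function.

```python
import operator


def trim_padding_sketch(rows, comparison_op, thresh):
    """Trim edge rows and columns from a grid of brightness values in [0, 1]."""

    def is_padding(pixels):
        return all(comparison_op(pixel, thresh) for pixel in pixels)

    # Trim padded rows from the top and bottom.
    while rows and is_padding(rows[0]):
        rows = rows[1:]
    while rows and is_padding(rows[-1]):
        rows = rows[:-1]
    # Transpose, trim padded columns the same way, and transpose back.
    cols = [list(col) for col in zip(*rows)]
    while cols and is_padding(cols[0]):
        cols = cols[1:]
    while cols and is_padding(cols[-1]):
        cols = cols[:-1]
    return [list(row) for row in zip(*cols)]


image = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.2, 0.5, 1.0],
    [1.0, 0.4, 0.1, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
trimmed = trim_padding_sketch(image, comparison_op=operator.gt, thresh=0.95)
```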

wildebeest.ops.image.transforms.normalize_pixel_values(image)

Normalize image so that pixel values are between 0 and 1

Assumes pixel values are scaled between either 0 and 1 or 0 and 255.

Return type

array

wildebeest.ops.image.transforms.convert_to_grayscale(image)

Convert image to grayscale.

Assumes image is grayscale, RGB, or RGBA.

Return type

array

wildebeest.ops.image.transforms.flip_horiz(image)

Flip an image horizontally

Return type

array

wildebeest.ops.image.transforms.flip_vert(image)

Flip an image vertically

Return type

array

wildebeest.ops.image.transforms.rotate_90(image)

Rotate an image 90 degrees counterclockwise

Takes an image as a NumPy array and returns the image rotated 90 degrees counterclockwise.

Assumes that the image is rotated around its center and that the size of the image remains unchanged.

Handles 2D grayscale, RGB, and RGBA image arrays.

Return type

array

wildebeest.ops.image.transforms.rotate_180(image)

Rotate an image 180 degrees

Takes an image as a NumPy array and returns the image rotated 180 degrees.

Assumes that the image is rotated around its center and that the size of the image remains unchanged.

Handles 2D grayscale, RGB, and RGBA image arrays.

Return type

array

wildebeest.ops.image.transforms.rotate_270(image)

Rotate an image 270 degrees counterclockwise

Takes an image as a NumPy array and returns the image rotated 270 degrees counterclockwise.

Assumes that the image is rotated around its center and that the size of the image remains unchanged.

Handles 2D grayscale, RGB, and RGBA image arrays.

Return type

array

path_funcs

path_funcs

Functions that take a path and return a path

wildebeest.path_funcs.path_funcs.join_outdir_filename_extension(path, outdir, extension=None)

Construct path by combining specified outdir, filename from path, and (optionally) specified extension

Parameters
  • path (Union[Path, str]) – Path with desired filename

  • outdir (Union[Path, str]) – Desired output directory

  • extension (Optional[str]) – Desired output file extension. If None, keep extension from path.

Return type

Path
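
The path construction described above can be approximated with pathlib. This is a sketch under assumptions, not the library's code: the _sketch name is hypothetical, and taking the filename from Path(path).stem is a guess at how the filename is extracted.

```python
from pathlib import Path


def join_outdir_filename_extension_sketch(path, outdir, extension=None):
    path = Path(path)
    if extension is None:
        # Keep the extension of the input path.
        extension = path.suffix
    return Path(outdir) / (path.stem + extension)


kept = join_outdir_filename_extension_sketch("photos/cat.jpg", outdir="out")
changed = join_outdir_filename_extension_sketch(
    "photos/cat.jpg", outdir="out", extension=".png"
)
```

Fixing outdir and extension with functools.partial, as in the report_output example earlier on this page, turns a function of this shape into a one-argument path_func.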

wildebeest.path_funcs.path_funcs.join_outdir_hashed_path_extension(path, outdir, extension=None)

Construct path by combining specified outdir, filename derived by hashing path, and (optionally) specified extension.

Parameters
  • path (Union[Path, str]) – Input path to be hashed to generate output filename

  • outdir (Union[Path, str]) – Desired output directory

  • extension (Optional[str]) – Desired output extension. If None, keep extension from path.

Return type

Path
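
Hashing the whole input path is useful when files in different directories share a filename. The sketch below shows the idea with hashlib; the choice of SHA-256 and the _sketch function name are assumptions, since the hash function actually used is not stated here.

```python
import hashlib
from pathlib import Path


def join_outdir_hashed_path_extension_sketch(path, outdir, extension=None):
    if extension is None:
        extension = Path(path).suffix
    # Assumption: SHA-256 of the full path string serves as the filename.
    digest = hashlib.sha256(str(path).encode("utf-8")).hexdigest()
    return Path(outdir) / (digest + extension)


first = join_outdir_hashed_path_extension_sketch("a/img.jpg", outdir="out")
second = join_outdir_hashed_path_extension_sketch("b/img.jpg", outdir="out")
```

The two inputs share the filename img.jpg but map to different output names, so neither output overwrites the other.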

wildebeest.path_funcs.path_funcs.replace_dir(path, outdir)

Replace the directory of path with outdir.

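Parameters
  • path – Path whose directory is to be replaced

  • outdir – Directory to use in place of the directory of path

Return type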

Path

write_funcs

image

Functions that take an image and write it out

wildebeest.write_funcs.image.write_image(image, path, **kwargs)

Write image to specified path.

Create output directory if it does not exist.

Write to a temporary file in a directory “.tmp” inside the output directory and then rename the file, so that no partial image file is created if the write process is interrupted. The “.tmp” directory is not deleted, but temporary files are deleted even if there is an exception during writing or renaming.

kwargs is included only for compatibility with the CustomReportingPipeline class.

Parameters
  • image (array) – Image as a NumPy array

  • path (Union[Path, str]) – Desired output path

Return type

None
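
The write-then-rename pattern described above can be sketched with the standard library. This version writes raw bytes rather than an image, and write_bytes_atomically is a hypothetical name; it illustrates the pattern, not the library's implementation.

```python
import os
import tempfile
from pathlib import Path


def write_bytes_atomically(data, path):
    path = Path(path)
    tmp_dir = path.parent / ".tmp"
    # Creating ".tmp" with parents=True also creates the output directory.
    tmp_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        # The rename only happens once the temporary file is complete.
        os.replace(tmp_name, path)
    finally:
        # After a successful rename the temporary name no longer exists.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "out" / "image.bin"
    write_bytes_atomically(b"\x00\x01", target)
    contents = target.read_bytes()
    leftovers = list((target.parent / ".tmp").iterdir())
```

Keeping the temporary file inside the output directory means the rename stays on one filesystem, which is what lets it replace the target in a single step on POSIX systems.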

util

util

Miscellaneous utilities

wildebeest.util.util.find_files_with_extensions(search_dir, extensions)

Find files with one of the specified extensions.

Extension matching is case-insensitive.

Parameters
  • search_dir (Union[Path, str]) – Directory to search

  • extensions – Extensions to search for. The initial “.” can be included or not.

Returns

List of Path objects specifying locations of all files recursively within search_dir that have one of the extensions in extensions.

Return type

list
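
A recursive, case-insensitive search of this kind can be sketched with pathlib. The _sketch function below is a hypothetical stand-in written from the description above; it normalizes each extension so that the leading "." is optional.

```python
import tempfile
from pathlib import Path


def find_files_with_extensions_sketch(search_dir, extensions):
    # Normalize to lowercase with a leading "." so "jpg" and ".JPG" both match.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    return [
        path
        for path in Path(search_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    ]


with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "nested").mkdir()
    (Path(workdir) / "a.JPG").touch()
    (Path(workdir) / "nested" / "b.png").touch()
    (Path(workdir) / "notes.txt").touch()
    found = sorted(
        path.name
        for path in find_files_with_extensions_sketch(workdir, ["jpg", ".PNG"])
    )
```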

image

Miscellaneous utilities for images

wildebeest.util.image.find_image_files(search_dir: Union[pathlib.Path, str], *, extensions: Iterable[str] = ['.bmp', '.gif', '.ief', '.jpg', '.jpe', '.jpeg', '.png', '.svg', '.tiff', '.tif', '.ico', '.ras', '.pnm', '.pbm', '.pgm', '.ppm', '.rgb', '.xbm', '.xpm', '.xwd']) → List[pathlib.Path]

Find all image files in a directory

constants

Constants

wildebeest.constants.PathOrStr = typing.Union[pathlib.Path, str]

Path or string type