API¶
Generally, each module has a submodule that shares its name and defines generic components, and an image submodule that defines components for working with images. Items in the former are imported into the module namespace, so that you can write e.g. from wildebeest.path_funcs import combine_outdir_dirname_extension rather than from wildebeest.path_funcs.path_funcs import combine_outdir_dirname_extension.
pipelines¶
pipelines¶
Pipeline class definitions
-
class
wildebeest.pipelines.pipelines.
Pipeline
(load_func, write_func, ops=None)[source]¶ Class for defining file processing pipelines.
-
load_func
¶ Callable that takes a string or Path object as single positional argument, reads from the corresponding location, and returns some representation of its contents.
-
write_func
¶ Callable that takes the output of the last element of ops (or the output of load_func if ops is None or empty) and a string or Path object and writes the former to the location specified by the latter.
-
ops
¶ Iterable of callables each of which takes a single positional argument. The first element of ops must accept the output of load_func, and each subsequent element must accept the output of the immediately preceding element. It is recommended that every element of ops take and return one common data structure (e.g. NumPy arrays for image data) so that those elements can be recombined easily.
-
property
run_report_
Pandas DataFrame of information about the most recent run.
Stores input path in the index, output path as “outpath”, Boolean indicating whether the file was skipped as “skipped”, the repr of an exception object that was handled during processing if any as “error” (np.nan if no exception was handled), and a timestamp indicating when processing completed as “time_finished”.
May include additional custom fields in a CustomReportingPipeline.
- Raises
AttributeError – If no run report is available because pipeline has not been run.
-
__call__
(inpaths, path_func, n_jobs, skip_func=None, exceptions_to_catch=<class 'Exception'>)[source]¶ Run the pipeline.
Across n_jobs threads, for each path in inpaths, if skip_func(path, path_func(path)) is True, skip that path. Otherwise, use load_func to get the resource from that path, pipe its output through ops, and write out the result with write_func.
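The flow described above can be sketched without Wildebeest itself. The following is a minimal, single-threaded illustration; `run_one`, `load_text`, and `write_text` are hypothetical stand-ins, and the sketch omits threading, run reporting, and exception handling, so it should not be read as the library's implementation.

```python
from pathlib import Path
import tempfile


def run_one(inpath, path_func, load_func, ops, write_func, skip_func=None):
    """Process a single path: skip check, load, apply ops in order, write."""
    outpath = path_func(inpath)
    if skip_func is not None and skip_func(inpath, outpath):
        return None  # this path was skipped
    data = load_func(inpath)
    for op in ops or []:
        data = op(data)  # each op receives the previous step's output
    write_func(data, outpath)
    return outpath


def load_text(path):
    return Path(path).read_text()


def write_text(text, path):
    Path(path).write_text(text)


def to_out_txt(path):
    # Hypothetical path_func: same directory, fixed output name.
    return Path(path).with_name("out.txt")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "in.txt"
    src.write_text("hello")

    out = run_one(src, to_out_txt, load_text, [str.upper, lambda s: s + "!"], write_text)
    result = Path(out).read_text()

    # Second run: the output now exists, so this skip_func skips the path.
    skipped = run_one(
        src,
        to_out_txt,
        load_text,
        [str.upper],
        write_text,
        skip_func=lambda inpath, outpath: Path(outpath).is_file(),
    )

print(result)  # HELLO!
```

Keeping every element of ops on one data type (here, strings) is what lets the ops be reordered or reused freely, as recommended for the ops attribute.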
- Parameters
inpaths (Iterable[Union[Path, str]]) – Iterable of string or Path objects pointing to resources to be processed and written out.
path_func (Callable[[Union[Path, str]], Union[Path, str]]) – Function that takes each input path and returns the desired corresponding output path.
n_jobs (int) – Number of threads to use.
skip_func (Optional[Callable[[Union[Path, str], Any], bool]]) – Callable that takes an input path and a prospective output path and returns a Boolean indicating whether or not that file should be skipped; for instance, use lambda inpath, outpath: Path(outpath).is_file() to avoid overwriting existing files.
exceptions_to_catch (Union[Exception, Tuple[Exception], None]) – Tuple of exception types to catch. An exception of one of these types will be added to the run report, but the pipeline will continue to execute. All exceptions will be logged whether they are caught or not.
Note
Stores a run report in self.run_report_
- Return type
DataFrame
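As a rough illustration of how exceptions_to_catch and the "error" field of the run report relate, the sketch below records the repr of a caught exception and keeps going. `process_all` is a hypothetical helper, and a plain dict stands in for the DataFrame; only the "error" and "time_finished" field names come from the description above.

```python
import math
import time


def process_all(inpaths, process, exceptions_to_catch=Exception):
    """Run process on every path, recording caught exceptions instead of stopping."""
    report = {}
    for inpath in inpaths:
        error = float("nan")  # stands in for np.nan when nothing was caught
        try:
            process(inpath)
        except exceptions_to_catch as exc:
            error = repr(exc)
        report[inpath] = {"error": error, "time_finished": time.time()}
    return report


# int("x") raises ValueError, which is caught and recorded; the loop continues.
report = process_all(["1", "x", "3"], int, exceptions_to_catch=(ValueError,))

for inpath, record in report.items():
    print(inpath, record["error"])
```

An exception type that is not listed would propagate out of `process_all` in this sketch and stop the run.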
-
-
class
wildebeest.pipelines.pipelines.
CustomReportingPipeline
(load_func, write_func, ops=None)[source]¶ Class for defining file processing pipelines with custom run recording.
Differences from Pipeline parent class:
load_func, each element of ops, and write_func must each accept the string or Path object indicating the input item’s location as an additional positional argument.
Each element of ops and write_func must each accept a defaultdict(dict) object as an additional positional argument. Functions defined in Wildebeest call this item log_dict.
Inside those functions, adding items to log_dict[inpath] causes them to be added to the “run record” DataFrame that the pipeline returns.
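The log_dict mechanism can be sketched in isolation. In the minimal example below, `record_length` and `shout` are hypothetical ops, and passing inpath and log_dict by keyword is an assumption suggested by the `**kwargs` parameters documented elsewhere on this page, not something this section states.

```python
from collections import defaultdict

log_dict = defaultdict(dict)


def record_length(text, inpath, log_dict):
    # Anything stored under log_dict[inpath] would become a column in the run record.
    log_dict[inpath]["n_chars"] = len(text)
    return text


def shout(text, **kwargs):
    # An op that ignores the extra arguments can absorb them with **kwargs.
    return text.upper()


inputs = {"a.txt": "hello", "b.txt": "hi"}
outputs = {}
for inpath, text in inputs.items():
    for op in (record_length, shout):
        text = op(text, inpath=inpath, log_dict=log_dict)
    outputs[inpath] = text

print(dict(log_dict))
```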
image¶
Image-processing pipelines
load_funcs¶
load_funcs¶
Functions for loading generic data
-
wildebeest.load_funcs.load_funcs.
get_response
(url, timeout=5, **kwargs)[source]¶ Make a GET request to the specified URL and return the response.
Maintain a common session within each thread to reduce request overhead.
Note
Retry up to ten times on requests.exceptions.ConnectionError instances and 5xx status codes, waiting 2^x seconds between each retry, with a max of 10 seconds.
Log an error without retrying or raising for 403 and 404 status codes.
Raise requests.exceptions.HTTPError for other non-200 status codes.
kwargs is included only for compatibility with the CustomReportingPipeline class.
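The retry policy in the note can be written out as a small sketch. Both helpers below are hypothetical; in particular, treating x as the 1-based retry number is an assumption, since the note does not say where x starts.

```python
def backoff_waits(max_retries=10, cap_seconds=10):
    """Wait times of 2^x seconds, capped, for x = 1..max_retries (assumed indexing)."""
    return [min(2 ** attempt, cap_seconds) for attempt in range(1, max_retries + 1)]


def classify_status(status_code):
    """Map a status code to the action the note describes."""
    if status_code == 200:
        return "return response"
    if 500 <= status_code < 600:
        return "retry"
    if status_code in (403, 404):
        return "log error"
    return "raise HTTPError"


waits = backoff_waits()
print(waits)
print(classify_status(503), classify_status(404), classify_status(401))
```

Under this indexing the cap is reached on the fourth retry, so most retries wait the full 10 seconds.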
image¶
Functions for loading images
-
wildebeest.load_funcs.image.
load_image_from_url
(inpath, **kwargs)[source]¶ Download an image
kwargs is included only for compatibility with the CustomReportingPipeline class.
- Parameters
inpath (str) – Image URL
- Return type
array
-
wildebeest.load_funcs.image.
load_image_from_disk
(inpath, **kwargs)[source]¶ Load image from disk
Assumes that image is RGB(A) if it has at least three channels.
kwargs is included only for compatibility with the CustomReportingPipeline class.
- Parameters
- Raises
ValueError – If image fails to load
- Return type
array
ops¶
helpers¶
report¶
Code for reporting information calculated during processing.
-
wildebeest.ops.helpers.report.
report_output
(func_input, func, inpath, log_dict, key)[source]¶ Add the output of a function to log_dict[inpath][key].
Return the input to that function in order to pass it along within a pipeline.
Intended to be used within a CustomReportingPipeline to add the output of the function to the run report.
Examples
>>> from functools import partial
>>>
>>> from wildebeest import CustomReportingPipeline
>>> from wildebeest.load_funcs.image import load_image_from_url
>>> from wildebeest.ops import report_output
>>> from wildebeest.ops.image import calculate_mean_brightness
>>> from wildebeest.path_funcs import join_outdir_filename_extension
>>> from wildebeest.write_funcs.image import write_image
>>>
>>> report_mean_brightness = partial(
...     report_output, func=calculate_mean_brightness, key='mean_brightness'
... )
>>>
>>> report_brightness_pipeline = CustomReportingPipeline(
...     load_func=load_image_from_url, ops=[report_mean_brightness], write_func=write_image
... )
>>>
>>> image_filenames = ['2RsJ8EQ', '2TqoToT', '2VocS58', '2scKPIp', '2TsO6Pc', '2SCv0q7']
>>> image_urls = [f'https://bit.ly/{filename}' for filename in image_filenames]
>>>
>>> keep_filename_png_in_cwd = partial(
...     join_outdir_filename_extension, outdir='.', extension='.png'
... )
>>> report_brightness_pipeline(
...     inpaths=image_urls,
...     path_func=keep_filename_png_in_cwd,
...     n_jobs=1,
... )
>>> print(report_brightness_pipeline.run_report_)
                        mean_brightness  ...  time_finished
https://bit.ly/2RsJ8EQ        78.570605  ...   1.571842e+09
https://bit.ly/2SCv0q7       130.348113  ...   1.571842e+09
https://bit.ly/2TqoToT        82.677745  ...   1.571842e+09
https://bit.ly/2TsO6Pc       151.596546  ...   1.571842e+09
https://bit.ly/2VocS58        72.072578  ...   1.571842e+09
https://bit.ly/2scKPIp       117.491313  ...   1.571842e+09
- Parameters
func_input (Any) – Input to func
func (Callable) – Function whose output is to be reported
inpath (Union[Path, str]) – Input path associated with func_input
log_dict (DefaultDict[str, dict]) – Dictionary for storing function output (in log_dict[inpath][key])
key (Hashable) – Dictionary key in which to store function output for each inpath
- Returns
func_input
- Return type
Any
Note
Assigns func(func_input) to log_dict[inpath][key]
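The behavior described in the note can be mirrored by a few lines that run without Wildebeest installed. `report_output_sketch` is a hypothetical stand-in written from the description above, not the library's source.

```python
from collections import defaultdict
from functools import partial


def report_output_sketch(func_input, func, inpath, log_dict, key):
    """Store func(func_input) under log_dict[inpath][key] and pass the input along."""
    log_dict[inpath][key] = func(func_input)
    return func_input


log_dict = defaultdict(dict)

# Fix func and key up front so the pipeline only supplies the data, inpath, and log_dict.
report_length = partial(report_output_sketch, func=len, key="length")
passed_along = report_length([1, 2, 3], inpath="item-1", log_dict=log_dict)

print(passed_along, dict(log_dict))
```

Because the input is returned unchanged, a reporting step of this shape can sit anywhere in ops without affecting the data that later ops see.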
-
wildebeest.ops.helpers.report.
get_report_output_decorator
(key)[source]¶ Get a decorator that modifies a function to add its output to log_dict[inpath][key] and return its input.
Intended to be used to adapt a function for use within a CustomReportingPipeline.
Examples
>>> from functools import partial
>>>
>>> from wildebeest import CustomReportingPipeline
>>> from wildebeest.load_funcs.image import load_image_from_url
>>> from wildebeest.ops import get_report_output_decorator
>>> from wildebeest.ops.image import calculate_mean_brightness
>>> from wildebeest.path_funcs import join_outdir_filename_extension
>>> from wildebeest.write_funcs.image import write_image
>>>
>>>
>>> @get_report_output_decorator(key='mean_brightness')
... def report_mean_brightness(image):
...     return calculate_mean_brightness(image)
>>>
>>>
>>> report_brightness_pipeline = CustomReportingPipeline(
...     load_func=load_image_from_url, ops=[report_mean_brightness], write_func=write_image
... )
>>>
>>> image_filenames = ['2RsJ8EQ', '2TqoToT', '2VocS58', '2scKPIp', '2TsO6Pc', '2SCv0q7']
>>> image_urls = [f'https://bit.ly/{filename}' for filename in image_filenames]
>>>
>>> keep_filename_png_in_cwd = partial(
...     join_outdir_filename_extension, outdir='.', extension='.png'
... )
>>> report_brightness_pipeline(
...     inpaths=image_urls,
...     path_func=keep_filename_png_in_cwd,
...     n_jobs=1,
... )
>>> print(report_brightness_pipeline.run_report_)
                        mean_brightness  ...  time_finished
https://bit.ly/2RsJ8EQ        78.570605  ...   1.571843e+09
https://bit.ly/2SCv0q7       130.348113  ...   1.571843e+09
https://bit.ly/2TqoToT        82.677745  ...   1.571843e+09
https://bit.ly/2TsO6Pc       151.596546  ...   1.571843e+09
https://bit.ly/2VocS58        72.072578  ...   1.571843e+09
https://bit.ly/2scKPIp       117.491313  ...   1.571843e+09
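For readers who want to see what such a decorator might look like, here is a minimal sketch written from the description above. `get_report_output_decorator_sketch` is a hypothetical name, and the wrapper's exact signature is an assumption rather than the library's code.

```python
from collections import defaultdict
from functools import wraps


def get_report_output_decorator_sketch(key):
    """Build a decorator that logs a function's output and returns its input."""

    def decorator(func):
        @wraps(func)
        def wrapper(func_input, inpath, log_dict):
            log_dict[inpath][key] = func(func_input)
            return func_input  # pass the input along to the next op

        return wrapper

    return decorator


@get_report_output_decorator_sketch(key="total")
def report_total(values):
    return sum(values)


log_dict = defaultdict(dict)
result = report_total([1, 2, 3], inpath="item-1", log_dict=log_dict)

print(result, dict(log_dict))
```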
image¶
stats¶
Functions that record information about an image
-
wildebeest.ops.image.stats.
calculate_mean_brightness
(image)[source]¶ Calculate mean image brightness
Brightness is calculated by converting to grayscale if necessary and then taking the mean pixel value. Assumes image is grayscale, RGB, or RGBA.
- Return type
-
wildebeest.ops.image.stats.
calculate_dhash
(image, sqrt_hash_size=8)[source]¶ Calculate difference hash of image.
As a rule of thumb, with sqrt_hash_size=8, hashes from two images should typically have a Hamming distance less than 10 if and only if those images are “duplicates”, with some robustness to sources of noise such as resizing and JPEG artifacts. The Hamming distance between two hashes a and b is computed as bin(a ^ b).count("1").
Assumes image is grayscale, RGB, or RGBA.
Note
Based on Adrian Rosebrock, “Building an Image Hashing Search Engine with VP-Trees and OpenCV”, PyImageSearch, https://www.pyimagesearch.com/2019/08/26/building-an-image-hashing-search-engine-with-vp-trees-and-opencv/, accessed on 18 October 2019.
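The Hamming distance expression above can be tried on two small integers. The hash values below are made up for illustration and are far shorter than the 64-bit hashes that sqrt_hash_size=8 would produce.

```python
def hamming_distance(a, b):
    """Count the bit positions at which two integer hashes differ."""
    return bin(a ^ b).count("1")


hash_a = 0b10110110  # made-up example hash
hash_b = 0b10010111  # differs from hash_a in two bit positions

distance = hamming_distance(hash_a, hash_b)
likely_duplicates = distance < 10  # rule-of-thumb threshold from the text

print(distance, likely_duplicates)
```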
transforms¶
Functions that take an image and return a transformed image
-
wildebeest.ops.image.transforms.
resize
(image, shape=None, min_dim=None, **kwargs)[source]¶ Resize input image
shape or min_dim needs to be specified with partial before this function can be used in a Wildebeest pipeline.
kwargs is included only for compatibility with the CustomReportingPipeline class.
- Parameters
image (array) – NumPy array with two spatial dimensions and optionally an additional channel dimension
shape (Optional[Tuple[int, int]]) – Desired output shape in pixels in the form (height, width)
min_dim (Optional[int]) – Desired minimum spatial dimension in pixels; image will be resized so that it has this length along its smaller spatial dimension while preserving aspect ratio as closely as possible. Exactly one of shape and min_dim must be None.
- Return type
array
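To make the min_dim behavior concrete, the sketch below computes a target shape that keeps the aspect ratio and then fixes min_dim with partial, as the description requires before use in a pipeline. `shape_for_min_dim` is a hypothetical helper operating on shapes only; the rounding rule is an assumption, and the real function resizes pixel data.

```python
from functools import partial


def shape_for_min_dim(height, width, min_dim):
    """Scale (height, width) so the smaller side equals min_dim."""
    scale = min_dim / min(height, width)
    return round(height * scale), round(width * scale)


# Fix the configuration argument up front, leaving only the per-item input.
to_min_100 = partial(shape_for_min_dim, min_dim=100)

landscape = to_min_100(200, 400)
portrait = to_min_100(300, 150)

print(landscape, portrait)
```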
-
wildebeest.ops.image.transforms.
centercrop
(image, reduction_factor, **kwargs)[source]¶ Crop the center out of an image
kwargs is included only for compatibility with the CustomReportingPipeline class.
- Parameters
image (array) – NumPy array of an image. Function will handle 2D grayscale, RGB, and RGBA image arrays.
reduction_factor (float) – Scale of the center-cropped box relative to the full image: 1.0 keeps the full image, while 0.4 yields a box of 0.4 * width by 0.4 * height.
- Return type
array
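The effect of reduction_factor can be illustrated with nested lists in place of a NumPy array. `center_crop_sketch` is a hypothetical helper, and its rounding and centering choices are assumptions that may differ from the library's.

```python
def center_crop_sketch(image, reduction_factor):
    """Keep a centered box whose sides are reduction_factor times the original sides."""
    height, width = len(image), len(image[0])
    new_h = round(height * reduction_factor)
    new_w = round(width * reduction_factor)
    top = (height - new_h) // 2
    left = (width - new_w) // 2
    return [row[left:left + new_w] for row in image[top:top + new_h]]


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

cropped = center_crop_sketch(image, 0.5)
unchanged = center_crop_sketch(image, 1.0)

print(cropped)
```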
-
wildebeest.ops.image.transforms.
trim_padding
(image, comparison_op, thresh, **kwargs)[source]¶ Remove padding from an image
Remove rows and columns on the edges of the input image where the brightness on a scale of 0 to 1 satisfies comparison_op with respect to thresh. Brightness is evaluated by converting to grayscale and normalizing if necessary. For instance, using thresh=.95 and comparison_op=operator.gt will result in removing near-white padding, while using thresh=.05 and comparison_op=operator.lt will remove near-black padding.
kwargs is included only for compatibility with the CustomReportingPipeline class.
Assumes:
Image is grayscale, RGB, or RGBA.
Pixel values are scaled between either 0 and 1 or 0 and 255. If image is scaled between 0 and 255, then some pixel has a value greater than 1.
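A simplified version of this trimming logic is sketched below on a grayscale nested list already scaled between 0 and 1. `trim_padding_sketch` is hypothetical, and it assumes an edge row or column counts as padding only when every pixel in it satisfies the comparison; the description does not spell out that rule.

```python
import operator


def trim_padding_sketch(image, comparison_op, thresh):
    """Strip edge rows and columns whose pixels all satisfy comparison_op(pixel, thresh)."""

    def is_padding(values):
        return all(comparison_op(value, thresh) for value in values)

    rows = [list(row) for row in image]
    while rows and is_padding(rows[0]):
        rows.pop(0)  # trim from the top
    while rows and is_padding(rows[-1]):
        rows.pop()  # trim from the bottom
    while rows and rows[0] and is_padding([row[0] for row in rows]):
        rows = [row[1:] for row in rows]  # trim from the left
    while rows and rows[0] and is_padding([row[-1] for row in rows]):
        rows = [row[:-1] for row in rows]  # trim from the right
    return rows


white_padded = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.2, 0.5, 1.0],
    [1.0, 0.3, 0.1, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]

trimmed = trim_padding_sketch(white_padded, operator.gt, 0.95)
print(trimmed)
```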
-
wildebeest.ops.image.transforms.
normalize_pixel_values
(image)[source]¶ Normalize image so that pixel values are between 0 and 1
Assumes pixel values are scaled between either 0 and 1 or 0 and 255.
- Return type
array
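One way to act on the stated assumption is to treat any image containing a value greater than 1 as being on the 0 to 255 scale. The sketch below does this for nested lists; `normalize_sketch` is a hypothetical helper and is not taken from the library.

```python
def normalize_sketch(image):
    """Divide by 255 if any pixel exceeds 1; otherwise leave values as they are."""
    max_value = max(max(row) for row in image)
    scale = 255 if max_value > 1 else 1
    return [[value / scale for value in row] for row in image]


eight_bit = [[0, 51], [102, 255]]
already_scaled = [[0.0, 0.5], [1.0, 0.25]]

print(normalize_sketch(eight_bit))
print(normalize_sketch(already_scaled))
```

A heuristic like this cannot tell a 0 to 255 image whose pixels are all 0 or 1 from an image that is already normalized.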
-
wildebeest.ops.image.transforms.
convert_to_grayscale
(image)[source]¶ Convert image to grayscale.
Assumes image is grayscale, RGB, or RGBA.
- Return type
array
-
wildebeest.ops.image.transforms.
flip_horiz
(image)[source]¶ Flip an image horizontally
- Return type
array
-
wildebeest.ops.image.transforms.
flip_vert
(image)[source]¶ Flip an image vertically
- Return type
array
-
wildebeest.ops.image.transforms.
rotate_90
(image)[source]¶ Rotate an image 90 degrees counterclockwise
Takes an image as a NumPy array and returns the image rotated 90 degrees counterclockwise.
Assumes that the image is rotated around its center and that the size of the image remains unchanged.
Handles 2D grayscale, RGB, and RGBA image arrays.
- Return type
array
-
wildebeest.ops.image.transforms.
rotate_180
(image)[source]¶ Rotate an image 180 degrees
Takes an image as a NumPy array and returns the image rotated 180 degrees.
Assumes that the image is rotated around its center and that the size of the image remains unchanged.
Handles 2D grayscale, RGB, and RGBA image arrays.
- Return type
array
-
wildebeest.ops.image.transforms.
rotate_270
(image)[source]¶ Rotate an image 270 degrees counterclockwise
Takes an image as a NumPy array and returns the image rotated 270 degrees counterclockwise.
Assumes that the image is rotated around its center and that the size of the image remains unchanged.
Handles 2D grayscale, RGB, and RGBA image arrays.
- Return type
array
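The flip and rotation conventions above can be checked on a tiny nested-list "image". The functions below are hypothetical stand-ins with a `_sketch` suffix; they work on 2D lists only, and in this sketch a 2x3 input becomes 3x2 after a quarter turn.

```python
def flip_horiz_sketch(image):
    """Mirror left to right."""
    return [row[::-1] for row in image]


def flip_vert_sketch(image):
    """Mirror top to bottom."""
    return [list(row) for row in image[::-1]]


def rotate_90_sketch(image):
    """Rotate 90 degrees counterclockwise: transpose, then reverse the row order."""
    return [list(column) for column in zip(*image)][::-1]


def rotate_180_sketch(image):
    return rotate_90_sketch(rotate_90_sketch(image))


def rotate_270_sketch(image):
    return rotate_90_sketch(rotate_180_sketch(image))


image = [
    [1, 2, 3],
    [4, 5, 6],
]

print(rotate_90_sketch(image))
print(rotate_180_sketch(image))
print(rotate_270_sketch(image))
```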
path_funcs¶
path_funcs¶
Functions that take a path and return a path
-
wildebeest.path_funcs.path_funcs.
join_outdir_filename_extension
(path, outdir, extension=None)[source]¶ Construct path by combining specified outdir, filename from path, and (optionally) specified extension
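A path function of this shape could be sketched with pathlib as follows. `join_outdir_filename_extension_sketch` is a hypothetical reimplementation written from the one-line description; how the real function treats existing extensions or unusual filenames is not stated here, so those details are assumptions.

```python
from functools import partial
from pathlib import PurePosixPath


def join_outdir_filename_extension_sketch(path, outdir, extension=None):
    """Combine outdir, the filename taken from path, and optionally a new extension."""
    filename = PurePosixPath(str(path)).name
    if extension is not None:
        # Assumption: a supplied extension replaces any existing one.
        filename = PurePosixPath(filename).stem + extension
    return PurePosixPath(outdir) / filename


# As with other configurable functions, fix outdir and extension with partial.
to_png_in_out = partial(
    join_outdir_filename_extension_sketch, outdir="out", extension=".png"
)

from_disk = to_png_in_out("photos/cat.jpeg")
from_url = to_png_in_out("https://bit.ly/2RsJ8EQ")
kept_extension = join_outdir_filename_extension_sketch("photos/cat.jpeg", outdir="out")

print(from_disk, from_url, kept_extension)
```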
write_funcs¶
image¶
Functions that take an image and write it out
-
wildebeest.write_funcs.image.
write_image
(image, path, **kwargs)[source]¶ Write image to specified path.
Create output directory if it does not exist.
Write to a temporary file in a directory “.tmp” inside the output directory and then rename the file so that we don’t create a partial image file if the write process is interrupted. The “.tmp” directory is not deleted, but temporary files are deleted even if there is an exception during writing or renaming.
kwargs is included only for compatibility with the CustomReportingPipeline class.
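The write-then-rename pattern described above can be sketched with the standard library. `write_bytes_atomically` is a hypothetical helper that writes raw bytes rather than an image, and using `tempfile.mkstemp` with `os.replace` is one reasonable way to get this behavior, not necessarily the library's.

```python
import os
import tempfile
from pathlib import Path


def write_bytes_atomically(data, path):
    """Write to a temp file in <outdir>/.tmp, then rename it into place."""
    path = Path(path)
    tmp_dir = path.parent / ".tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)  # also creates the output directory
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(data)
        os.replace(tmp_name, path)  # the final path appears only once fully written
    finally:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "out" / "image.bin"
    write_bytes_atomically(b"\x00\x01\x02", target)
    written = target.read_bytes()
    leftover_tmp_files = list((target.parent / ".tmp").iterdir())
    tmp_dir_kept = (target.parent / ".tmp").is_dir()

print(written, leftover_tmp_files, tmp_dir_kept)
```

Keeping the temporary directory inside the output directory puts both paths on the same filesystem, which is what allows the rename to be a single step.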
util¶
util¶
Miscellaneous utilities
-
wildebeest.util.util.
find_files_with_extensions
(search_dir, extensions)[source]¶ Find files with one of the specified extensions.
Extension matching is case-insensitive.
- Parameters
- Returns
List of Path objects specifying locations of all files recursively within search_dir that have one of the extensions in extensions.
- Return type
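A recursive, case-insensitive search of this kind can be sketched with `pathlib.Path.rglob`. `find_files_with_extensions_sketch` is a hypothetical stand-in; whether the library expects extensions with a leading dot or returns results in a particular order is assumed here.

```python
import tempfile
from pathlib import Path


def find_files_with_extensions_sketch(search_dir, extensions):
    """Return files under search_dir whose suffix matches one of extensions, ignoring case."""
    wanted = {extension.lower() for extension in extensions}
    return [
        path
        for path in Path(search_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    ]


with tempfile.TemporaryDirectory() as root:
    root_path = Path(root)
    (root_path / "sub").mkdir()
    (root_path / "a.JPG").write_bytes(b"")
    (root_path / "sub" / "b.png").write_bytes(b"")
    (root_path / "c.txt").write_bytes(b"")

    found = find_files_with_extensions_sketch(root_path, [".jpg", ".png"])
    names = sorted(path.name for path in found)

print(names)
```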
image¶
Miscellaneous utilities for images
-
wildebeest.util.image.
find_image_files
(search_dir: Union[pathlib.Path, str], *, extensions: Iterable[str] = ['.bmp', '.gif', '.ief', '.jpg', '.jpe', '.jpeg', '.png', '.svg', '.tiff', '.tif', '.ico', '.ras', '.pnm', '.pbm', '.pgm', '.ppm', '.rgb', '.xbm', '.xpm', '.xwd']) → List[pathlib.Path]¶ Find all image files in a directory
constants¶
Constants
-
wildebeest.constants.
PathOrStr
= typing.Union[pathlib.Path, str]¶ Path or string type