galaxy.datatypes.dataproviders package

Submodules

galaxy.datatypes.dataproviders.base module

Base class(es) for all DataProviders.

class galaxy.datatypes.dataproviders.base.HasSettings(name, base_classes, attributes)[source]

Bases: type

Metaclass for data providers that allows defining and inheriting a dictionary named ‘settings’.

Useful for allowing class level access to expected variable types passed to class __init__ functions so they can be parsed from a query string.
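The idea can be illustrated with a small standalone sketch. It does not import Galaxy; the names `SettingsMeta`, `Base`, `Lines`, and `parse_query` are hypothetical, and the merging rule shown (subclass entries override inherited ones) is an assumption about how such a metaclass might behave, not a copy of the real implementation.

```python
class SettingsMeta(type):
    """Hypothetical metaclass: merges 'settings' dicts from base classes."""

    def __new__(mcs, name, bases, attributes):
        merged = {}
        # Collect inherited settings first so the subclass can override them.
        for base in reversed(bases):
            merged.update(getattr(base, "settings", {}))
        merged.update(attributes.get("settings", {}))
        attributes["settings"] = merged
        return super().__new__(mcs, name, bases, attributes)


class Base(metaclass=SettingsMeta):
    settings = {"limit": "int", "offset": "int"}


class Lines(Base):
    settings = {"comment_char": "str"}


def parse_query(provider_class, query):
    """Convert query-string values using the class-level type names."""
    parsers = {"int": int, "str": str}
    parsed = {}
    for key, raw in query.items():
        type_name = provider_class.settings.get(key)
        if type_name in parsers:
            parsed[key] = parsers[type_name](raw)
    return parsed


kwargs = parse_query(Lines, {"limit": "10", "comment_char": "#", "other": "x"})
```

Because the type names live on the class, a caller can convert raw strings before any instance exists; keys that the class does not declare are simply ignored in this sketch.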

class galaxy.datatypes.dataproviders.base.DataProvider(source, **kwargs)[source]

Bases: object

Base class for all data providers. Data providers:

  • have a source (which must be another file-like object)

  • implement both the iterator and context manager interfaces

  • do not allow write methods (but otherwise implement the other file object interface methods)
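The three properties above can be sketched with a small standalone class. `ReadOnlyProvider` is a hypothetical stand-in written for illustration; raising `NotImplementedError` from `write` and closing the source on exit are assumptions, not confirmed details of the Galaxy class.

```python
import io


class ReadOnlyProvider:
    """Hypothetical stand-in: iterable, context manager, no writes."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        # Hand back whatever the wrapped source yields.
        yield from self.source

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the source if it supports closing (assumed behavior).
        if hasattr(self.source, "close"):
            self.source.close()
        return False

    def write(self, string):
        # Write methods are refused (exception type is an assumption).
        raise NotImplementedError("providers are read-only")


with ReadOnlyProvider(io.StringIO("a\nb\n")) as provider:
    lines = list(provider)
```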

settings: Dict[str, str] = {}
__init__(source, **kwargs)[source]

Sets up a data provider, validates supplied source.

Parameters:

source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)

validate_source(source)[source]

Is this a valid source for this provider?

Raises:

InvalidDataProviderSource – if the source is considered invalid.

Meant to be overridden in subclasses.

truncate(size)[source]
write(string)[source]
writelines(sequence)[source]
readlines()[source]
class galaxy.datatypes.dataproviders.base.FilteredDataProvider(source, filter_fn=None, **kwargs)[source]

Bases: DataProvider

Passes each datum through a filter function and yields it if that function returns a non-None value.

Also maintains counters:
  • num_data_read: how many data have been consumed from the source.

  • num_valid_data_read: how many data have been returned from filter.

  • num_data_returned: how many data this provider has yielded.

__init__(source, filter_fn=None, **kwargs)[source]
Parameters:

filter_fn – a lambda or function that will be passed a datum and return either the (optionally modified) datum or None.

filter(datum)[source]

When given a datum from the provider’s source, return None if the datum ‘does not pass’ the filter or is invalid. Return the datum if it’s valid.

Parameters:

datum – the datum to check for validity.

Returns:

the datum, a modified datum, or None

Meant to be overridden.
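A minimal standalone sketch of the filter-and-count pattern described for this class. `CountingFilter` and `keep_even_doubled` are hypothetical names; in this simplified version the valid and returned counters always agree, which may not hold for the real provider or its subclasses.

```python
class CountingFilter:
    """Hypothetical sketch: yield data whose filter result is not None."""

    def __init__(self, source, filter_fn=None):
        self.source = source
        # With no filter function, every datum passes unchanged.
        self.filter_fn = filter_fn or (lambda datum: datum)
        self.num_data_read = 0
        self.num_valid_data_read = 0
        self.num_data_returned = 0

    def __iter__(self):
        for datum in self.source:
            self.num_data_read += 1
            result = self.filter_fn(datum)
            if result is None:
                continue
            self.num_valid_data_read += 1
            self.num_data_returned += 1
            yield result


def keep_even_doubled(n):
    # Return a modified datum for even numbers, None to drop odd ones.
    return n * 2 if n % 2 == 0 else None


provider = CountingFilter(range(5), keep_even_doubled)
results = list(provider)
```

Note that the test is against `None` specifically: the falsy value `0` still passes the filter here.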

settings: Dict[str, str] = {}
class galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider(source, offset=0, limit=None, **kwargs)[source]

Bases: FilteredDataProvider

A provider that uses the counters from FilteredDataProvider to limit the number of data and/or skip offset number of data before providing.

Useful for grabbing sections from a source (e.g. pagination).

settings: Dict[str, str] = {'limit': 'int', 'offset': 'int'}
__init__(source, offset=0, limit=None, **kwargs)[source]
Parameters:
  • offset – the number of data to skip before providing.

  • limit – the final number of data to provide.
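As a rough illustration of offset/limit paging, the standalone sketch below uses `itertools.islice`. The helper name `limit_offset` is hypothetical, and the sketch skips raw items; how the real provider counts skipped data relative to its filter is not shown here.

```python
from itertools import islice


def limit_offset(source, offset=0, limit=None):
    """Skip `offset` items, then yield at most `limit` items."""
    stop = None if limit is None else offset + limit
    return islice(source, offset, stop)


# A "page" of five items starting after the first twenty.
page = list(limit_offset(range(100), offset=20, limit=5))

# With no limit, everything after the offset is provided.
tail = list(limit_offset(range(5), offset=3))
```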

class galaxy.datatypes.dataproviders.base.MultiSourceDataProvider(source_list, **kwargs)[source]

Bases: DataProvider

A provider that iterates over a list of given sources and provides data from one after another.

An iterator over iterators.

__init__(source_list, **kwargs)[source]
Parameters:

source_list – an iterator of iterables
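The "iterator over iterators" idea corresponds closely to `itertools.chain.from_iterable`. The sketch below is standalone and the function name `multi_source` is hypothetical; it only shows the ordering of the output, not the real class's validation or counters.

```python
from itertools import chain


def multi_source(source_list):
    """Yield every datum from each source, one source after another."""
    return chain.from_iterable(source_list)


combined = list(multi_source([["a", "b"], iter(["c"]), ("d",)]))
```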

settings: Dict[str, str] = {}

galaxy.datatypes.dataproviders.chunk module

Chunk (N number of bytes at M offset to a source’s beginning) provider.

Primarily for file sources but usable by any iterator that has both seek and read( N ).

class galaxy.datatypes.dataproviders.chunk.ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]

Bases: DataProvider

Data provider that yields chunks of data from its file.

Note: this version does not account for lines and works with Binary datatypes.

MAX_CHUNK_SIZE = 65536
DEFAULT_CHUNK_SIZE = 65536
settings: Dict[str, str] = {'chunk_index': 'int', 'chunk_size': 'int'}
__init__(source, chunk_index=0, chunk_size=65536, **kwargs)[source]
Parameters:
  • chunk_index – if a source can be divided into N number of chunk_size sections, this is the index of which section to return.

  • chunk_size – how large the returned chunks should be (generally in bytes).
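A standalone sketch of how `chunk_index` and `chunk_size` could map to a byte range, assuming the chunk starts at `chunk_index * chunk_size`. The helper `read_chunk` is hypothetical and works on any object with `seek` and `read`.

```python
import io


def read_chunk(source, chunk_index=0, chunk_size=65536):
    """Return the chunk_index-th section of chunk_size bytes."""
    # Assumed layout: chunks are contiguous and start at offset 0.
    source.seek(chunk_index * chunk_size)
    return source.read(chunk_size)


data = io.BytesIO(b"abcdefghij")
second = read_chunk(data, chunk_index=1, chunk_size=4)
last = read_chunk(data, chunk_index=2, chunk_size=4)
```

The final chunk may be shorter than `chunk_size` when the source length is not an exact multiple of it.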

validate_source(source)[source]

Does the given source have both the methods seek and read?

Raises:

InvalidDataProviderSource – if not.

encode(chunk)[source]

Called on the chunk before returning.

Override to modify, encode, or decode chunks.

class galaxy.datatypes.dataproviders.chunk.Base64ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]

Bases: ChunkDataProvider

Data provider that yields chunks of base64 encoded data from its file.

encode(chunk)[source]

Return chunks encoded in base 64.
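For illustration, base64-encoding a chunk with the standard library looks like the sketch below. The function name `encode_chunk` is hypothetical, and whether the real method returns bytes or text is not stated here.

```python
import base64


def encode_chunk(chunk):
    """Return the given bytes encoded in base 64 (as bytes)."""
    return base64.b64encode(chunk)


encoded = encode_chunk(b"hello")
decoded = base64.b64decode(encoded)
```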

settings: Dict[str, str] = {'chunk_index': 'int', 'chunk_size': 'int'}

galaxy.datatypes.dataproviders.column module

Providers that provide lists of lists generally where each line of a source is further subdivided into multiple data (e.g. columns from a line).

class galaxy.datatypes.dataproviders.column.ColumnarDataProvider(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]

Bases: RegexLineDataProvider

Data provider that provides a list of columns from the lines of its source.

Columns are returned in the order given in indeces, so this provider can re-arrange columns.

If any desired index is outside the actual number of columns in the source, this provider will None-pad the output and you are guaranteed the same number of columns as the number of indeces asked for (even if they are filled with None).
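The re-ordering and None-padding behavior described above can be sketched in a few lines. `pick_columns` is a hypothetical helper, not part of Galaxy, and it skips the type parsing that the real provider performs.

```python
def pick_columns(line, indeces, deliminator="\t"):
    """Return the requested columns in the requested order, None-padded."""
    columns = line.split(deliminator)
    return [columns[i] if i < len(columns) else None for i in indeces]


# Index 5 does not exist in a three-column line, so it becomes None.
row = pick_columns("chr1\t100\t200", [2, 0, 5])
```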

settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]
Parameters:
  • indeces (list or None) – a list of indeces of columns to gather from each row. Optional: will default to None. If None, this provider will return all rows (even when a particular row contains more/less than others). If a row/line does not contain an element at a given index, the provider will fill in a None value as the element.

  • column_count (int) – an alternate means of defining indeces, use an int here to effectively provide the first N columns. Optional: will default to None.

  • column_types (list of strings) – a list of string names of types that the provider will use to look up an appropriate parser for the column. (e.g. ‘int’, ‘float’, ‘str’, ‘bool’) Optional: will default to parsing all columns as strings.

  • parsers (dictionary) – a dictionary keyed with column type strings and with values that are functions to use when parsing those types. Optional: will default to using the function _get_default_parsers.

  • parse_columns (bool) – attempt to parse columns? Optional: defaults to True.

  • deliminator (str) – character(s) used to split each row/line of the source. Optional: defaults to the tab character.

Note

The superclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.

parse_filter(filter_param_str)[source]
create_numeric_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • lt: less than

  • le: less than or equal to

  • eq: equal to

  • ne: not equal to

  • ge: greater than or equal to

  • gt: greater than

val is cast to float here; None is returned if there’s a parsing error.
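A standalone sketch of a numeric filter factory along these lines, using the `operator` module. The name `make_numeric_filter` and the `NUMERIC_OPS` table are hypothetical; the sketch assumes the column has already been parsed to a number.

```python
import operator

# Hypothetical lookup table from op name to comparison function.
NUMERIC_OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def make_numeric_filter(column, op, val):
    """Return a predicate over parsed columns, or None if it can't be built."""
    try:
        val = float(val)
    except ValueError:
        return None
    compare = NUMERIC_OPS.get(op)
    if compare is None:
        return None
    return lambda columns: compare(columns[column], val)


keep = make_numeric_filter(1, "ge", "100")
```

Here `keep(["chr1", 150.0])` is true and `keep(["chr1", 50.0])` is false, while an unparseable `val` or an unknown `op` yields `None` instead of a function.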

create_string_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • eq: exactly matches

  • has: the column contains the substring val

  • re: the column matches the regular expression in val
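The three string operators can be sketched as follows. `make_string_filter` is a hypothetical name, and the use of `re.match` (anchored at the start of the column) is an assumption; the text above does not say whether the real method matches or searches.

```python
import re


def make_string_filter(column, op, val):
    """Return a predicate over parsed columns, or None for an unknown op."""
    if op == "eq":
        return lambda columns: columns[column] == val
    if op == "has":
        return lambda columns: val in columns[column]
    if op == "re":
        pattern = re.compile(val)
        # Assumption: match from the start of the column value.
        return lambda columns: pattern.match(columns[column]) is not None
    return None


has_hr = make_string_filter(0, "has", "hr")
is_chrom = make_string_filter(0, "re", r"chr\d+")
```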

create_list_filter(column, op, val)[source]

Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.

The function will compare the column at index column against val using the given op where op is one of:

  • eq: the list val exactly matches the list in the column

  • has: the list in the column contains the sublist val

get_default_parsers()[source]

Return parser dictionary keyed for each columnar type (as defined in datatypes).

Note

primitives only by default (str, int, float, boolean, None). Other (more complex) types are retrieved as strings.

Returns:

a dictionary of the form: { <parser type name> : <function used to parse type> }

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

parse_columns_from_line(line)[source]

Returns a list of the desired, parsed columns.

Parameters:

line (str) – the line to parse

parse_column_at_index(columns, parser_index, index)[source]

Get the column type for the parser from self.column_types or None if the type is unavailable.

parse_value(val, type)[source]

Attempt to parse and return the given value based on the given type.

Parameters:
  • val – the column value to parse (often a string)

  • type – the string type ‘name’ used to find the appropriate parser

Returns:

the parsed value or value if no type found in parsers or None if there was a parser error (ValueError)
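The three outcomes listed above (parsed value, unchanged value, or None) can be illustrated with a standalone sketch. `parse_with_type` and the `PARSERS` dictionary are hypothetical and much smaller than whatever the real default parsers contain.

```python
# Hypothetical parser table keyed by type name.
PARSERS = {"int": int, "float": float, "str": str}


def parse_with_type(val, type_name, parsers=PARSERS):
    """Parse val by type name; pass it through or return None on failure."""
    parser = parsers.get(type_name)
    if parser is None:
        # No parser registered for this type: return the value unchanged.
        return val
    try:
        return parser(val)
    except ValueError:
        return None


parsed = parse_with_type("42", "int")
failed = parse_with_type("forty-two", "int")
untouched = parse_with_type("a,b", "list")
```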

get_column_type(index)[source]

Get the column type for the parser from self.column_types or None if the type is unavailable.

Parameters:

index – the column index

Returns:

string name of type (e.g. ‘float’, ‘int’, etc.)

filter_by_columns(columns)[source]
class galaxy.datatypes.dataproviders.column.DictDataProvider(source, column_names=None, **kwargs)[source]

Bases: ColumnarDataProvider

Data provider that zips column_names and columns from the source’s contents into a dictionary.

A combination use of both column_names and indeces allows ‘picking’ key/value pairs from the source.

Note

The superclass constructors are passed kwargs - so their params (limit, offset, etc.) are also applicable here.

settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, column_names=None, **kwargs)[source]
Parameters:

column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. The number of key, value pairs each returned dictionary has will be as short as the number of column names provided.
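The truncation described above is the same behavior as Python's built-in `zip`, which stops at the shorter input. A standalone sketch, with the hypothetical helper name `to_record`:

```python
def to_record(column_names, columns):
    """Pair names with column values; extra columns are dropped."""
    return dict(zip(column_names, columns))


# Two names and three columns give a two-key dictionary.
record = to_record(["chrom", "start"], ["chr1", 100, 200])
```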

galaxy.datatypes.dataproviders.dataset module

galaxy.datatypes.dataproviders.decorators module

galaxy.datatypes.dataproviders.exceptions module

DataProvider related exceptions.

exception galaxy.datatypes.dataproviders.exceptions.InvalidDataProviderSource(source=None, msg='')[source]

Bases: TypeError

Raised when an unusable source is passed to a provider.

__init__(source=None, msg='')[source]
exception galaxy.datatypes.dataproviders.exceptions.NoProviderAvailable(factory_source, format_requested=None, msg='')[source]

Bases: TypeError

Raised when no provider is found for the given format_requested.

Parameters:
  • factory_source – the item that the provider was requested from

  • format_requested – the format_requested (a hashable key to access factory_source.datatypes with)

Both params are attached to this class and accessible to the try-catch receiver.

Meant to be used within a class that builds dataproviders (e.g. a Datatype)

__init__(factory_source, format_requested=None, msg='')[source]

galaxy.datatypes.dataproviders.external module

galaxy.datatypes.dataproviders.hierarchy module

galaxy.datatypes.dataproviders.line module

Dataproviders that iterate over lines from their sources.

class galaxy.datatypes.dataproviders.line.FilteredLineDataProvider(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]

Bases: LimitedOffsetDataProvider

Data provider that yields lines of data from its source allowing optional control over which line to start on and how many lines to return.

DEFAULT_COMMENT_CHAR = '#'
settings: Dict[str, str] = {'comment_char': 'str', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]
Parameters:
  • strip_lines (bool) – remove whitespace from the beginning and ending of each line (or not). Optional: defaults to True

  • strip_newlines – remove newlines only (only functions when strip_lines is false). Optional: defaults to False

  • provide_blank (bool) – are empty lines considered valid and provided? Optional: defaults to False

  • comment_char (str) – character(s) that indicate a line isn’t data (a comment) and should not be provided. Optional: defaults to ‘#’
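How these four options might interact is sketched below as a standalone function. `filter_line` here is a hypothetical helper, and the order of the checks (strip, then blank test, then comment test) is an assumption rather than a documented guarantee.

```python
def filter_line(line, strip_lines=True, strip_newlines=False,
                provide_blank=False, comment_char="#"):
    """Return the (possibly stripped) line, or None if it is not data."""
    if strip_lines:
        line = line.strip()
    elif strip_newlines:
        # Only consulted when strip_lines is False.
        line = line.strip("\n")
    if not provide_blank and line == "":
        return None
    if comment_char and line.startswith(comment_char):
        return None
    return line


raw = ["# header\n", "\n", "  chr1\t1\n"]
kept = [filter_line(line) for line in raw]
```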

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

class galaxy.datatypes.dataproviders.line.RegexLineDataProvider(source, regex_list=None, invert=False, **kwargs)[source]

Bases: FilteredLineDataProvider

Data provider that yields only those lines of data from its source that do (or do not when invert is True) match one or more of the given list of regexes.

Note

the regex matches are effectively OR’d (if any regex matches the line it is considered valid and will be provided).
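The OR semantics and the effect of invert can be sketched with `any`. The function name `match_any` is hypothetical, and `re.match` (anchored at the start of the line) is an assumption about the matching mode.

```python
import re


def match_any(line, regex_list, invert=False):
    """Return the line if any regex matches (or none, when inverted)."""
    matched = any(re.match(regex, line) for regex in regex_list)
    if invert:
        matched = not matched
    return line if matched else None


patterns = [r"chr\d", r"scaffold"]
hit = match_any("chr1 10", patterns)
miss = match_any("contig7 10", patterns)
inverted = match_any("contig7 10", patterns, invert=True)
```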

settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}
__init__(source, regex_list=None, invert=False, **kwargs)[source]
Parameters:
  • regex_list (list (of str)) – list of strings or regular expression strings that will be matched to each line. Optional: defaults to None (no matching)

  • invert (bool) – if True will provide only lines that do not match. Optional: defaults to False

filter(line)[source]

Determines whether to provide line or not.

Parameters:

line (str) – the incoming line from the source

Returns:

a line or None

filter_by_regex(line)[source]
class galaxy.datatypes.dataproviders.line.BlockDataProvider(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]

Bases: LimitedOffsetDataProvider

Class that uses formats where multiple lines combine to describe a single datum. The data output will be a list of either map/dicts or sub-arrays.

Uses FilteredLineDataProvider as its source (kwargs not passed).

e.g. Fasta, GenBank, MAF, hg log

Note: memory intensive (gathers list of lines before output)
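The line-to-block aggregation can be illustrated with a standalone generator over FASTA-like input. `iter_blocks` is a hypothetical helper; it returns each block as a plain list of lines and leaves out the block filtering and limit/offset handling of the real class.

```python
def iter_blocks(lines, is_new_block):
    """Group lines into blocks, starting a new block when is_new_block(line)."""
    block = []
    for line in lines:
        if is_new_block(line) and block:
            # The previous block is complete, so provide it.
            yield block
            block = []
        block.append(line)
    # Provide whatever remains after the main loop (the last block).
    if block:
        yield block


fasta = [">seq1", "ACGT", "TTGA", ">seq2", "GGCC"]
blocks = list(iter_blocks(fasta, lambda line: line.startswith(">")))
```

Each block is held in memory until the next delimiter line appears, which is the reason for the memory note above.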

__init__(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]
Parameters:
  • new_block_delim_fn (function) – T/F function to determine whether a given line is the start of a new block.

  • block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)

init_new_block()[source]

Set up internal data for next block.

filter(line)[source]

Line filter here being used to aggregate/assemble lines into a block and determine whether the line indicates a new block.

Parameters:

line (str) – the incoming line from the source

Returns:

a block or None

is_new_block(line)[source]

Returns True if the given line indicates the start of a new block (and the current block should be provided) or False if not.

add_line_to_block(line)[source]

Integrate the given line into the current block.

Called per line.

assemble_current_block()[source]

Build the current data into a block.

Called per block (just before providing).

filter_block(block)[source]

Is the current block a valid/desired datum?

Called per block (just before providing).

handle_last_block()[source]

Handle any blocks remaining after the main loop.

settings: Dict[str, str] = {'limit': 'int', 'offset': 'int'}