galaxy.datatypes.dataproviders package¶
Submodules¶
galaxy.datatypes.dataproviders.base module¶
Base class(es) for all DataProviders.
- class galaxy.datatypes.dataproviders.base.HasSettings(name, base_classes, attributes)[source]¶
Bases: type
Metaclass for data providers that allows defining and inheriting a dictionary named ‘settings’.
Useful for allowing class level access to expected variable types passed to class __init__ functions so they can be parsed from a query string.
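The idea of an inheritable settings dictionary can be sketched with a small metaclass. This is an illustrative stand-in and not Galaxy's implementation; SettingsMeta, Base, Child and parse_query are hypothetical names, and the merge order is an assumption.

```python
class SettingsMeta(type):
    """Hypothetical metaclass: merges a class's `settings` with those of its bases."""

    def __new__(mcs, name, bases, attributes):
        merged = {}
        # Walk the bases from last to first so that earlier bases take precedence.
        for base in reversed(bases):
            merged.update(getattr(base, "settings", {}))
        # Settings declared on the class itself override inherited ones.
        merged.update(attributes.get("settings", {}))
        attributes["settings"] = merged
        return super().__new__(mcs, name, bases, attributes)


class Base(metaclass=SettingsMeta):
    settings = {"limit": "int", "offset": "int"}


class Child(Base):
    settings = {"comment_char": "str"}


def parse_query(cls, query):
    """Cast query-string values using the type names listed in cls.settings."""
    casts = {"int": int, "str": str}
    return {
        key: casts[cls.settings[key]](value)
        for key, value in query.items()
        if key in cls.settings
    }


print(Child.settings)
print(parse_query(Child, {"limit": "10", "comment_char": "#", "unknown": "x"}))
```

Keys that are not declared in settings are ignored by this sketch, so only expected parameters are parsed from the query.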
- class galaxy.datatypes.dataproviders.base.DataProvider(source, **kwargs)[source]¶
Bases: object
Base class for all data providers. Data providers:
have a source (which must be another file-like object)
implement both the iterator and context manager interfaces
do not allow write methods (but otherwise implement the other file object interface methods)
- __init__(source, **kwargs)[source]¶
Sets up a data provider, validates supplied source.
- Parameters:
source – the source that this iterator will loop over. (Should implement the iterable interface and ideally have the context manager interface as well)
- validate_source(source)[source]¶
Is this a valid source for this provider?
- Raises:
InvalidDataProviderSource – if the source is considered invalid.
Meant to be overridden in subclasses.
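The contract described above (validated source, iterator and context manager interfaces, no writing) might look roughly like the following. It is a standalone sketch under stated assumptions, not the Galaxy class: SketchProvider and InvalidSource are hypothetical names, and treating any iterable as a valid source is an assumption.

```python
import io


class InvalidSource(TypeError):
    """Hypothetical stand-in for InvalidDataProviderSource."""


class SketchProvider:
    """Read-only wrapper exposing the iterator and context manager interfaces."""

    def __init__(self, source):
        self.source = self.validate_source(source)

    def validate_source(self, source):
        # Assumption: anything iterable counts as a usable source.
        if not hasattr(source, "__iter__"):
            raise InvalidSource(f"not iterable: {source!r}")
        return source

    def __iter__(self):
        yield from self.source

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Close the source if it supports closing; do not swallow exceptions.
        if hasattr(self.source, "close"):
            self.source.close()
        return False

    def write(self, *args):
        raise NotImplementedError("providers are read-only")


with SketchProvider(io.StringIO("a\nb\n")) as provider:
    lines = list(provider)
```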
- class galaxy.datatypes.dataproviders.base.FilteredDataProvider(source, filter_fn=None, **kwargs)[source]¶
Bases: DataProvider
Passes each datum through a filter function and yields it if that function returns a non-None value.
- Also maintains counters:
num_data_read: how many data have been consumed from the source.
num_valid_data_read: how many data have been returned from filter.
num_data_returned: how many data this provider has yielded.
- __init__(source, filter_fn=None, **kwargs)[source]¶
- Parameters:
filter_fn – a lambda or function that will be passed a datum and return either the (optionally modified) datum or None.
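A minimal sketch of the filter-and-count behaviour, assuming a plain iterable source. FilteredSketch and even_doubled are hypothetical names; only the counter names come from the description above.

```python
class FilteredSketch:
    """Yields each datum for which filter_fn returns a non-None value."""

    def __init__(self, source, filter_fn=None):
        self.source = source
        # Without a filter, every datum passes through unchanged.
        self.filter_fn = filter_fn or (lambda datum: datum)
        self.num_data_read = 0
        self.num_valid_data_read = 0
        self.num_data_returned = 0

    def __iter__(self):
        for datum in self.source:
            self.num_data_read += 1
            filtered = self.filter_fn(datum)
            if filtered is None:
                continue
            self.num_valid_data_read += 1
            self.num_data_returned += 1
            yield filtered


def even_doubled(n):
    """Return the doubled value for even numbers, None otherwise."""
    return n * 2 if n % 2 == 0 else None


provider = FilteredSketch(range(5), even_doubled)
result = list(provider)
```

Note that the filter may modify the datum (here by doubling it), and that a falsy value such as 0 still counts as valid because only None is rejected.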
- class galaxy.datatypes.dataproviders.base.LimitedOffsetDataProvider(source, offset=0, limit=None, **kwargs)[source]¶
Bases: FilteredDataProvider
A provider that uses the counters from FilteredDataProvider to limit the number of data and/or skip offset number of data before providing.
Useful for grabbing sections from a source (e.g. pagination).
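The offset and limit logic can be illustrated with a small generator. This is a sketch of the pagination idea only; limited_offset is a hypothetical helper, and the counters are local variables here rather than provider attributes.

```python
def limited_offset(source, offset=0, limit=None):
    """Skip `offset` data, then yield at most `limit` data (all if limit is None)."""
    num_valid_data_read = 0
    num_data_returned = 0
    for datum in source:
        # Stop as soon as the requested number of data has been returned.
        if limit is not None and num_data_returned >= limit:
            break
        num_valid_data_read += 1
        # Only start providing once `offset` data have been skipped.
        if num_valid_data_read > offset:
            num_data_returned += 1
            yield datum


# Third "page" of five items when paging through 100 data.
page = list(limited_offset(range(100), offset=10, limit=5))
```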
- class galaxy.datatypes.dataproviders.base.MultiSourceDataProvider(source_list, **kwargs)[source]¶
Bases: DataProvider
A provider that iterates over a list of given sources and provides data from one after another.
An iterator over iterators.
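Iterating over several sources in sequence is what the standard library's itertools.chain does, so the idea can be sketched in a few lines. multi_source is a hypothetical helper, not part of Galaxy.

```python
from itertools import chain


def multi_source(source_list):
    """Yield every datum of the first source, then of the second, and so on."""
    return chain.from_iterable(source_list)


combined = list(multi_source([["a", "b"], iter(["c"]), []]))
```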
galaxy.datatypes.dataproviders.chunk module¶
Provider of chunks (N bytes at offset M from the beginning of a source).
Primarily for file sources but usable by any iterator that has both seek and read( N ).
- class galaxy.datatypes.dataproviders.chunk.ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶
Bases: DataProvider
Data provider that yields chunks of data from its file.
Note: this version does not account for lines and works with Binary datatypes.
- MAX_CHUNK_SIZE = 65536¶
- DEFAULT_CHUNK_SIZE = 65536¶
- __init__(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶
- Parameters:
chunk_index – if a source can be divided into N number of chunk_size sections, this is the index of which section to return.
chunk_size – the size of each chunk to return (generally in bytes).
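Reading one chunk from a seekable source can be sketched as below. The sketch assumes that the byte offset is chunk_index multiplied by chunk_size, which is one reading of the parameter descriptions above; read_chunk is a hypothetical helper.

```python
import io


def read_chunk(source, chunk_index=0, chunk_size=65536):
    """Return the chunk_index-th section of chunk_size bytes from a seekable source."""
    # Assumption: chunks are laid out back to back from the start of the source.
    source.seek(chunk_index * chunk_size)
    return source.read(chunk_size)


data = io.BytesIO(b"abcdefghij")
second = read_chunk(data, chunk_index=1, chunk_size=4)
last = read_chunk(data, chunk_index=2, chunk_size=4)
```

The final chunk may be shorter than chunk_size, and an index past the end of the source yields an empty result.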
- class galaxy.datatypes.dataproviders.chunk.Base64ChunkDataProvider(source, chunk_index=0, chunk_size=65536, **kwargs)[source]¶
Bases: ChunkDataProvider
Data provider that yields chunks of base64 encoded data from its file.
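Base64 encoding lets a binary chunk travel safely inside text formats such as JSON. The following sketch uses the standard library's base64 module on a single chunk; read_base64_chunk is a hypothetical helper, and the offset calculation is the same assumption as for plain chunks (chunk_index multiplied by chunk_size).

```python
import base64
import io


def read_base64_chunk(source, chunk_index=0, chunk_size=65536):
    """Read one chunk from a seekable binary source and return it base64 encoded."""
    source.seek(chunk_index * chunk_size)
    return base64.b64encode(source.read(chunk_size))


binary = io.BytesIO(bytes(range(256)))
encoded = read_base64_chunk(binary, chunk_index=1, chunk_size=16)
decoded = base64.b64decode(encoded)
```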
galaxy.datatypes.dataproviders.column module¶
Providers that generally provide lists of lists, where each line of a source is further subdivided into multiple data (e.g. columns from a line).
- class galaxy.datatypes.dataproviders.column.ColumnarDataProvider(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]¶
Bases: RegexLineDataProvider
Data provider that provides a list of columns from the lines of its source.
Columns are returned in the order given in indeces, so this provider can re-arrange columns.
If any desired index is outside the actual number of columns in the source, this provider will None-pad the output and you are guaranteed the same number of columns as the number of indeces asked for (even if they are filled with None).
- settings: Dict[str, str] = {'column_count': 'int', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
- __init__(source, indeces=None, column_count=None, column_types=None, parsers=None, parse_columns=True, deliminator='\t', filters=None, **kwargs)[source]¶
- Parameters:
indeces (list or None) – a list of indeces of columns to gather from each row. Optional: will default to None. If None, this provider will return all columns of each row (even when a particular row contains more or fewer columns than others). If a row/line does not contain an element at a given index, the provider will return None as that element.
column_count (int) – an alternate means of defining indeces, use an int here to effectively provide the first N columns. Optional: will default to None.
column_types (list of strings) – a list of string names of types that the provider will use to look up an appropriate parser for the column. (e.g. ‘int’, ‘float’, ‘str’, ‘bool’) Optional: will default to parsing all columns as strings.
parsers (dictionary) – a dictionary keyed with column type strings and with values that are functions to use when parsing those types. Optional: will default to the dictionary returned by get_default_parsers.
parse_columns (bool) – attempt to parse columns? Optional: defaults to True.
deliminator (str) – character(s) used to split each row/line of the source. Optional: defaults to the tab character.
Note
The superclass constructors are passed kwargs, so their params (limit, offset, etc.) are also applicable here.
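Column selection with re-ordering and None-padding can be illustrated as follows. This is a sketch of the described behaviour only; select_columns is a hypothetical helper, and it assumes non-negative indeces.

```python
def select_columns(line, indeces=None, deliminator="\t"):
    """Split a line and return the requested columns, padding with None."""
    columns = line.split(deliminator)
    if indeces is None:
        # No selection requested: return every column found in the line.
        return columns
    # Output order follows `indeces`; missing columns become None.
    return [columns[i] if i < len(columns) else None for i in indeces]


row = select_columns("chr1\t100\t200", indeces=[2, 0, 5])
```

The result always has as many elements as indeces, which keeps rows of a ragged source the same length.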
- create_numeric_filter(column, op, val)[source]¶
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
lt: less than
le: less than or equal to
eq: equal to
ne: not equal to
ge: greater than or equal to
gt: greater than
val is cast to float here; the method returns None if there’s a parsing error.
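A filter factory along these lines can be built on the standard operator module. The sketch below is not Galaxy's code; make_numeric_filter and NUMERIC_OPS are hypothetical names, and the columns are assumed to be already parsed into numbers.

```python
import operator

# Map the op names listed above onto the standard comparison functions.
NUMERIC_OPS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def make_numeric_filter(column, op, val):
    """Return a function testing columns[column] against val, or None."""
    try:
        val = float(val)
    except ValueError:
        return None  # val could not be parsed as a number
    compare = NUMERIC_OPS.get(op)
    if compare is None:
        return None  # unknown operator name
    return lambda columns: compare(columns[column], val)


at_least_ten = make_numeric_filter(1, "ge", "10")
rows = [["a", 12.0], ["b", 3.0], ["c", 10.0]]
kept = [row for row in rows if at_least_ten(row)]
```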
- create_string_filter(column, op, val)[source]¶
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: exactly matches
has: the column contains the substring val
re: the column matches the regular expression in val
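The three string operators can be sketched in the same style. make_string_filter is a hypothetical helper, and using re.match (anchored at the start of the column) for the re operator is an assumption; the description does not say whether a match or a search is performed.

```python
import re


def make_string_filter(column, op, val):
    """Return a function testing columns[column] against val, or None."""
    if op == "eq":
        return lambda columns: columns[column] == val
    if op == "has":
        return lambda columns: val in columns[column]
    if op == "re":
        pattern = re.compile(val)
        # Assumption: the pattern must match from the start of the column.
        return lambda columns: pattern.match(columns[column]) is not None
    return None  # unknown operator name


is_numbered_chrom = make_string_filter(0, "re", r"chr\d+$")
contains_hr = make_string_filter(0, "has", "hr")
```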
- create_list_filter(column, op, val)[source]¶
Return an anonymous filter function that will be passed the array of parsed columns. Return None if no filter function can be created for the given params.
The function will compare the column at index column against val using the given op where op is one of:
eq: the list val exactly matches the list in the column
has: the list in the column contains the sublist val
- get_default_parsers()[source]¶
Return parser dictionary keyed for each columnar type (as defined in datatypes).
Note
Primitives only by default (str, int, float, boolean, None). Other (more complex) types are retrieved as strings.
- Returns:
a dictionary of the form: { <parser type name> : <function used to parse type> }
- filter(line)[source]¶
Determines whether to provide line or not.
- Parameters:
line (str) – the incoming line from the source
- Returns:
a line or None
- parse_columns_from_line(line)[source]¶
Returns a list of the desired, parsed columns.
- Parameters:
line (str) – the line to parse
- parse_column_at_index(columns, parser_index, index)[source]¶
Get the column type for the parser from self.column_types or None if the type is unavailable.
- parse_value(val, type)[source]¶
Attempt to parse and return the given value based on the given type.
- Parameters:
val – the column value to parse (often a string)
type – the string type ‘name’ used to find the appropriate parser
- Returns:
the parsed value, the unchanged val if no parser for type is found in parsers, or None if there was a parser error (ValueError)
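The three possible outcomes of parsing a value can be sketched as below. PARSERS and this standalone parse_value function are illustrative assumptions, not the provider's actual parser table or method.

```python
# Hypothetical parser table keyed by type name.
PARSERS = {"int": int, "float": float, "str": str}


def parse_value(val, type_name, parsers=PARSERS):
    """Parse val with the parser registered for type_name."""
    parser = parsers.get(type_name)
    if parser is None:
        return val  # no parser known for this type: return the value unchanged
    try:
        return parser(val)
    except ValueError:
        return None  # the parser rejected the value


parsed = [parse_value("42", "int"), parse_value("x", "int"), parse_value("x", "custom")]
```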
- class galaxy.datatypes.dataproviders.column.DictDataProvider(source, column_names=None, **kwargs)[source]¶
Bases: ColumnarDataProvider
Data provider that zips column_names and columns from the source’s contents into a dictionary.
A combination use of both column_names and indeces allows ‘picking’ key/value pairs from the source.
Note
The superclass constructors are passed kwargs, so their params (limit, offset, etc.) are also applicable here.
- settings: Dict[str, str] = {'column_count': 'int', 'column_names': 'list:str', 'column_types': 'list:str', 'comment_char': 'str', 'deliminator': 'str', 'filters': 'list:str', 'indeces': 'list:int', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'parse_columns': 'bool', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
- __init__(source, column_names=None, **kwargs)[source]¶
- Parameters:
column_names – an ordered list of strings that will be used as the keys for each column in the returned dictionaries. Each returned dictionary will have at most as many key/value pairs as there are column names provided.
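Zipping names onto columns behaves like Python's built-in zip, which stops at the shorter of its inputs. The sketch below shows that effect; to_dict is a hypothetical helper, not the provider itself.

```python
def to_dict(column_names, columns):
    """Pair each column name with the column at the same position."""
    # zip() stops at the shorter sequence, so extra columns are dropped.
    return dict(zip(column_names, columns))


record = to_dict(["chrom", "start"], ["chr1", "100", "200"])
```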
galaxy.datatypes.dataproviders.dataset module¶
galaxy.datatypes.dataproviders.decorators module¶
galaxy.datatypes.dataproviders.exceptions module¶
DataProvider related exceptions.
- exception galaxy.datatypes.dataproviders.exceptions.InvalidDataProviderSource(source=None, msg='')[source]¶
Bases: TypeError
Raised when an unusable source is passed to a provider.
- exception galaxy.datatypes.dataproviders.exceptions.NoProviderAvailable(factory_source, format_requested=None, msg='')[source]¶
Bases: TypeError
Raised when no provider is found for the given format_requested.
- Parameters:
factory_source – the item that the provider was requested from
format_requested – the format_requested (a hashable key to access factory_source.datatypes with)
Both params are attached to the exception instance and are accessible to the try/except handler that catches it.
Meant to be used within a class that builds dataproviders (e.g. a Datatype)
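Attaching the parameters to the exception lets the handler report what was requested and from where. The following standalone sketch shows the pattern; NoProviderSketch and FakeDatatype are hypothetical stand-ins, and the default message text is an assumption.

```python
class NoProviderSketch(TypeError):
    """Hypothetical stand-in for NoProviderAvailable."""

    def __init__(self, factory_source, format_requested=None, msg=""):
        self.factory_source = factory_source
        self.format_requested = format_requested
        super().__init__(msg or f"No data provider available for: {format_requested}")


class FakeDatatype:
    """Hypothetical factory that only knows how to build a 'line' provider."""

    providers = {"line": iter}

    def provider(self, source, format_requested):
        if format_requested not in self.providers:
            raise NoProviderSketch(self, format_requested)
        return self.providers[format_requested](source)


datatype = FakeDatatype()
try:
    datatype.provider(["a", "b"], "column")
except NoProviderSketch as error:
    caught = error
```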
galaxy.datatypes.dataproviders.external module¶
galaxy.datatypes.dataproviders.hierarchy module¶
galaxy.datatypes.dataproviders.line module¶
Dataproviders that iterate over lines from their sources.
- class galaxy.datatypes.dataproviders.line.FilteredLineDataProvider(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]¶
Bases: LimitedOffsetDataProvider
Data provider that yields lines of data from its source allowing optional control over which line to start on and how many lines to return.
- DEFAULT_COMMENT_CHAR = '#'¶
- settings: Dict[str, str] = {'comment_char': 'str', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
- __init__(source, strip_lines=True, strip_newlines=False, provide_blank=False, comment_char='#', **kwargs)[source]¶
- Parameters:
strip_lines (bool) – remove whitespace from the beginning and end of each line (or not). Optional: defaults to True
strip_newlines – remove newlines only (only takes effect when strip_lines is False). Optional: defaults to False
provide_blank (bool) – are empty lines considered valid and provided? Optional: defaults to False
comment_char (str) – character(s) that indicate a line isn’t data (a comment) and should not be provided. Optional: defaults to ‘#’
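How the four options interact can be sketched as a single line filter. This is an illustration under stated assumptions, not the provider's own filter method: filter_line is a hypothetical function, and the order of the checks (strip first, then blank test, then comment test) is assumed.

```python
def filter_line(line, strip_lines=True, strip_newlines=False,
                provide_blank=False, comment_char="#"):
    """Return the cleaned line, or None if it should not be provided."""
    if strip_lines:
        line = line.strip()
    elif strip_newlines:
        # Only reached when strip_lines is False.
        line = line.strip("\n")
    if not provide_blank and line == "":
        return None
    if comment_char and line.startswith(comment_char):
        return None
    return line


raw = ["# header\n", "  chr1\t10  \n", "\n", "chr2\t20\n"]
kept = [out for out in map(filter_line, raw) if out is not None]
```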
- class galaxy.datatypes.dataproviders.line.RegexLineDataProvider(source, regex_list=None, invert=False, **kwargs)[source]¶
Bases: FilteredLineDataProvider
Data provider that yields only those lines of data from its source that do (or do not when invert is True) match one or more of the given list of regexes.
Note
The regex matches are effectively OR’d (if any regex matches the line it is considered valid and will be provided).
- settings: Dict[str, str] = {'comment_char': 'str', 'invert': 'bool', 'limit': 'int', 'offset': 'int', 'provide_blank': 'bool', 'regex_list': 'list:escaped', 'strip_lines': 'bool', 'strip_newlines': 'bool'}¶
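OR-ing several patterns and optionally inverting the result can be sketched with the standard re module. make_regex_filter is a hypothetical helper, and using re.match rather than re.search is an assumption.

```python
import re


def make_regex_filter(regex_list, invert=False):
    """Return a function that keeps lines matching any regex (or none, if inverted)."""
    patterns = [re.compile(regex) for regex in regex_list]

    def keep(line):
        # any() gives the OR behaviour: one matching pattern is enough.
        matched = any(pattern.match(line) for pattern in patterns)
        return line if matched != invert else None

    return keep


keep = make_regex_filter([r"chr1\b", r"chrX\b"])
drop = make_regex_filter([r"chr1\b"], invert=True)
```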
- class galaxy.datatypes.dataproviders.line.BlockDataProvider(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]¶
Bases: LimitedOffsetDataProvider
Class that uses formats where multiple lines combine to describe a single datum. The data output will be a list of either map/dicts or sub-arrays.
Uses FilteredLineDataProvider as its source (kwargs not passed).
e.g. Fasta, GenBank, MAF, hg log
Note: memory intensive (gathers a list of lines before output)
- __init__(source, new_block_delim_fn=None, block_filter_fn=None, **kwargs)[source]¶
- Parameters:
new_block_delim_fn (function) – function returning True or False that determines whether a given line is the start of a new block.
block_filter_fn (function) – function that determines if a block is valid and will be provided. Optional: defaults to None (no filtering)
- filter(line)[source]¶
Line filter here being used to aggregate/assemble lines into a block and determine whether the line indicates a new block.
- Parameters:
line (str) – the incoming line from the source
- Returns:
a block or None
- is_new_block(line)[source]¶
Returns True if the given line indicates the start of a new block (and the current block should be provided) or False if not.
- assemble_current_block()[source]¶
Build the current data into a block.
Called per block (just before providing).
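Assembling lines into blocks can be sketched with a generator that collects lines until the delimiter function signals a new block. The example uses FASTA-style headers; iter_blocks is a hypothetical helper and returns each block as a plain list of lines, which is a simplification of the map/dict output mentioned above.

```python
def iter_blocks(lines, new_block_delim_fn, block_filter_fn=None):
    """Group lines into blocks, starting a new block where the delim function is True."""
    block = []
    for line in lines:
        if new_block_delim_fn(line) and block:
            # A new block starts here, so provide the finished one first.
            if block_filter_fn is None or block_filter_fn(block):
                yield block
            block = []
        block.append(line)
    # Provide the final block once the source is exhausted.
    if block and (block_filter_fn is None or block_filter_fn(block)):
        yield block


fasta = [">seq1", "ACGT", "TTGA", ">seq2", "GGCC"]
blocks = list(iter_blocks(fasta, lambda line: line.startswith(">")))
long_blocks = list(iter_blocks(fasta, lambda line: line.startswith(">"),
                               block_filter_fn=lambda block: len(block) > 2))
```

Because a block is only provided once its last line has been seen, all lines of a block are held in memory at once, which matches the memory note above.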