Welcome to datesy’s documentation!¶
Datesy, making DATa handling EaSY – the intro to data handling in python
Introduction¶
Making intro to data handling in python nice and easy!
The datesy
package provides easy handling (read/write) of standard file types, the conversion between the file types
as well as basic data inspection functionalities.
It is designed for an easy start into the python world and data handling in it without having to think of unnecessary basics.
Main Usage¶
Datesy, making DATa handling EaSY, is mostly helpful if you are looking for:
- loading/dumping data to a standard file format like json, csv, xml, xls(x)
- inspecting complex data like searching for a path in a dictionary
- mapping strings and their properties
Motivation¶
History¶
The idea to this package came during the work as a consultant with a customer where lot’s of files needed to be read, transformed, inspected etc. and no adequate tools besides searching & filtering with Excel of files partly in the range of GBs were around.
With starting to share the python insights and the code fragments, the only logical next step was do create a really reusable code fragment - a python package.
Future Development¶
The package is designed to be an easy start into data handling with python. Therefore, its desire is to take care of standard tasks the programmer just does not want to think of but can concentrate on the actual tasks, the data handling.
Whenever there is a task that is done too often in data handling and inspection which can be taken care of in a standardized way, this package will happily be expanded for enabling you to simplify your job.
If needed/desired, further datafile-formats will be supported for having a nice and standardized way of loading/writing those as well.
Limitations¶
This package is designed to be used by anybody who is new to python. Therefore functions are explicitly held limited to their magic and described accordingly.
There are few things really the big shit rather than simply helping with small tasks which you could have written yourself in a few lines of code but didn’t want to think about.
For deep data analysis other packages are far more powerful and maybe helpful to you. Think of datesy
more of the little butler taking care of some basic tasks for you.
This package is compatible to PyPy’s version 3.6.
The datesy package¶
The datesy package is divided in 5 main components:
Subpackages¶
All actions of interacting with files are to be found here:
File I/O subpackage¶
All functions of datesy taking care of file I/O are listed here, separated by file_type. Only exception is the first module, providing functions for file selection.
file_selection module¶
The file_selection module provides multiple supporting functions for interaction with files
-
datesy.file_IO.file_selection.
get_latest_file_from_directory
(directory, file_ending=None, pattern=None, regex=None)¶ Return the latest file_name (optionally filtered) from a directory
Parameters: - directory (str) – the directory where to get the latest file_name from
- file_ending (str, set, optional) – the file_name ending specifying the file_name type
- pattern (str, optional) – pattern for the file_name to match
DataFile_*.json
where*
could be a date or other strings - regex (str, optional) – a regular_expression (regex) for pattern matching
Returns: the file_name with the latest change date
Return type: str
-
datesy.file_IO.file_selection.
get_file_list_from_directory
(directory, file_ending=None, pattern=None, regex=None)¶ Return all files (optionally filtered) from directory in a list
Parameters: - directory (str) – the directory containing the desired files
- file_ending (str, set, optional) – the file_name’s ending specifying the file type
- pattern (str, optional) – pattern for the file_names to match
DataFile_*.json
where*
could be a date or other strings - regex (str, optional) –
a regular_expression (regex) for pattern matching
Returns: a list of all relative file_name directories
Return type: list
-
datesy.file_IO.file_selection.
return_file_list_if_path
(path, file_ending=None, pattern=None, regex=None, return_always_list=False)¶ Return all files in directory (optionally specified with options) if path is a directory
Parameters: - path (str) – the path to test if directory
- file_ending (str, set, optional) – the file_name ending specifying the file_name type for the files in the directory
- pattern (str, optional) – pattern for the file_names in directory to match
DataFile_*.json
where*
could be a date or other strings - regex (str, optional) –
a regular_expression (regex) for pattern matching of the file_names
- return_always_list (bool, optional) – if a single path shall be returned as in a list
Returns: if directory the list of files else the path (in a list if return_always_list is set)
Return type: list, str
-
datesy.file_IO.file_selection.
check_file_name_ending
(file_name, ending)¶ Check if the file_name has the expected file_ending
If one of the provided endings is the file_name’s ending return True, else False
Parameters: - file_name (str) – The file_name to check the ending for
The file_name may contain a path, so
file_name.ending
as well aspath/to/file_name.ending
will work - ending (str, set, list) – The desired ending or multiple desired endings
For single entries e.g.
.json
orcsv
, for multiple endings e.g.['.json', 'csv']
Returns: True if the file_name’s ending is in the given ending, else False
Return type: bool
- file_name (str) – The file_name to check the ending for
The file_name may contain a path, so
json_file module¶
The json_file module takes care of all I/O interactions concerning json files
-
datesy.file_IO.json_file.
load
(path)¶ Load(s) json file(s) and returns the dictionary/-ies Specifying a file_name: one file will be loaded. Specifying a directory: all *.json files will be loaded.
Parameters: path (str) – path to a file_name or directory Returns: dictionary representing the json {file_name: {data}}
Return type: dict
-
datesy.file_IO.json_file.
load_single
(file_name)¶ Load a single json file
Parameters: file_name (str) – file_name to load from Returns: the loaded json as a dict {data}
Return type: dict
-
datesy.file_IO.json_file.
load_these
(file_name_list)¶ Load specified json files and return the data in a dictionary with file_name as key
Parameters: file_name_list (list) – list of file_names to load from Returns: the dictionaries from the files as values of file_name as key {file_name: {data}}
Return type: dict(dict)
-
datesy.file_IO.json_file.
load_all
(directory)¶ Load all json files in the directory and return the data in a dictionary with file_name as key
Parameters: directory (str) – the directory containing the json files Returns: the dictionaries from the files as values of file_name as key {file_name: {data}}
Return type: dict(dict)
-
datesy.file_IO.json_file.
write
(file_name, data, beautify=True, sort=False)¶ Save json from dict to file
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .json
- data (dict) – the dictionary to be saved as json
- beautify (bool, optional) – if the data is represented in single row or human readable presented (default: human readable)
- sort (bool, optional) – if the keys shall be ordered (default: false)
csv_file module¶
The csv_file module takes care of all I/O interactions concerning csv files
-
datesy.file_IO.csv_file.
load
(path, **kwargs)¶ Load(s) csv file(s) and returns the rows Specifying a file_name: one file will be loaded. Specifying a directory: all *.csv files will be loaded.
Parameters: - path (str) – path to a file_name or directory
- kwargs (optional) – csv dialect options
Returns: list of lists if a single file_name was provided:
[[row1.1, row1.2]]
dict of list of lists if multiple files provided:{file_name : [[row1.1, row1.2]]}
Return type: list, dict
-
datesy.file_IO.csv_file.
load_single
(file_name, **kwargs)¶ Load a csv file and return the rows
Parameters: - file_name (str) – file_name to load from
- kwargs (optional) – csv dialect options
Returns: list of lists representing the csv data
[[row1.1, row1.2]]
Return type: list
-
datesy.file_IO.csv_file.
load_these
(file_name_list, **kwargs)¶ Load specified csv files and return the rows in a dictionary with file_name as key
Parameters: - file_name_list (list) – list of file_names to load from
- kwargs (optional) – csv dialect options
Returns: the rows from the files as values of file_name as key
{file_name : [[row1.1, row1.2]]}
Return type: dict
-
datesy.file_IO.csv_file.
load_all
(directory, **kwargs)¶ Load all csv files in the directory and return the rows in a dictionary with file_name as key
Parameters: - directory (str) – the directory containing the csv files
- kwargs (optional) – csv dialect options
Returns: the rows from the files as values of file_name as key
{file_name : [[row1.1, row1.2]]}
Return type: dict
-
datesy.file_IO.csv_file.
write
(file_name, data, main_key_name=None, main_key_position=0, order=None, if_empty_value=None, **kwargs)¶ Save a row based document from dict or list to file If presented a dictionary, converting to rows is done by the dict_to_rows method of this package.
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as file_name.csv
- data (dict, list) – the dictionary or list to be saved as csv
- main_key_name (str, optional) – if the json or dict does not have the main key as a single key present (
{main_element_name: dict}
), it needs to be specified - main_key_position (int, optional) – the position in csv of the dictionary main key
- order (dict, list, optional) – for defining a specific order of the keys. if dict, format:
{int: str}
either a dictionary with the specified positions in a dictionary with positions as keys (integers) or in a list - if_empty_value (any, optional) – the value to set when no handling is available default is “delete” leading to be an empty value
- kwargs (optional) – csv dialect options
-
datesy.file_IO.csv_file.
write_from_rows
(file_name, rows, **kwargs)¶ Save row based document from rows to file
Parameters: - file_name (str) – the file_name to save the data under. if no ending is provided, saved as file_name.csv
- rows (list) – list of lists to write to file_name
- kwargs (optional) – csv dialect options
-
datesy.file_IO.csv_file.
write_from_dict
(file_name, data, main_key_name=None, main_key_position=0, order=None, if_empty_value=None, **kwargs)¶ Save a row based document from dict to file
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as file_name.csv
- data (dict) – the dictionary to be saved as csv
- main_key_name (str, optional) – if the json or dict does not have the main key as a single key present (
{main_element_name: dict}
), it needs to be specified - order (dict {int: str}, list, optional) – for defining a specific order of the keys either a dictionary with the specified positions in a dictionary with positions as keys (integers) or in a list
- if_empty_value (any, optional) – the value to set when no handling is available default is “delete” leading to be an empty value
- main_key_position (int, optional) – the position in csv of the dictionary main key
- kwargs (optional) – csv dialect options
xls_file module¶
The xls_file module takes care of all I/O interactions concerning xls(x) files
-
datesy.file_IO.xls_file.
load_single_sheet
(file_name, sheet=None)¶ Load a xls(x) file’s (first) sheet to a pandas.DataFrame
Parameters: - file_name (str) – file_name to load from
- sheet (str, optional) – a specified sheet_name to extract. default is first sheet
Returns: pandas.DataFrame representing the xls(x) file
Return type: pandas.DataFrame
-
datesy.file_IO.xls_file.
load_these_sheets
(file_name, sheets)¶ Load from a xls(x) file_name the specified sheets to a pandas.DataFrame as values to sheet_names as keys in a dictionary
Parameters: - file_name (str) – file_name to load from
- sheets (list) – sheet_names to load
Returns: dictionary containing the sheet_names as keys and pandas.DataFrame representing the xls(x) sheets
{sheet_name: pandas.DataFrame}
Return type: dict(pandas.DataFrame)
-
datesy.file_IO.xls_file.
load_all_sheets
(file_name)¶ Load from a xls(x) file all its sheets to a pandas.DataFrame as values to sheet_names as keys in a dictionary
Parameters: file_name (str) – file_name to load from Returns: dictionary containing the sheet_names as keys and pandas.DataFrame representing the xls(x) sheets {sheet_name: pandas.DataFrame}
Return type: dict
-
datesy.file_IO.xls_file.
load_these_files
(file_name_list)¶ Load the specified xls(x) files with all their sheets to a pandas.DataFrame as values to sheet_names as keys in a dictionary
Parameters: file_name_list (list) – list of file_names to load from Returns: the data from the sheets in a dictionary with sheet_name as key within again a dictionary with file_name as key {file_name: {sheet_name: pandas.DataFrame}}
Return type: dict
-
datesy.file_IO.xls_file.
load_all_files
(directory)¶ Load all xls(x) files in the directory with all their sheets to a pandas.DataFrame as values to sheet_names as keys in a dictionary
Parameters: directory (str) – the directory containing the xlsx files Returns: the data from the sheets in a dictionary with sheet_name as key within again a dictionary with file_name as key {file_name: {sheet_name: pandas.DataFrame}}
Return type: dict
-
datesy.file_IO.xls_file.
load
(path)¶ Load all xls(x) files in the directory with all their sheets to a pandas.DataFrame as values to sheet_names as keys in a dictionary Specifying a file_name: one file will be loaded. Specifying a directory: all *.xls(x) files will be loaded.
Parameters: path (str) – path to a file_name or directory Returns: dictionary containing the sheets as panda.DataFrames: {file_name: {sheet_name: pandas.DataFrame}}
Return type: dict
-
datesy.file_IO.xls_file.
write_single_sheet_from_DataFrame
(file_name, data_frame, sheet_name=None, auto_size_cells=True)¶ Save a pandas.DataFrame to file
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .xlsx
- data_frame (pandas.DataFrame) – pandas.DataFrame to write to file_name
- sheet_name (str, optional) – a sheet_name containing the data
- auto_size_cells (bool, optional) – if the auto-sizing of the cells shall be active
-
datesy.file_IO.xls_file.
write_multi_sheet_from_DataFrames
(file_name, data_frames, sheet_order=None, auto_size_cells=True)¶ Save multiple pandas.DataFrames to one file
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .xlsx
- data_frames (dict {sheet_name: DataFrame}) – dict of data_frames
- sheet_order (dict {int: str}, list, optional) – either a dictionary with the specified positions in a dictionary with positions as keys (integers) or in a list
- auto_size_cells (bool, optional) – if the auto-sizing of the cells shall be active
-
datesy.file_IO.xls_file.
write_single_sheet_from_dict
(file_name, data, main_key_name=None, sheet=None, order=None, inverse=False, auto_size_cells=True)¶ Save a dictionary (
{main_key_name: {data}}
) as xlsx document to file Uses the dict_to_pandas_data_frame method of this package for converting the dictionary to pandas.DataFrame.Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .xlsx
- data (dict) – the dictionary to be saved as xlsx
{main_key_name: {data}}
- main_key_name (str, optional) – if the json or dict does not have the main key as a single {main_element : dict} present, it needs to be specified
- sheet (str, optional) – a sheet name for the handling
- order (dict, list, optional) – either a dictionary with the specified positions in a dictionary with positions as keys (integers) or in a list
- inverse (bool, optional) – if columns and rows shall be switched
- auto_size_cells (bool, optional) – if the auto-sizing of the cells shall be active
-
datesy.file_IO.xls_file.
write_multi_sheet_from_dict_of_dicts
(file_name, data, order=None, auto_size_cells=True)¶ Save dictionaries (
{sheet_name: {main_key_name: {data}}}
) as xlsx document to file Uses the dict_to_pandas_data_frame method of this package for converting the dictionary to pandas.DataFrame.Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .xlsx
- data (dict) – the dictionary to be saved as xlsx
{sheet_name: {main_key_name: {data}}}
- order (dict, list, optional) – either a dictionary with the specified positions in a dictionary with positions as keys (integers) or in a list
- auto_size_cells (bool, optional) – if the auto-sizing of the cells shall be active
xml_file module¶
The xml_file module takes care of all I/O interactions concerning xml files
-
datesy.file_IO.xml_file.
load
(path)¶ Load(s) json file(s) and returns the dictionary/-ies Specifying a file_name: one file will be loaded. Specifying a directory: all *.json files will be loaded.
Parameters: path (str) – path to a file_name or directory Returns: dictionary representing the json {file_name: {data}}
Return type: dict
-
datesy.file_IO.xml_file.
load_single
(file_name)¶ Load a single xml file
Parameters: file_name (str) – file_name to load from Returns: the xml as ordered dict {collections.OrderedDict}
Return type: dict
-
datesy.file_IO.xml_file.
load_these
(file_name_list)¶ Load specified xml files and return the data in a dictionary with file_name as key
Parameters: file_name_list (list) – list of file_names to load from Returns: the dictionaries from the files as values of file_name as key {file_name: {collections.OrderedDict}
Return type: dict(collections.OrderedDict)
-
datesy.file_IO.xml_file.
load_all
(directory)¶ Load all xml files in the directory and return the data in a dictionary with file_name as key
Parameters: directory (str) – the directory containing the xml files Returns: the dictionaries from the files as values of file_name as key {file_name: {collections.OrderedDict}}
Return type: dict(collections.OrderedDict)
-
datesy.file_IO.xml_file.
write
(file_name, data, main_key_name=None)¶ Save xml file from dict or collections.OrderedDict to file
Parameters: - file_name (str) – the file_name to save under. if no ending is provided, saved as .xml
- data (dict, collections.OrderedDict) – the dictionary to be saved as xml
- main_key_name (str) – if the dict/OrderedDict does not have the main key as a single key present (
{main_element_name: dict}
), it needs to be specified
All actions of interacting with databases are to be found here:
Database I/O subpackage¶
All functions of datesy taking care of database I/O are listed here.
The functionality is the same for all supported database types.
Simply exchange _db_helper.Database
with the desired database.
Database¶
-
class
datesy.database_IO._db_helper.
Database
(host, port, user, password, database, auto_creation=False)¶ Representing a database as an object
On initialization the connection to the database is established. For clean working please call
close()
at end of db usage.Parameters: - host (str) – url or ip of host
- port (int) – port_no
- user (str) – user_name
- password (str) – password for this user
- database (str) – the database to connect to
- auto_creation (bool, optional) – if all tables shall be initiated as variables of object
- kwargs – specific information to database, see details to each database
-
close
()¶ Close connection to database
-
name
¶ Name of the database
Returns: Return type: str
-
table
(table_name)¶ Return a database_table as an object
Parameters: table_name (str) – the desired table Returns: class Table as representation of table Return type: Table
-
tables
¶ Get the available tables of database
Returns: representing all tables of database Return type: list
-
update_table_data
()¶ Update the data concerning the list of available tables
Returns: available tables at database Return type: list
Table¶
-
class
datesy.database_IO._db_helper.
Table
(table_name, database)¶ Create a representation of a database table
Parameters: - table_name (str) –
- database (Database) –
-
__getitem__
(key)¶ Get row of primary key
works like
value = database.table[key]
Parameters: key (any) – matching value in primary column Returns: tuple items representing every matched row in database Return type: Row
-
__setitem__
(primary_key, row)¶ Set/update a single row for primary key
works like
database.table[key] = row
Parameters: - primary_key (any, None) – the value of the primary column. If None -> new row inserted
- row (list, dict) – the row data in either correct order or in a dict with column_name
-
__delitem__
(key)¶ Delete single row for given primary key
works like
del database.table[key]
Parameters: key (any) – matching value in primary column
-
delete_where
(*args, **kwargs)¶ Delete rows matching the where conditions
Parameters: - args (conditions) –
- kwargs (conditions) –
-
execute_raw_sql
(sql_query)¶ Execute raw sql statements
Parameters: sql_query (str) – sql query Returns: data from database Return type: list
-
get_where
(*args, **kwargs)¶ Get rows where value matches the defined column:
columns=key
Parameters: **kwargs – column = key values Returns: tuple items representing every matched row in database Return type: list(Row)
-
insert
(row: (<class 'list'>, <class 'dict'>), primary_key=None)¶ Insert new row
Parameters: - row (list, dict) – row_data
- primary_key (any, optional) – primary_key optional for tables with primary_key
-
primary
¶ Get the primary key of this table
Returns: the primary column as string if one exists Return type: str, None
-
run_query
(debug=False)¶ Run the currently composed query :param debug: if the query shall be printed to command_line :type debug: bool, optional
Returns: data from database Return type: list
-
schema
¶ Get schema of table
Returns: dictionary containing the column as key and schema_info as dictionary for every column Return type: OrderedDict
-
update_primary_data
()¶ Update the primary key of the table
Returns: the primary column as string if one exists Return type: str, None
-
update_schema_data
()¶ Update the schema of the table
Returns: dictionary containing the column as key and schema_info as dictionary for every column Return type: OrderedDict
-
update_where
(values: (<class 'list'>, <class 'dict'>), *args, primary_key=None, limit_rows: int = False, **kwargs)¶ Update all rows based on given conditions
Parameters: - values (list, dict) – new values to set. either as a full row in a list or specified columns with dictionary
- args – conditions
- primary_key (any, optional) – the value of the primary column (if table has primary_key)
- limit_rows (int) – number of rows to affect
- kwargs – conditions
Row¶
-
class
datesy.database_IO._db_helper.
Row
(table, data)¶ Representation of a database row entry
Parameters: - table (Table) – table belonging to
- data (list, tuple, dict) – data to represent
-
__getitem__
(column)¶ Get column_value of row
works like
value = database.table.row[column]
Parameters: column (int, str) – position in table or column_name Returns: Return type: Item
-
__setitem__
(column, value)¶ Set new value to column
works like
database.table.row[column] = value
Parameters: - column (int, str) – position in table or column_name
- value (any) – new value
-
__delitem__
(column)¶ Delete/reset to default a column
works like
del database.table.row[column]
Parameters: column (int, str) – position in table or column_name
-
sync
(*missing_columns)¶ Update row from database to local
Parameters: missing_columns (str, optional) – if rows shall be left out when updating (e.g. if known that a timestamp has changed and it shall be fetched)
Item¶
-
class
datesy.database_IO._db_helper.
Item
(row, column, table, value=None)¶ Representation of a database entry
Parameters: -
__set__
(instance, value)¶ Set new value
Parameters: - instance –
- value (any) – new value
-
__delete__
(instance)¶ Delete/reset to default this value
-
column
¶ Column this item belonging to
Returns: column Return type: str
-
sync
()¶ Update entry from database to local
-
value
¶ Value of this item
Returns: value Return type: any
-
Submodules¶
datesy.convert module¶
All actions of transforming data from different file formats are to be found here
-
datesy.convert.
rows_to_dict
(rows, main_key_position=0, null_value='delete', header_line=0, contains_open_ends=False)¶ Convert a row of rows (e.g. csv) to dictionary
Parameters: - rows (list) – the row based data to convert to dict
- main_key_position (int, optional) – if the main_key is not on the top left, its position can be specified
- null_value (any, optional) – if an emtpy field in the lists shall be represented somehow in the dictionary
- header_line (int, optional) – if the header_line is not the first one, its position can be specified
- contains_open_ends (bool, optional) – if each row is not in the same length (due to last set entry as last element in row), a length check for corrupted data can be ignored
Returns: dictionary containing the information from row-based data
Return type: dict
-
datesy.convert.
dict_to_rows
(data, main_key_name=None, main_key_position=None, if_empty_value=None, order=None)¶ Convert a dictionary to rows (list(lists))
Parameters: - data (dict) – the data to convert in form of a dictionary
- main_key_name (str, optional) – if the data isn’t provided as {main_key: data} the key needs to be specified
- main_key_position (int, optional) – if the main_key shall not be on top left of the data the position can be specified
- if_empty_value (any, optional) – if a main_key’s sub_key is not set something different than blank can be defined
- order (dict, list, None, optional) – if a special order for the keys is required
Returns: list of rows representing the csv based on the main_element_position
Return type: list(lists)
-
datesy.convert.
pandas_data_frame_to_dict
(data_frame, main_key_position=0, null_value='delete', header_line=0)¶ Converts a single file_name from xlsx to json
Parameters: - data_frame (pandas.core.frame.DataFrame) –
- main_key_position (int, optional) –
- null_value (any, optional) –
- header_line (int, optional) –
Returns: the dictionary representing the xlsx based on main_key_position
Return type: dict
-
datesy.convert.
dict_to_pandas_data_frame
(data, main_key_name=None, order=None, inverse=False)¶ Convert a dictionary to pandas.DataFrame
Parameters: - data (dict) – dictionary of handling
- main_key_name (str, optional) – if the json or dict does not have the main key as a single {main_element : dict} present, it needs to be specified
- order (dict, list, optional) – list with the column names in order or dict with specified key positions
- inverse (bool, optional) – if columns and rows shall be switched
Returns: DataFrame representing the dictionary
Return type: pandas.DataFrame
-
datesy.convert.
xml_to_standard_dict
(ordered_data, reduce_orderedDicts=False, reduce_lists=False, manual_selection_for_list_reduction=False)¶ Convert a xml/orderedDict to normal dictionary
Parameters: - ordered_data (orderedDict) – input xml data to convert to standard dict
- reduce_orderedDicts (bool, optional) – if collections.orderedDicts shall be converted to normal dicts
- reduce_lists (bool, list, set, optional) – if lists in the dictionary shall be converted to dictionaries with transformed keys (list_key + unique key from dictionary from list_element) if list or set is provided, only these values will be reduced
- manual_selection_for_list_reduction (bool, optional) – if manually decision on list reduction shall be used
all keys in
reduce_lists
will be automatically reduced
Returns: the normalized dictionary
Return type: dict
datesy.inspect module¶
All actions of inspecting data are to be found here
-
datesy.inspect.
find_header_line
(data, header_keys)¶ Find the header line in row_based data_structure NOT IMPLEMENTED YET: Version 0.9 feature
Parameters: - data (list, pandas.DataFrame) –
- header_keys (str, list, set) – some key(s) to find in a row
Returns: the header_line
Return type: int
-
datesy.inspect.
find_key
(data, key=None, regex_pattern=None)¶ Find a key in a complex dictionary
Parameters: - data (dict) – the data structure to find the key
- key (str, optional) – a string to be found
- regex_pattern (str, optional) – a regex match to be found
Returns: all matches and their path in the structure
{found_key: path_to_key}
Return type: dict
datesy.matching module¶
All actions of mapping data to other data as well as the functions helpful for that are to be found here
-
datesy.matching.
simplify_strings
(to_simplify, lower_case=True, simplifier=True)¶ Simplify a string, set(strings), list(strings), keys in dict Options for simplifying include: lower capitals, separators, both (standard), own set of simplifier
Parameters: - to_simplify (list, set, string) – the string(s) to simplify presented by itself or as part of another data format
- lower_case (bool, optional) – if the input shall be converted to only lower_case (standard: True)
- simplifier (str, optional) – the chars to be removed from the string. if type bool and True, standard chars
_ , | \n ' & " % * - \
used
Returns: simplified values
{simplified_value: input_value}
Return type: dict
-
datesy.matching.
ease_match_similar
(list_for_matching, list_to_be_matched_to, simplified=False, similarity_limit_for_matching=0.6, print_auto_matched=False)¶ Return a dictionary with
list_for_matching
as keys andlist_to_be_matched_to
as values based on most similarity. Matching twice to the same value is possible! Similarity distance for stopping the matching is set by distance_for_automatic_vs_manual_matching. Faster than datesy.matching.match_comprehensive but when having very similar strings more likely to contain errors.Parameters: - list_for_matching (list, set) – Iterable of strings which shall be matched
- list_to_be_matched_to (list, set) – Iterable of stings which shall be matched to
- simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by all small letters or unifying & deleting separators separators or any other list of strings provided
- print_auto_matched (bool, optional) – Printing the matched entries during process (most likely for debugging)
- similarity_limit_for_matching (float, optional) – For not matching the most irrelevant match which could exist
Returns: - match (dict) – {value_for_matching: value_to_be_mapped_to}
- no_match (set) – A set of all values from list_for_matching that could not be matched
-
datesy.matching.
match_comprehensive
(list_for_matching, list_to_be_matched_to, simplified=False)¶ Return a dictionary with
list_for_matching
as keys andlist_to_be_matched_to
as values based on most similarity. All values of both iterables get compared to each other and highest similarities are picked. Slower than datesy.matching.ease_match_similar but more precise.Parameters: - list_for_matching (list, set) – Iterable of strings which shall be matched
- list_to_be_matched_to (list, set) – Iterable of stings which shall be matched to
- simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by all small letters or unifying & deleting separators separators or any other list of strings provided
Returns: - match (dict) – {value_for_matching: value_to_be_mapped_to}
- no_match (set) – A set of all values from list_for_matching that could not be matched
-
datesy.matching.
match_similar_with_manual_selection
(list_for_matching, list_to_be_matched_to, simplified=False, minimal_distance_for_automatic_matching=0.1, print_auto_matched=False, similarity_limit_for_manual_checking=0.6)¶ Return a dictionary with
list_for_matching
as keys andlist_to_be_matched_to
as values based on most similarity. All possible matches not matched automatically (set limit with minimal_distance_for_automatic_matching) can be handled interactively. Similarity distance for stopping the matching is set by distance_for_automatic_vs_manual_matching.Parameters: - list_for_matching (list, set) – Iterable of strings which shall be matched
- list_to_be_matched_to (list, set) – Iterable of stings which shall be matched to
- simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by all small letters or unifying & deleting separators separators or any other list of strings provided
- print_auto_matched (bool, optional) – Printing the matched entries during process (most likely for debugging)
- minimal_distance_for_automatic_matching (float, optional) – If there is a vast difference between the most and second most matching value, automatically matching is provided This parameter provides the similarity distance to be reached for automatically matching
- similarity_limit_for_manual_checking (float, optional) – For not showing/matching the most irrelevant match which could exist
Returns: - match (dict) – {value_for_matching: value_to_be_mapped_to}
- no_match (set) – A set of all values from list_for_matching that could not be matched
Examples¶
Installation/Usage¶
For installation run pip3 install datesy in terminal.
For using in Python3 script, import it at the beginning of the script:
import datesy
# your code
pass
File interaction¶
Check here all the examples for interacting with files
Selecting Files¶
The file selection provides multiple ways for retrieving the desired files. All selecting functions contain three possibility to match:
- file_ending: matching 100% of the part after the last
.
- pattern: standard pattern for finding fixed strings with wildcards (like
SomeFixedName_*.csv
with * representing all kinds of string) - regex: a standard regular_expression (regex) matching the filename
The easiest way to get the latest file matching containing the name DataSource1
in the beginning and which is a .csv
file:
# Explicit way
file_name = datesy.file_selection.get_latest_file_from_directory(
directory="path/to/directory",
pattern="DataSource1*",
file_ending="csv"
)
# Shortened way
file_name = datesy.file_selection.get_latest_file_from_directory(
directory="path/to/directory",
pattern="DataSource1*.csv"
)
File I/O¶
The library provides a standardized way of interacting with files. For every file-type in the file_IO subpackage, there exist load- & write-functions following the same pattern. Only exception is the xls module due to the characteristics of sheets.
All-in-one/doing-all-the-magic loading functions¶
The most easy way to load data is with the load_file-type function. It is a shortcut for the specific ways of loading data in each file-type specific module:
data = datesy.load_csv(path="path/to/file.csv")
# data is list of lists representing the csv file
data = datesy.load_json(path="path/to/file.json")
# data is dictionary representing json file
The most easy way to write data is with the write_file-type function. It is again a shortcut to file-type specific modules:
# data is written to the csv file
datesy.write_csv(file_name="path/to/file.csv", data=data_to_write)
# data is written to the json file
datesy.load_json(file_name="path/to/file.json", data=data_to_write)
File-type specific modules: advanced reading/writing¶
For every file-type exist more specific functions for reading & writing the data. The presented examples from above are redirecting to the most general functions in the packages.
If using a IDE, the implemented functions will be shown to you directly with typing datesy./datesy.json_file.. If in interactive mode, simply type datesy.__all__/datesy.json_file.__all__. Switch the json_file to whatever submodule/-package you need.
Reading¶
The reading of the files is fairly simple
# load single json file
data = datesy.json_file.load_single(path="path/to/file.json")
# data is representing the json file
# load specific list of json files
data = datesy.json_file.load_these(file_name_list=["path/to/file1.json", "path/to/file2.json"])
# data is representing both json files; {file_name: json_file_value}
# load all json files from a directory
data = datesy.json_file.load_all(directory="/path/to/directory")
# data is representing all json files of this directory; {file_name: json_file_value}
# doing all of the above depending if `path` is file, list_of_files or directory
data = datesy.load_json(path="path/to/any")
# depending if single file or multiple files either dictionary representing json file or {file_name: json_file_value}
The last function is also reachable with the shortcut stated in the very beginning of the examples: datesy.load_json
Writing¶
For writing, the datesy package provides sometimes some more options for making life easier. The concept this package is designed, is to work most likely with data in form of a dictionary. Therefore, often shortcuts are provided.
Let’s have a look to row-based file-type csv (comma separated values): You can provide either row-based data (in python this would be a list of lists), or you can provide a dictionary instead and let datesy take care of the conversion. This little magic is part of the datesy.convert module, more details below.
# lets start with row-based data
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]
datesy.csv_file.write_from_rows(file_name="path/to/csv_file.csv", rows=example_rows)
# The result in the file:
# Header1,Header2,Header3
# Value11,Value12,Value13
# Value21,Value22,Value23
# in difference with data in form of a dictionary
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12",
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
datesy.csv_file.write_from_dict(file_name="path/to/csv_file.csv", data=example_dict)
# The result in the file is the same:
# Header1,Header2,Header3
# Value11,Value12,Value13
# Value21,Value22,Value23
# additionally the data can be provided without the naming of the main_key
# (in this case "Header1")
example_dict2 = {
"Value11": {
"Header2": "Value12",
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
datesy.csv_file.write_from_dict(
file_name="path/to/csv_file.csv",
data=example_dict,
main_key_name="Header1",
main_key_position=0
)
# The result in the file is still the same:
# Header1,Header2,Header3
# Value11,Value12,Value13
# Value21,Value22,Value23
Again, there is a function combining both writing methods, available also with a shortcut stated
in the very beginning of the examples: datesy.write_csv
xls/xlsx Files¶
The Microsoft Excel file interaction works slightly different since sheets are a feature not available to standard file formats like json, csv or xml. The standard output format is Pandas DataFrame.
Yet, interaction is still fairly simple:
data_frame = datesy.xls_file.load_single_sheet(file_name="path/to/file.xls") # .xlsx works with the same function
# returns a pandas.data_frame from first sheet
# you can specify a sheet_name
data_frame = datesy.xls_file.load_single_sheet(file_name="path/to/file.xls", sheet="Sheet_Name")
# returns a pandas.data_frame from sheet with provided name
# of course multiple sheets can be loaded
data = datesy.xls_file.load_these_sheets(file_name="path/to/file.xls", sheets=["Sheet_Name1", "Sheet_Name2"])
# just like the other loading functions, the sheet_name is the key in a dictionary containing the data_frame as value
# {"Sheet_Name": DataFrame}
# loading all sheets
data = datesy.xls_file.load_all_sheets(file_name="path/to/file.xls")
# {"Sheet_Name": DataFrame}
# reading multiple files is possible as well
data = datesy.xls_file.load_these_files(file_name_list=["path/to/file1.xls", "path/to/file2.xls"])
# {file_name: {sheet_name: DataFrame}}
Converting Data¶
Check here all the examples for converting data easily with datesy
datesy helps you to easily convert certain types of data. Typical data formats are row-based or in form of a dictionary.
Rows to dictionary¶
When e.g. reading a csv_file as stated above, a row-based data structure is returned. If for further processing the rows shall be dictionized, it’s as simple as this:
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]
example_rows = datesy.rows_to_dict(rows=example_dict)
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12",
"Header3": "Value13",
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
Relevant ID position / main key position:¶
It might occur, your most relevant key is not on the first position:
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]
example_dict = datesy.rows_to_dict(rows=example_rows, main_key_position=2)
example_dict = {
"Header3": {
"Value13": {
"Header1": "Value11",
"Header2": "Value12"
},
"Value23": {
"Header1": "Value21",
"Header2": "Value22"
}
}
}
As you can see, the third entry (int=2) is used as the main_key.
Missing values¶
Of course, data may be missing a value:
example_rows = [
["Header1", "Header2", "Header3"],
["Value11",, "Value13"],
["Value21", "Value22", "Value23"]
]
example_dict = datesy.rows_to_dict(rows=example_rows, null_value="delete")
example_dict = {
"Header1": {
"Value11": {
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
As you can see, the emtpy value in the rows is not represented in the dictionary.
Instead of missing the header_key you can also put any other value than delete
to this parameter for putting this to the exact spot:
example_rows = [
["Header1", "Header2", "Header3"],
["Value11",, "Value13"],
["Value21", "Value22", "Value23"]
]
example_dict = datesy.rows_to_dict(rows=example_rows, null_value=None)
example_dict = {
"Header1": {
"Value11": {
"Header2": None,
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
Open ends / missing last row entries¶
If the rows do not contain emtpy values at the end of the row:
Normally, a check prevents handling this data as row-based data should always have the same length. Yet, if emtpy values at the end of the row are not stored like this, you can disable this check and still convert data:
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12"],
["Value21", "Value22", "Value23"]
]
example_dict = datesy.rows_to_dict(rows=example_rows, contains_open_ends=True)
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
Selecting the header_line¶
For irrelevant data at the top of the row-based data, you can set the header_line to the desired position:
example_rows = [
["Undesired1", "Undesired2", "Undesired3"],
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]
example_dict = datesy.rows_to_dict(rows=example_rows, header_line=1)
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12",
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
Dictionary to rows¶
Just as simple is the converting vice_versa from dictionary to rows:
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12",
"Header3": "Value13",
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
example_rows = datesy.dict_to_rows(data=example_dict)
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]
Missing keys / not set data¶
When having data where certain keys are not set:
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
example_rows = datesy.dict_to_rows(data=example_dict)
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", ],
["Value21", "Value22", "Value23"]
]
Specify emtpy values:¶
Of course you can specify values to be set if a key is not set/emtpy:
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
example_rows = datesy.dict_to_rows(data=example_dict, if_emtpy_value=False)
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", False],
["Value21", "Value22", "Value23"]
]
Ordering the header¶
Just like picking the most relevant key in rows_to_dict, you can specify a certain order for the row-based data:
example_dict = {
"Header1": {
"Value11": {
"Header2": "Value12",
"Header3": "Value13"
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
}
example_rows = datesy.dict_to_rows(data=example_dict, order=["Header2", "Header3", "Header1"])
example_rows = [
["Header2", "Header3", "Header1"],
["Value12", "Value13", "Value11"],
["Value22", "Value23", "Value21"]
]
Data without main_key¶
What happens if you have data without a main_key like Header1 specified? Simply tell datesy about it:
example_dict = {
"Value11": {
"Header2": "Value12",
"Header3": "Value13",
},
"Value21": {
"Header2": "Value22",
"Header3": "Value23"
}
}
example_rows = datesy.dict_to_rows(data=example_dict, main_key_name="Header1")
example_rows = [
["Header1", "Header2", "Header3"],
["Value11", "Value12", "Value13"],
["Value21", "Value22", "Value23"]
]