The datesy package

The datesy package is divided into five main components:

  1. file I/O (subpackage)
  2. database I/O (subpackage)
  3. data converting
  4. data inspecting
  5. matching data sets

Subpackages

All actions of interacting with files are to be found here:

All actions of interacting with databases are to be found here:

Submodules

datesy.convert module

All actions of transforming data from different file formats are to be found here

datesy.convert.rows_to_dict(rows, main_key_position=0, null_value='delete', header_line=0, contains_open_ends=False)

Convert a list of rows (e.g. from a csv file) to a dictionary

Parameters:
  • rows (list) – the row based data to convert to dict
  • main_key_position (int, optional) – if the main_key is not on the top left, its position can be specified
  • null_value (any, optional) – if an empty field in the lists shall be represented somehow in the dictionary
  • header_line (int, optional) – if the header_line is not the first one, its position can be specified
  • contains_open_ends (bool, optional) – if the rows do not all have the same length (because each row ends at its last set entry), the length check for corrupted data can be skipped
Returns:

dictionary containing the information from row-based data

Return type:

dict
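The conversion described above can be illustrated with a minimal standard-library sketch. It is not datesy's implementation: the function name rows_to_dict_sketch, the nested {main_key_name: {main_key: {column: value}}} output shape and the choice to drop empty fields (mirroring the idea behind null_value='delete') are assumptions made for illustration only.

```python
def rows_to_dict_sketch(rows, main_key_position=0, header_line=0):
    """Hypothetical stand-in, not datesy's code: turn row data into a nested dict."""
    header = rows[header_line]
    main_key_name = header[main_key_position]
    result = {}
    for row in rows[header_line + 1:]:
        main_key = row[main_key_position]
        # keep every non-empty field except the main key column itself
        # (assumption: empty fields are dropped, similar to null_value='delete')
        result[main_key] = {
            header[i]: value
            for i, value in enumerate(row)
            if i != main_key_position and value not in ("", None)
        }
    return {main_key_name: result}


rows = [
    ["id", "name", "city"],
    ["1", "Ada", "London"],
    ["2", "Bob", ""],
]
converted = rows_to_dict_sketch(rows)
# converted == {"id": {"1": {"name": "Ada", "city": "London"}, "2": {"name": "Bob"}}}
```

The first column serves as the main key here; passing a different main_key_position selects another column instead.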

datesy.convert.dict_to_rows(data, main_key_name=None, main_key_position=None, if_empty_value=None, order=None)

Convert a dictionary to rows (list(lists))

Parameters:
  • data (dict) – the data to convert in form of a dictionary
  • main_key_name (str, optional) – if the data isn’t provided as {main_key: data} the key needs to be specified
  • main_key_position (int, optional) – if the main_key shall not be on top left of the data the position can be specified
  • if_empty_value (any, optional) – if a main_key’s sub_key is not set something different than blank can be defined
  • order (dict, list, None, optional) – if a special order for the keys is required
Returns:

list of rows representing the csv based on the main_element_position

Return type:

list(lists)
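The reverse direction can be sketched in the same spirit. This is an illustration under assumptions, not datesy's code: dict_to_rows_sketch is a hypothetical name, the input is assumed to have the shape {main_key: {sub_key: value}}, order is handled only as a list, and the main key is always placed in the first column.

```python
def dict_to_rows_sketch(data, main_key_name, if_empty_value=None, order=None):
    """Hypothetical stand-in, not datesy's code: flatten {main_key: {sub_key: value}} to rows."""
    if order is None:
        # collect the sub keys in first-seen order
        order = []
        for entry in data.values():
            for sub_key in entry:
                if sub_key not in order:
                    order.append(sub_key)
    # header row: main key name first, then the sub keys
    rows = [[main_key_name] + list(order)]
    for main_key, entry in data.items():
        # unset sub keys are filled with if_empty_value
        rows.append([main_key] + [entry.get(sub_key, if_empty_value) for sub_key in order])
    return rows


data = {"1": {"name": "Ada", "city": "London"}, "2": {"name": "Bob"}}
rows = dict_to_rows_sketch(data, main_key_name="id", if_empty_value="")
# rows == [["id", "name", "city"], ["1", "Ada", "London"], ["2", "Bob", ""]]
```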

datesy.convert.pandas_data_frame_to_dict(data_frame, main_key_position=0, null_value='delete', header_line=0)

Convert a pandas.DataFrame (e.g. read from an xlsx file) to a dictionary

Parameters:
  • data_frame (pandas.core.frame.DataFrame) –
  • main_key_position (int, optional) –
  • null_value (any, optional) –
  • header_line (int, optional) –
Returns:

the dictionary representing the xlsx based on main_key_position

Return type:

dict

datesy.convert.dict_to_pandas_data_frame(data, main_key_name=None, order=None, inverse=False)

Convert a dictionary to pandas.DataFrame

Parameters:
  • data (dict) – the dictionary to convert
  • main_key_name (str, optional) – if the json or dict does not have the main key as a single {main_element : dict} present, it needs to be specified
  • order (dict, list, optional) – list with the column names in order or dict with specified key positions
  • inverse (bool, optional) – if columns and rows shall be switched
Returns:

DataFrame representing the dictionary

Return type:

pandas.DataFrame

datesy.convert.xml_to_standard_dict(ordered_data, reduce_orderedDicts=False, reduce_lists=False, manual_selection_for_list_reduction=False)

Convert an xml/orderedDict to a normal dictionary

Parameters:
  • ordered_data (orderedDict) – input xml data to convert to standard dict
  • reduce_orderedDicts (bool, optional) – if collections.orderedDicts shall be converted to normal dicts
  • reduce_lists (bool, list, set, optional) – if lists in the dictionary shall be converted to dictionaries with transformed keys (list_key + unique key from dictionary from list_element). If a list or set is provided, only these values will be reduced
  • manual_selection_for_list_reduction (bool, optional) – if list reduction shall be decided manually. All keys in reduce_lists will be reduced automatically
Returns:

the normalized dictionary

Return type:

dict
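The core idea of reducing nested ordered dictionaries to plain ones can be sketched with the standard library. This is only an illustration, not datesy's implementation: to_plain_dict is a hypothetical helper, and lists are kept as lists, so the list reduction controlled by reduce_lists is not covered.

```python
from collections import OrderedDict


def to_plain_dict(node):
    """Hypothetical helper: recursively replace OrderedDicts with plain dicts."""
    if isinstance(node, dict):  # OrderedDict is a dict subclass
        return {key: to_plain_dict(value) for key, value in node.items()}
    if isinstance(node, list):
        # lists stay lists; only their elements are converted
        return [to_plain_dict(item) for item in node]
    return node


ordered = OrderedDict(
    [("root", OrderedDict([("item", [OrderedDict([("id", "1")]), OrderedDict([("id", "2")])])]))]
)
plain = to_plain_dict(ordered)
# plain == {"root": {"item": [{"id": "1"}, {"id": "2"}]}}
```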

datesy.inspect module

All actions of inspecting data are to be found here

datesy.inspect.find_header_line(data, header_keys)

Find the header line in a row-based data structure.

NOT IMPLEMENTED YET: Version 0.9 feature

Parameters:
  • data (list, pandas.DataFrame) –
  • header_keys (str, list, set) – some key(s) to find in a row
Returns:

the header_line

Return type:

int

datesy.inspect.find_key(data, key=None, regex_pattern=None)

Find a key in a complex dictionary

Parameters:
  • data (dict) – the data structure to find the key
  • key (str, optional) – a string to be found
  • regex_pattern (str, optional) – a regex match to be found
Returns:

all matches and their path in the structure {found_key: path_to_key}

Return type:

dict
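A recursive search of this kind can be sketched as follows. The sketch is not datesy's code: find_key_sketch is a hypothetical name, the path is rendered as a "/"-joined string by assumption, and only nested dictionaries are traversed.

```python
import re


def find_key_sketch(data, key=None, regex_pattern=None, _path=()):
    """Hypothetical stand-in: collect {found_key: path_to_key} from nested dicts."""
    matches = {}
    for current_key, value in data.items():
        path = _path + (current_key,)
        hit = (key is not None and current_key == key) or (
            regex_pattern is not None
            and re.search(regex_pattern, str(current_key)) is not None
        )
        if hit:
            # a key found more than once overwrites its earlier path in this sketch
            matches[current_key] = "/".join(str(part) for part in path)
        if isinstance(value, dict):
            matches.update(find_key_sketch(value, key, regex_pattern, path))
    return matches


data = {"server": {"host": "a", "ports": {"http_port": 80, "ssh_port": 22}}}
by_name = find_key_sketch(data, key="host")
# by_name == {"host": "server/host"}
by_regex = find_key_sketch(data, regex_pattern=r"_port$")
# by_regex == {"http_port": "server/ports/http_port", "ssh_port": "server/ports/ssh_port"}
```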

datesy.matching module

All actions of mapping data to other data as well as the functions helpful for that are to be found here

datesy.matching.simplify_strings(to_simplify, lower_case=True, simplifier=True)

Simplify a string, set(strings), list(strings) or the keys in a dict.

Options for simplifying include: lower capitals, separators, both (standard), own set of simplifier

Parameters:
  • to_simplify (list, set, string) – the string(s) to simplify presented by itself or as part of another data format
  • lower_case (bool, optional) – if the input shall be converted to only lower_case (standard: True)
  • simplifier (str, bool, optional) – the chars to be removed from the string. If of type bool and True, the standard chars _ , | \n ' & " % * - \ are used
Returns:

simplified values {simplified_value: input_value}

Return type:

dict
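The simplification and the {simplified_value: input_value} return shape can be illustrated with a short sketch. It is not datesy's implementation: simplify_strings_sketch is a hypothetical name, and the default character set below (including the space) is an assumption modelled on the standard chars listed above.

```python
def simplify_strings_sketch(to_simplify, lower_case=True, simplifier="_,|\n'&\"%*-\\ "):
    """Hypothetical stand-in: map each simplified string back to its input."""
    if isinstance(to_simplify, str):
        to_simplify = [to_simplify]
    simplified = {}
    for original in to_simplify:
        value = original.lower() if lower_case else original
        # remove every character listed in simplifier
        for char in simplifier:
            value = value.replace(char, "")
        # inputs that simplify to the same value overwrite each other in this sketch
        simplified[value] = original
    return simplified


result = simplify_strings_sketch(["First_Name", "last-name"])
# result == {"firstname": "First_Name", "lastname": "last-name"}
```

Keeping the original value in the result makes it possible to map a match on the simplified strings back to the unmodified input.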

datesy.matching.ease_match_similar(list_for_matching, list_to_be_matched_to, simplified=False, similarity_limit_for_matching=0.6, print_auto_matched=False)

Return a dictionary with list_for_matching as keys and list_to_be_matched_to as values based on most similarity. Matching twice to the same value is possible! The similarity limit for stopping the matching is set by similarity_limit_for_matching. Faster than datesy.matching.match_comprehensive, but more likely to contain errors when the strings are very similar.

Parameters:
  • list_for_matching (list, set) – Iterable of strings which shall be matched
  • list_to_be_matched_to (list, set) – Iterable of strings which shall be matched to
  • simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by converting to lower case, by unifying and deleting separators, or by removing any other provided list of strings
  • print_auto_matched (bool, optional) – Printing the matched entries during process (most likely for debugging)
  • similarity_limit_for_matching (float, optional) – the similarity limit below which no match is made, to avoid matching the most irrelevant values
Returns:

  • match (dict) – {value_for_matching: value_to_be_mapped_to}
  • no_match (set) – A set of all values from list_for_matching that could not be matched
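The per-value matching described above can be sketched with difflib from the standard library. This is an assumption-laden illustration, not datesy's algorithm: match_best_candidate is a hypothetical name and SequenceMatcher.ratio() is swapped in as the similarity measure, which may differ from what the package uses.

```python
from difflib import SequenceMatcher


def match_best_candidate(list_for_matching, list_to_be_matched_to, similarity_limit=0.6):
    """Hypothetical stand-in: give every value its single most similar candidate."""
    match, no_match = {}, set()
    for value in list_for_matching:
        # pick the candidate with the highest similarity ratio for this value
        best = max(
            list_to_be_matched_to,
            key=lambda candidate: SequenceMatcher(None, value, candidate).ratio(),
        )
        if SequenceMatcher(None, value, best).ratio() >= similarity_limit:
            match[value] = best  # the same candidate may be chosen more than once
        else:
            no_match.add(value)
    return match, no_match


match, no_match = match_best_candidate(
    ["colour", "flavour", "xyz"], ["color", "flavor", "size"]
)
# match == {"colour": "color", "flavour": "flavor"}
# no_match == {"xyz"}
```

Because each value is judged on its own, nothing prevents two values from ending up with the same candidate.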

datesy.matching.match_comprehensive(list_for_matching, list_to_be_matched_to, simplified=False)

Return a dictionary with list_for_matching as keys and list_to_be_matched_to as values based on most similarity. All values of both iterables get compared to each other and highest similarities are picked. Slower than datesy.matching.ease_match_similar but more precise.

Parameters:
  • list_for_matching (list, set) – Iterable of strings which shall be matched
  • list_to_be_matched_to (list, set) – Iterable of strings which shall be matched to
  • simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by converting to lower case, by unifying and deleting separators, or by removing any other provided list of strings
Returns:

  • match (dict) – {value_for_matching: value_to_be_mapped_to}
  • no_match (set) – A set of all values from list_for_matching that could not be matched
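Comparing all values of both iterables and picking the highest similarities first can be sketched as below. This is not datesy's implementation: match_all_pairs is a hypothetical name, difflib's SequenceMatcher.ratio() stands in for the similarity measure, the similarity_limit parameter is not part of the documented signature, and using every target value at most once is an assumption.

```python
from difflib import SequenceMatcher
from itertools import product


def match_all_pairs(list_for_matching, list_to_be_matched_to, similarity_limit=0.6):
    """Hypothetical stand-in: score every pair, then assign the best pairs first."""
    scored = sorted(
        (
            (SequenceMatcher(None, value, target).ratio(), value, target)
            for value, target in product(list_for_matching, list_to_be_matched_to)
        ),
        reverse=True,
    )
    match, used_targets = {}, set()
    for score, value, target in scored:
        if score < similarity_limit:
            break  # all remaining pairs are even less similar
        if value not in match and target not in used_targets:
            match[value] = target
            used_targets.add(target)
    no_match = set(list_for_matching) - set(match)
    return match, no_match


match, no_match = match_all_pairs(["apple", "apples"], ["apples", "apple pie"])
# match == {"apples": "apples", "apple": "apple pie"}
# no_match == set()
```

In this example a per-value search would assign "apples" to both inputs; scoring all pairs first lets the exact match claim "apples" and leaves "apple pie" for the other value, at the cost of comparing every pair.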

datesy.matching.match_similar_with_manual_selection(list_for_matching, list_to_be_matched_to, simplified=False, minimal_distance_for_automatic_matching=0.1, print_auto_matched=False, similarity_limit_for_manual_checking=0.6)

Return a dictionary with list_for_matching as keys and list_to_be_matched_to as values based on most similarity. All possible matches not matched automatically (set the limit with minimal_distance_for_automatic_matching) can be handled interactively. The similarity limit for stopping the matching is set by similarity_limit_for_manual_checking.

Parameters:
  • list_for_matching (list, set) – Iterable of strings which shall be matched
  • list_to_be_matched_to (list, set) – Iterable of strings which shall be matched to
  • simplified (False, "capital", "separators", "all", list, str, optional) – For reducing the values by converting to lower case, by unifying and deleting separators, or by removing any other provided list of strings
  • print_auto_matched (bool, optional) – Printing the matched entries during process (most likely for debugging)
  • minimal_distance_for_automatic_matching (float, optional) – If there is a vast difference between the most and second most similar value, the match is made automatically. This parameter sets the similarity distance that must be reached for automatic matching
  • similarity_limit_for_manual_checking (float, optional) – the similarity limit below which candidates are neither shown nor matched, to avoid the most irrelevant matches
Returns:

  • match (dict) – {value_for_matching: value_to_be_mapped_to}
  • no_match (set) – A set of all values from list_for_matching that could not be matched
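The split between automatic and manual matching can be sketched without the interactive part. The code below is an illustration under assumptions, not datesy's algorithm: classify_matches is a hypothetical name, difflib's SequenceMatcher.ratio() stands in for the similarity measure, and instead of prompting the user the ambiguous values are returned together with their candidates.

```python
from difflib import SequenceMatcher


def classify_matches(list_for_matching, list_to_be_matched_to,
                     minimal_distance=0.1, similarity_limit=0.6):
    """Hypothetical stand-in: separate clear matches from ones needing a manual decision."""
    auto, manual, no_match = {}, {}, set()
    for value in list_for_matching:
        ranked = sorted(
            ((SequenceMatcher(None, value, c).ratio(), c) for c in list_to_be_matched_to),
            reverse=True,
        )
        best_score, best = ranked[0]
        second_score = ranked[1][0] if len(ranked) > 1 else 0.0
        if best_score < similarity_limit:
            no_match.add(value)  # nothing is similar enough to be offered
        elif best_score - second_score >= minimal_distance:
            auto[value] = best  # clear winner, matched automatically
        else:
            # ambiguous: keep all sufficiently similar candidates for manual selection
            manual[value] = [c for score, c in ranked if score >= similarity_limit]
    return auto, manual, no_match


auto, manual, no_match = classify_matches(
    ["report_2020", "summary", "qqq"],
    ["report_2021", "report_2022", "summary_final", "notes"],
)
# "summary" is matched automatically to "summary_final",
# "report_2020" is left for manual selection between the two report candidates,
# "qqq" ends up in no_match.
```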