drop_duplicate_rows

cocohelper.utils.dataframe.drop_duplicate_rows(df, ignore_columns=None)

Drop duplicate rows of a DataFrame and return a map of merged elements.

Duplicates are defined as rows that hold the same values in every column; the index is not compared. Some columns can be ignored for the purpose of identifying duplicates.

Parameters:
  • df (DataFrame) – input DataFrame.

  • ignore_columns (Optional[List[str]]) – columns to ignore when identifying duplicates.

Returns:

  • The DataFrame without duplicates.

  • A dict that maps indices of the dropped (merged) elements to the indices of the corresponding kept elements.

Return type:

Tuple[DataFrame, dict]
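
The behaviour described above can be illustrated with a small stand-alone sketch. This is not the cocohelper implementation: drop_duplicate_rows_sketch is a hypothetical helper written only against the documented contract. It assumes the first occurrence of each duplicate group is the one kept, that index labels are unique, and that cell values are hashable and not missing.

```python
import pandas as pd


def drop_duplicate_rows_sketch(df, ignore_columns=None):
    """Hypothetical illustration of the documented contract (not library code)."""
    ignore_columns = list(ignore_columns or [])
    # Only these columns take part in the duplicate comparison.
    subset = [c for c in df.columns if c not in ignore_columns]

    first_seen = {}  # row values -> index label of the first row with those values
    merged = {}      # index label of a dropped row -> index label of the kept row
    rows = df[subset].itertuples(index=False, name=None)
    for label, values in zip(df.index, rows):
        if values in first_seen:
            merged[label] = first_seen[values]
        else:
            first_seen[values] = label

    # Assumption: index labels are unique, so dropping by label is unambiguous.
    return df.drop(index=list(merged)), merged


df = pd.DataFrame(
    {
        "name": ["cat", "dog", "cat", "dog"],
        "supercategory": ["animal", "animal", "animal", "pet"],
        "note": ["a", "b", "c", "d"],
    },
    index=[10, 11, 12, 13],
)

# "note" differs on every row, so it is ignored when comparing rows.
deduped, merged = drop_duplicate_rows_sketch(df, ignore_columns=["note"])
print(deduped)
print(merged)  # {12: 10}: row 12 was merged into row 10
```

The returned dict is what lets a caller rewrite references to dropped rows, for example by replacing every id that appears as a key with the corresponding value.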