Paribus FAQs
What do I need to run Paribus?
You need data! See the recommended minimum system requirements for Paribus but, really, Paribus simply needs to be on a PC (or a server with remote access) so you can point it at the database(s) you want to deduplicate.
What data can Paribus match?
Paribus performs data matching tasks to find duplicates in any database you can connect to with an ODBC connection. You can use Paribus with any SQL Server, Oracle or DB2 database.
How does Paribus connect to a database?
When you run Paribus, you need to establish a connection with your database. You do this by setting the data link properties when you set up a data provider. Paribus uses the Microsoft OLE DB provider for the type of database you're connecting to (SQL Server or Oracle, for example).
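As a rough illustration of what the data link properties amount to, an OLE DB connection string is a list of key=value pairs separated by semicolons. The sketch below assembles one in Python. The server and database names are hypothetical, and this is not how Paribus itself stores or builds its connection; SQLOLEDB is the Microsoft OLE DB provider name for SQL Server.

```python
def build_oledb_connection_string(provider, server, database, integrated_security=True):
    """Assemble an OLE DB style connection string from its parts."""
    parts = {
        "Provider": provider,          # e.g. SQLOLEDB for SQL Server
        "Data Source": server,         # hypothetical server name
        "Initial Catalog": database,   # hypothetical database name
    }
    if integrated_security:
        # Use the current Windows login rather than a user name and password.
        parts["Integrated Security"] = "SSPI"
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


conn = build_oledb_connection_string("SQLOLEDB", "crm-server", "CRM_Live")
print(conn)
```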
What is a data provider?
To enable data matching with Paribus, and find the potential duplicates in a database, you must define that source of data. This is achieved by defining a Paribus data provider. The data provider enables the configuration of a link to the data, as seen in the previous question, and the identification of the database itself. A data provider is used by a data set to access the underlying data. A given data provider can be used for any number of data sets.
What do I see when I connect to a database using Paribus?
When you have successfully connected to a database, you will see all the tables it contains. These are the tables on which you can run data matching sessions with Paribus and from which you can find duplicate data. A table comprises fields into which you will have entered data. In the account table, for example, you will find fields called "Company name", "Address" and "Post code". To make your data matching more accurate, find the unique data in the database (the primary key) and run your data matching against this.
Can I only deduplicate uncustomised databases?
No. Paribus will display any and all of the tables of any database(s) it is connected to.
What is ODBC?
ODBC stands for Open Database Connectivity. It is a standard database access method that enables access to any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC does this by inserting a middle layer, called a database driver, between the application and the DBMS. ODBC translates Paribus's data queries into commands that the DBMS understands through this layer.
What is a data set?
A data set defines how the data and supporting information you want to deduplicate will be retrieved from your database as you deduplicate it. A data set also enables the creation of filters that you can apply when you're deduplicating data so you can limit the extent of the data you retrieve. Data sets are referenced by match sets and match conditions and a given data set can be used by any number of match sets and match conditions.
What is a match set?
A match set is a template that defines the initial two items of data (data sets) you will compare in Paribus. A match set also defines the related match conditions you can apply to the match set when you use it in Paribus. When a match set is used within a match session, the definitions of the match set are derived from the default settings but these can be overridden within the configuration of the match session.
What is a match condition?
A match condition is a template that defines additional match criteria that you can apply to the matches established from a match set. Once a match set has established a collection of initial matches, you can apply a match condition to those matches. You can apply one or more match conditions during the deduplication process, and only those results that meet the criteria of all or any one of the conditions will remain in the results of the match session. Match conditions are linked to match sets based upon their related data and are available for selection in the match process, via the match session.
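The "all or any one of the conditions" behaviour can be sketched as a simple filter over a list of initial matches. This is only an illustration of the idea: the function name, the mode parameter and the fields on each match are hypothetical and are not part of Paribus.

```python
def apply_conditions(matches, conditions, mode="all"):
    """Keep the matches that satisfy all (or any one) of the conditions."""
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    combine = all if mode == "all" else any
    return [m for m in matches if combine(cond(m) for cond in conditions)]


# Initial matches as a match set might produce them (hypothetical fields).
matches = [
    {"first": "Acme Ltd", "second": "ACME Limited", "same_postcode": True, "same_phone": False},
    {"first": "Bolt & Co", "second": "Bolt and Company", "same_postcode": True, "same_phone": True},
    {"first": "Cedar Inc", "second": "Cedar Homes", "same_postcode": False, "same_phone": False},
]

# Two match conditions, each a test applied to one match.
conditions = [
    lambda m: m["same_postcode"],
    lambda m: m["same_phone"],
]

strict = apply_conditions(matches, conditions, mode="all")
loose = apply_conditions(matches, conditions, mode="any")
```

Here strict keeps only the Bolt pair, while loose keeps both the Acme and Bolt pairs.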
What is a match session?
A match session is the highest-level object in Paribus and defines the rules that you will apply during the match process, when you're deduplicating data. It also contains the match results for that process and allows you to view, review and export or process the results using various tools and plug-ins.
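The relationships between the objects described in the last few answers can be sketched as a handful of Python data classes. This is a reading aid only: the class and field names are assumptions made for illustration and do not reflect how Paribus is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataProvider:
    name: str
    connection_string: str  # the data link to the database


@dataclass
class DataSet:
    name: str
    provider: DataProvider            # one provider can serve many data sets
    table: str
    row_filter: Optional[str] = None  # optional filter limiting the data retrieved


@dataclass
class MatchCondition:
    name: str
    data_set: DataSet


@dataclass
class MatchSet:
    name: str
    first: DataSet                    # the initial two data sets to compare
    second: DataSet
    conditions: List[MatchCondition] = field(default_factory=list)


@dataclass
class MatchSession:
    name: str
    match_sets: List[MatchSet]
    results: list = field(default_factory=list)  # filled in by the match process


provider = DataProvider("CRM", "Provider=SQLOLEDB;Data Source=crm-server;")
accounts = DataSet("Accounts", provider, table="account")
leads = DataSet("Leads", provider, table="lead", row_filter="country = 'UK'")
same_postcode = MatchCondition("Same post code", accounts)
accounts_vs_leads = MatchSet("Accounts vs leads", accounts, leads, [same_postcode])
session = MatchSession("Weekly dedupe", [accounts_vs_leads])
```

Note how a single provider is shared by both data sets, and how the session sits at the top and holds everything else.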
How does Paribus find duplicates?
When you have chosen and configured the match sets to run against your selected data sets, you are about ready to run a Paribus match session. This is the data matching process itself and will find duplicates in your target database(s) irrespective of word order, synonyms, phonetics, noise words and case or spelling variations.
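To give a feel for what matching "irrespective of word order, noise words and case" means, here is a minimal Python sketch that reduces a company name to a comparable key. It is not the Paribus algorithm: the noise word list is a made-up example, and synonyms, phonetics and spelling variations are not handled here.

```python
import re

# Hypothetical noise words; a real list would be tuned to the data.
NOISE_WORDS = {"the", "ltd", "limited", "inc", "co", "company", "and"}


def normalise(name):
    """Lower-case the name, drop punctuation and noise words, ignore word order."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return tuple(sorted(w for w in words if w not in NOISE_WORDS))


def looks_like_duplicate(a, b):
    """True when two names reduce to the same normalised key."""
    return normalise(a) == normalise(b)


print(looks_like_duplicate("The Acme Trading Co. Ltd", "ACME trading limited"))
print(looks_like_duplicate("Trading Acme", "Acme Trading"))
print(looks_like_duplicate("Acme Trading", "Acme Foods"))
```

The first two comparisons are treated as duplicates and the third is not.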
How do I find more duplicates?
You can set the intensity of the data matching process so Paribus identifies data that is more exactly or more loosely matched. You can look for duplicates of 60% to 100% similarity. There is no rule that specifies the characteristics of data that is 72% or 87% matched because the percentage values depend on the initial quality of the data you're retrieving. You can also run extended data matching to return even more potential duplicates but keep in mind that you want to be reviewing meaningful results and that loosely calibrated, extended data matching may return a greater percentage of records that are similar but not duplicates.
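The effect of a similarity threshold can be illustrated with Python's standard difflib module. This is a stand-in for demonstration only: Paribus calculates its percentages in its own way, so the scores below will not correspond to Paribus scores, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings, ignoring case."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(pairs, threshold=60):
    """Keep the pairs whose similarity is at or above the threshold."""
    return [(a, b) for a, b in pairs if similarity(a, b) >= threshold]


pairs = [
    ("Acme Trading", "Acme Trading"),   # identical
    ("Acme Trading", "Acme Tradng"),    # one letter missing
    ("Acme Trading", "Zenith Foods"),   # unrelated
]

exact_only = candidate_duplicates(pairs, threshold=100)
looser = candidate_duplicates(pairs, threshold=60)
```

At a threshold of 100 only the identical pair survives; at 60 the misspelt pair is returned as well, while the unrelated pair is excluded in both cases. Lowering the threshold further would start to let in pairs that are similar but not duplicates.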
Depending on the structure of the data you're working on, you may wish to run a succession of differently configured match sessions to generate the most effective results. Remember, too, that match conditions let you qualify a particular match process so you can be more specific about the data you match. Extended Matching applies more rigorous matching rules and might turn up results you weren't expecting.
QGate can advise on a number of other methods and processes you can use to get the best possible results.
