Automating Enrichment Jobs

Install

First, make sure you have the Analytics package installed. If you aren't sure, run this:



:::bash
    pip install demyst-analytics



Test Data


First, let's create some test data to use in this example. In an IPython environment or in a Python script, execute this code:



:::python
    import pandas as pd
    test_df = pd.DataFrame({'email_address': ['test@test.com', 'test2@test.com']})
    test_df.to_csv("inputs.csv", index=False, sep=',', encoding='utf-8')

You should end up with a file called inputs.csv that looks like this:



    email_address
    test@test.com
    test2@test.com



Automation

Now that we have some test data, let's build a script to enrich our input file using the Demyst platform. For this test we will use the domain_from_email data product, a test product from Demyst that splits each email address sent to it into its user and host parts.

Let's start by importing the necessary packages.



:::python
    import pandas as pd
    from demyst.analytics import Analytics

You will need a production API Key from the Demyst Console.



:::python
    analytics = Analytics(key='XXXXXX')


If you don't have an API Key yet, you can test using your Username and Password by leaving out the key parameter.



:::python
    analytics = Analytics()
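For an unattended job you may prefer not to hard-code the key in the script. The following is a minimal sketch, assuming the key is stored in an environment variable; the name `DEMYST_API_KEY` and the helper `analytics_kwargs` are hypothetical and chosen only for this example. It builds the keyword arguments for `Analytics()` and omits `key` when the variable is unset, which matches the username and password fallback described above.

```python
import os


def analytics_kwargs(env=os.environ, var="DEMYST_API_KEY"):
    """Build keyword arguments for Analytics() from the environment.

    DEMYST_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    key = env.get(var)
    # With a key set, pass it through; otherwise pass nothing so that
    # Analytics() is called without the key parameter.
    return {"key": key} if key else {}


# Usage (requires the demyst-analytics package):
#     analytics = Analytics(**analytics_kwargs())
```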


Now let's read in our inputs file. Because the CSV file's header, email_address, is a column name the Demyst platform understands, the file can be used as a dataframe without modification.



:::python
    inputs = pd.read_csv('inputs.csv')
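In an automated job it can help to fail early if the input file does not have the header you expect. Here is a small sketch of such a check; the helper `check_columns` and its default list of required columns are assumptions for this example, based only on the email_address column used in this walkthrough.

```python
import io

import pandas as pd


def check_columns(df, required=("email_address",)):
    """Raise ValueError if any required column is missing from df."""
    missing = [name for name in required if name not in df.columns]
    if missing:
        raise ValueError("missing input columns: " + ", ".join(missing))
    return df


# Same content as the inputs.csv created earlier, read from memory here
# so the example does not depend on a file on disk.
sample = pd.read_csv(io.StringIO("email_address\ntest@test.com\ntest2@test.com\n"))
checked = check_columns(sample)
```

In the script itself you would call `check_columns(inputs)` right after `pd.read_csv('inputs.csv')`.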


To enrich the file, we pass the list of providers along with the input dataframe to the enrich function.



:::python
    job_id = analytics.enrich(['domain_from_email'], inputs, validate=False)


The enrich_download function will block until the job is complete and return a dataframe:



:::python
    outputs = analytics.enrich_download(job_id)


Lastly, we can take the resulting output dataframe and write it to a file.



:::python
    outputs.to_csv('outputs.csv', index=False, sep=',', encoding='utf-8')


The output of this script will be a file called outputs.csv which should look like this:



    inputs.email_address,domain_from_email.row_id,domain_from_email.client_id,domain_from_email.host,domain_from_email.user,domain_from_email.error
    test@test.com,0,,test.com,test,
    test2@test.com,1,,test.com,test2,


This output could feed the next stage of an ETL pipeline, or it could be imported into a modeling tool.
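Downstream steps often want one provider's columns without the `provider.` prefix. The sketch below shows one way to do that with pandas, using the sample outputs.csv content shown above. The helper `provider_columns` is hypothetical, and treating an empty error field as "no error reported" is an assumption made for this example.

```python
import io

import pandas as pd

# Sample taken from the outputs.csv shown above.
raw = io.StringIO(
    "inputs.email_address,domain_from_email.row_id,domain_from_email.client_id,"
    "domain_from_email.host,domain_from_email.user,domain_from_email.error\n"
    "test@test.com,0,,test.com,test,\n"
    "test2@test.com,1,,test.com,test2,\n"
)
outputs = pd.read_csv(raw)


def provider_columns(df, provider):
    """Return only the columns for one provider, with the prefix removed."""
    prefix = provider + "."
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].rename(columns=lambda c: c[len(prefix):])


domain = provider_columns(outputs, "domain_from_email")

# Assumption: an empty error field means no error was reported for that row.
ok = domain[domain["error"].isna()]
```

With the sample data, `domain` has the columns row_id, client_id, host, user, and error, and both rows pass the error filter.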

The full solution is provided below. If you need help automating a production job, don't hesitate to reach out to support@demystdata.com.



:::python
    import pandas as pd
    from demyst.analytics import Analytics

    analytics = Analytics()

    inputs = pd.read_csv('inputs.csv')

    job_id = analytics.enrich(['domain_from_email'],
                              inputs,
                              validate=False)

    outputs = analytics.enrich_download(job_id)

    outputs.to_csv('outputs.csv',
                   index=False,
                   sep=',',
                   encoding='utf-8')
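If you plan to run this on a schedule, it may be convenient to wrap the steps in a function that takes the file paths and provider list as arguments. The following is a sketch under stated assumptions, not a definitive implementation: `run_enrichment` is a hypothetical helper that makes the same `enrich` and `enrich_download` calls as the script above, and `FakeAnalytics` is a hypothetical offline stand-in (it only echoes the inputs back) so the function can be dry-run without contacting the platform.

```python
import os
import tempfile

import pandas as pd


def run_enrichment(analytics, providers, input_path, output_path):
    """Read input_path, enrich it with the given providers, write output_path."""
    inputs = pd.read_csv(input_path)
    # Same calls as in the script above; `analytics` is passed in so the
    # caller decides how it is constructed (API key or username/password).
    job_id = analytics.enrich(providers, inputs, validate=False)
    outputs = analytics.enrich_download(job_id)
    outputs.to_csv(output_path, index=False, sep=",", encoding="utf-8")
    return outputs


class FakeAnalytics:
    """Hypothetical offline stand-in: echoes the inputs back, no network calls."""

    def enrich(self, providers, inputs, validate=True):
        self._inputs = inputs
        return "fake-job-id"

    def enrich_download(self, job_id):
        # Mimic only the "inputs." column prefix seen in the real output.
        return self._inputs.add_prefix("inputs.")


# Dry run in a temporary directory using the stand-in.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "inputs.csv")
    out_path = os.path.join(tmp, "outputs.csv")
    pd.DataFrame({"email_address": ["test@test.com"]}).to_csv(in_path, index=False)
    result = run_enrichment(FakeAnalytics(), ["domain_from_email"], in_path, out_path)
    written = pd.read_csv(out_path)
```

For a real run you would pass an `Analytics` instance instead of `FakeAnalytics()`, for example `run_enrichment(Analytics(), ['domain_from_email'], 'inputs.csv', 'outputs.csv')`.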