Understanding the Airbyte CDK #3: read command
#33814
marcosmarxm started this conversation in Guides & Tutorials
How the Airbyte CDK read command works
- The `launch()` function receives the source object reference.
- `read_catalog()` reads the injected file in `/tmp/workspace/<JOB_ID>/<ATTEMPT>` and maps it to an Airbyte catalog object (`ConfiguredAirbyteCatalog`).
- `read_state()` does the same for the state file (`AirbyteStateMessage` object).
- `read` first validates the config and then calls the `read` function from the Source class.
- Records are read through `read_records` from the `HttpStream` class. (Some special connectors override this function, but most use the one from the base class.)
- The request is built from the output of the `request_params` and `next_page_token` functions.
- `parse_response` reads the JSON object returned from the API and outputs individual records, which are written to STDOUT, read by the Airbyte Worker, and sent to the destination.

From the `Entrypoint.py` file in the Airbyte CDK:

- Like the `check` function, the read first masks any secrets.
- It validates that the `config` is compatible with the `spec` version of the connector.
- It calls the `read` function from the `AbstractSource` class.

The `read` function in `AbstractSource` is quite a big one:

- Like the `discover` function, it calls the `streams` function to retrieve all streams, but now it builds a dictionary to easily map each stream class from its name.
- It creates a `ConnectorStateManager`, which will handle all state messages.
- It iterates over each `ConfiguredStream` (the stream selected by the user in the UI during connection creation).
- It checks availability with the `HttpAvailabilityStrategy` class, which checks the first record.
- It marks the stream as `STARTED`.
- It calls `_read_stream`.
- Depending on whether the sync mode is `incremental` or `full_refresh`, it calls a different function to read the records.

Full Refresh
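Before the breakdown, here is a rough sketch of the stream lookup and sync-mode dispatch described in the steps above. It does not use the real CDK classes: `DemoStream`, its two reader methods, and the dictionary-based catalog are hypothetical stand-ins chosen only to show the shape of the logic.

```python
from typing import Any, Dict, Iterator, List

Record = Dict[str, Any]


class DemoStream:
    """Hypothetical stand-in for a CDK stream; not the real Stream class."""

    def __init__(self, name: str) -> None:
        self.name = name

    def read_full_refresh(self) -> Iterator[Record]:
        # A real stream would call the API here.
        yield {"stream": self.name, "mode": "full_refresh"}

    def read_incremental(self) -> Iterator[Record]:
        # A real stream would also load and update state here.
        yield {"stream": self.name, "mode": "incremental"}


def read(streams: List[DemoStream], configured_catalog: List[Dict[str, str]]) -> Iterator[Record]:
    # Build a name -> stream dictionary so configured streams are easy to look up.
    stream_by_name = {stream.name: stream for stream in streams}
    for configured in configured_catalog:
        stream = stream_by_name.get(configured["name"])
        if stream is None:
            raise KeyError(f"Stream {configured['name']!r} is not defined by the source")
        # Pick the reading function based on the configured sync mode.
        if configured["sync_mode"] == "incremental":
            yield from stream.read_incremental()
        else:
            yield from stream.read_full_refresh()
```

Each entry in the configured catalog names a stream and a sync mode, and the sync mode alone decides which reading path is taken.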
Breaking down the `full_refresh` method

The first step is to call `stream_slices`. This concept is somewhat complex and can mislead users, so let's use an example to make it easier.

In our example, the service has two endpoints: `Orders` and `OrderDetails`.

`Orders` returns only the list of orders, without any details.

Whenever we access `OrderDetails`, we have to pass the order ID in the path, for example `order_details/2`.

This can be translated into stream slices: each order returned by `Orders` becomes one slice of `OrderDetails`.
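A minimal sketch of that translation, assuming an in-memory `FAKE_API` dictionary in place of real HTTP calls. The class names mirror the two endpoints, but the classes themselves, the `order_id` slice key, and the sample data are made up for illustration and are not taken from the CDK.

```python
from typing import Any, Dict, Iterator

# Hypothetical sample data standing in for the two endpoints.
FAKE_API: Dict[str, Any] = {
    "orders": [{"id": 1}, {"id": 2}],
    "order_details/1": {"order_id": 1, "items": ["book"]},
    "order_details/2": {"order_id": 2, "items": ["pen", "ink"]},
}


class Orders:
    def read_records(self) -> Iterator[Dict[str, Any]]:
        # Returns only the list of orders, without details.
        yield from FAKE_API["orders"]


class OrderDetails:
    def __init__(self, parent: Orders) -> None:
        self.parent = parent

    def stream_slices(self) -> Iterator[Dict[str, Any]]:
        # One slice per parent order: the slice carries the ID needed for the path.
        for order in self.parent.read_records():
            yield {"order_id": order["id"]}

    def path(self, stream_slice: Dict[str, Any]) -> str:
        return f"order_details/{stream_slice['order_id']}"

    def read_records(self, stream_slice: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
        # One lookup per slice instead of one lookup for the whole stream.
        yield FAKE_API[self.path(stream_slice)]
```

Reading `OrderDetails` then means iterating over `stream_slices()` and calling `read_records` once per slice.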
Another situation where you can use `stream_slices` is an endpoint whose records have a status such as `canceled`, `completed`, or `pending`: you can create a parameter that gives the user the option to retrieve records from each of them.

Well, returning to the full refresh reading function…
It iterates over all stream slices and calls the `read_records` function for each one of them.

This function is implemented in the `HttpStream` class (for most cases).

Stay with me! I know this one is not the most basic function 😟
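To make the shape of that function easier to follow, here is a simplified sketch. `MiniHttpStream` is a hypothetical class, not the real `HttpStream`; the single hard-coded response and the argument order of the callback are assumptions made for the example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Record = Dict[str, Any]


class MiniHttpStream:
    """Hypothetical, heavily simplified stand-in for an HTTP stream."""

    def parse_response(self, response: Dict[str, Any], stream_slice=None, stream_state=None) -> Iterator[Record]:
        # The part a connector author writes: turn one response into records.
        yield from response["data"]

    def _read_pages(self, records_generator_fn: Callable[..., Iterable[Record]],
                    stream_slice=None, stream_state=None) -> Iterator[Record]:
        # Simplified: a single fake page instead of real requests and pagination.
        request = {"path": "orders"}
        response = {"data": [{"id": 1}, {"id": 2}]}
        yield from records_generator_fn(request, response, stream_state, stream_slice)

    def read_records(self, stream_slice=None, stream_state=None) -> Iterator[Record]:
        # The anonymous function only forwards the response to parse_response.
        yield from self._read_pages(
            lambda req, res, state, _slice: self.parse_response(res, stream_slice=_slice, stream_state=state),
            stream_slice,
            stream_state,
        )


def read_full_refresh(stream: MiniHttpStream, slices: Iterable[Optional[Dict[str, Any]]]) -> Iterator[Record]:
    # Full refresh: call read_records once for every stream slice.
    for _slice in slices:
        yield from stream.read_records(stream_slice=_slice)
```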
Trying to translate: it generates an anonymous function which calls `parse_response`. `parse_response` is one of the functions you must implement when creating a connector.

It then calls the `_read_pages` function, which iterates over all pages from our API:

- `_fetch_next_page` creates and sends the actual request to the API, using `body_json`, `body_data`, `headers` and the `request_params` themselves.
- `_read_pages` receives the response and passes it to the `parse_response` function to yield the records.
- If `next_page_token` returns a dict, its data is sent with the next request and `pagination_complete = False` is kept.
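The pagination loop described in this list can be sketched as follows. This is an illustration under stated assumptions, not the CDK implementation: `PagedStream` is hypothetical, the "API" is a list of pre-built responses, and the `next` and `records` keys are invented for the example.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Any]


class PagedStream:
    """Hypothetical stream that reads from a list of fake page responses."""

    def __init__(self, pages: List[Dict[str, Any]]) -> None:
        self._pages = pages

    def _fetch_next_page(self, next_page_token: Optional[Dict[str, Any]]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        # Build the "request" from the token and return the matching fake response.
        page_number = next_page_token["page"] if next_page_token else 0
        request = {"params": {"page": page_number}}
        return request, self._pages[page_number]

    def next_page_token(self, response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        # A dict means "there is another page"; None means "stop".
        return response.get("next")

    def parse_response(self, response: Dict[str, Any]) -> Iterator[Record]:
        yield from response["records"]

    def _read_pages(self) -> Iterator[Record]:
        pagination_complete = False
        next_page_token: Optional[Dict[str, Any]] = None
        while not pagination_complete:
            _request, response = self._fetch_next_page(next_page_token)
            yield from self.parse_response(response)
            next_page_token = self.next_page_token(response)
            if not next_page_token:
                pagination_complete = True
```

The loop keeps `pagination_complete` false for as long as `next_page_token` returns something to send with the next request.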
Incremental Reading
Incremental reading is quite similar to full refresh. The difference is that when it starts, it retrieves the previous `stream_state` and builds the first HTTP request using this data.

For each record read, it tries to update the state, which has its own logic for being updated.
For each record read, the CDK updates a record counter and tries to checkpoint the state.
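A minimal sketch of that counter-and-checkpoint idea, assuming a numeric `updated_at` cursor and a fixed checkpoint interval. The function names, the message tuples, and the `fetch_since` filter (standing in for the first HTTP request built from the saved state) are all hypothetical.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Any]
Message = Tuple[str, Dict[str, Any]]


def fetch_since(all_records: List[Record], cursor_field: str, start: Optional[int]) -> List[Record]:
    # Stand-in for the first request: only ask for records newer than the saved state.
    return [r for r in all_records if start is None or r[cursor_field] > start]


def read_incremental(all_records: List[Record], stream_state: Dict[str, Any],
                     cursor_field: str = "updated_at", checkpoint_interval: int = 2) -> Iterator[Message]:
    state = dict(stream_state)
    record_counter = 0
    for record in fetch_since(all_records, cursor_field, state.get(cursor_field)):
        yield ("RECORD", record)
        # Update the state with the highest cursor value seen so far.
        state[cursor_field] = max(record[cursor_field], state.get(cursor_field, record[cursor_field]))
        record_counter += 1
        # Checkpoint the state every `checkpoint_interval` records.
        if record_counter % checkpoint_interval == 0:
            yield ("STATE", dict(state))
    # Always emit the final state at the end of the stream.
    yield ("STATE", dict(state))
```

With a saved state of `{"updated_at": 1}`, only newer records are read, and a state message is emitted after every second record and once more at the end.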
What customizations can you add to your connector?