Parsing a JSON stream in Haskell

For one of my current pet projects I need to parse a potentially huge amount of JSON data. The obvious solution in such a case is to consume and parse the input as a stream. This blog post describes the approach I used to solve the task.

Since I had no experience with parsing JSON streams in Haskell, I started by googling, and I found almost nothing. There are a couple of blog posts on the Internet about the topic, but some of them I couldn't use (maybe because I'm not that smart) and the others are outdated and/or don't actually work.

So I decided to go with json-stream, an applicative incremental JSON parser for Haskell. I also took some inspiration from this post on Stack Overflow: Why doesn't print force entire lazy IO value?

Below is a working example of parsing a JSON stream. For this blog post, I took the Shopify API as an example to play with. You can find more details on the API in their Shopify dev section.

First, let's define the customer and customer address data types:

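{-# LANGUAGE DuplicateRecordFields #-}
-- Both records share field names (customerExtId, firstName, lastName),
-- so DuplicateRecordFields is needed for this module to compile.
module Models (Customer(..), CustomerAddress(..)) where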

import qualified Data.Text as T

data CustomerAddress =
  CustomerAddress
  { addressExtId :: !Int
  , customerExtId :: !Int
  , firstName :: !T.Text
  , lastName :: !T.Text
  , companyName :: !T.Text
  , address1 :: !T.Text
  , address2 :: !T.Text
  , city :: !T.Text
  , province :: !T.Text
  , provinceCode :: !T.Text
  , country :: !T.Text
  , countryCode :: !T.Text
  , countryName :: !T.Text
  , zipCode :: !T.Text
  , phone :: !T.Text
  , customerName :: !T.Text
  } deriving (Eq, Show)

data Customer =
  Customer
  { customerExtId :: !Int
  , email :: !(Maybe T.Text)
  , createdAt :: !T.Text
  , updatedAt :: !T.Text
  , firstName :: !T.Text
  , lastName :: !T.Text
  , customerState :: !T.Text
  , note :: !(Maybe T.Text)
  , addresses :: ![CustomerAddress]
  } deriving (Eq, Show)

Now, let's write a parser that consumes data and produces the customer address object:

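{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the string literals below serve as Text keys and defaults.
import Control.Applicative (many)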
import qualified Data.Text as T
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:), (.:?), (.|))
import qualified Models as M

addressParser :: J.Parser M.CustomerAddress
addressParser =
  M.CustomerAddress
    <$> "id" .: J.integer
    <*> "customer_id" .: J.integer
    <*> "first_name" .: J.string .| ""
    <*> "last_name" .: J.string .| ""
    <*> "company" .: J.string .| ""
    <*> "address1" .: J.string
    <*> "address2" .: J.string .| ""
    <*> "city" .: J.string
    <*> "province" .: J.string
    <*> "province_code" .: J.string
    <*> "country" .: J.string
    <*> "country_code" .: J.string
    <*> "country_name" .: J.string
    <*> "zip" .: J.string
    <*> "phone" .: J.string .| ""
    <*> "name" .: J.string

Next, we write a parser for the customer object:

customerParser :: J.Parser M.Customer
customerParser =
  M.Customer
    <$> "id" .: J.integer
    <*> "email" .:? J.string
    <*> "created_at" .: J.string
    <*> "updated_at" .: J.string
    <*> "first_name" .: J.string .| ""
    <*> "last_name" .: J.string .| ""
    <*> "state" .: J.string
    <*> "note" .:? J.string
    <*> many ("addresses" .: J.arrayOf addressParser)

And now let's write a parser that consumes the input data and uses both parsers to produce the customer objects:

customersParser :: J.Parser M.Customer
customersParser =
  J.objectWithKey "customers" $ J.arrayOf customerParser
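
Before wiring the parser to the network, it can help to try the same combinators on a small in-memory document. The following is a minimal sketch, not the code from this project: `MiniCustomer`, `miniCustomerParser`, `miniCustomersParser` and the `sample` document are hypothetical, trimmed-down stand-ins invented for this check, and the JSON is made up rather than taken from a real Shopify response. It uses `parseByteString` from json-stream, which runs a parser over a strict ByteString and returns the results as a list.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString.Char8 as BC
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:), (.:?), (.|))
import qualified Data.Text as T

-- Hypothetical, trimmed-down record used only for this check.
data MiniCustomer = MiniCustomer
  { miniId    :: !Int
  , miniEmail :: !(Maybe T.Text)
  , miniFirst :: !T.Text
  } deriving (Eq, Show)

-- Same combinators as in customerParser, on three fields only.
miniCustomerParser :: J.Parser MiniCustomer
miniCustomerParser =
  MiniCustomer
    <$> "id" .: J.integer
    <*> "email" .:? J.string          -- optional field, becomes Maybe
    <*> "first_name" .: J.string .| "" -- falls back to "" when absent

-- Yields one MiniCustomer per element of the "customers" array.
miniCustomersParser :: J.Parser MiniCustomer
miniCustomersParser =
  J.objectWithKey "customers" $ J.arrayOf miniCustomerParser

-- Made-up sample: the second customer has no email and no first_name.
sample :: BC.ByteString
sample = BC.pack
  "{\"customers\":[{\"id\":1,\"email\":\"a@example.com\",\"first_name\":\"Ann\"},{\"id\":2}]}"

main :: IO ()
main = mapM_ print (J.parseByteString miniCustomersParser sample)
```

If the combinators behave as documented, the second customer should come back with `Nothing` for the email and an empty first name, which is a quick way to confirm that `.:?` and `.|` cover the optional fields.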

So, we have written the parser, and now it's time to pull some data from the Internet and feed it to the parser. For simplicity, I use a fake API URL here and omit authorization; please refer to the API documentation for the relevant details. This function opens a new connection to the API endpoint and feeds the content to the `parseWith` function, which parses the input stream and counts the number of parsed objects.

import qualified Network.HTTP.Client as H2
import qualified Network.HTTP.Client.TLS as H2
import qualified Data.ByteString as B

someFunc :: IO ()
someFunc = do
    manager <- H2.newManager H2.tlsManagerSettings
    request <- H2.parseRequest "https://my-haskell-shop.myshopify.com/admin/api/2021-01/customers.json?limit=50"
    H2.withResponse request manager $ \response -> do
      putStrLn "The status code was: "
      print (H2.responseStatus response)
      -- responseBody is an IO action that returns the next chunk of the body
      let refill = H2.responseBody response
      chunk <- refill
      cnt <- parseWith refill customersParser chunk
      putStrLn ("parsed " ++ show cnt ++ " customers")

-- Runs the parser over the first chunk, asks refill for more input whenever
-- the parser needs it, and returns the number of parsed values.
parseWith :: Show a => IO B.ByteString -> J.Parser a -> B.ByteString -> IO Int
parseWith refill scheme inp = doparse (J.runParser' scheme inp) 0
    where
      doparse (J.ParseDone _) cnt = return cnt
      doparse (J.ParseFailed err) cnt = do
          putStrLn ("parse failed: " ++ err)
          return cnt
      -- print stands in for whatever processing each parsed value needs
      doparse (J.ParseYield v next) cnt = print v >> doparse next (cnt + 1)
      doparse (J.ParseNeedData cont) cnt = do
          dta <- refill
          if B.null dta
            then return cnt  -- the body ended before the JSON document did
            else doparse (cont dta) cnt
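
The chunk-feeding loop can also be exercised without a network connection. The sketch below is an illustration under stated assumptions rather than part of the original program: `feedChunks` and `customerIds` are hypothetical names, and the document is a made-up one that has been split at awkward positions (including in the middle of a key) to mimic how an HTTP body may arrive. It walks the same four `ParseOutput` constructors as `parseWith`, but takes its input from a plain list.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.JsonStream.Parser as J
import Data.JsonStream.Parser ((.:))

-- Feed a list of chunks to a parser and collect everything it yields.
-- Returns Left if the parser fails or the chunks run out too early.
feedChunks :: J.Parser a -> [B.ByteString] -> Either String [a]
feedChunks parser chunks = go (J.runParser parser) [] chunks
  where
    go (J.ParseDone _) acc _ = Right (reverse acc)
    go (J.ParseFailed err) _ _ = Left err
    go (J.ParseYield v next) acc rest = go next (v : acc) rest
    go (J.ParseNeedData _) _ [] =
      Left "input ended before the document was complete"
    go (J.ParseNeedData cont) acc (c : cs) = go (cont c) acc cs

-- Hypothetical parser: the ids of all customers in the response.
customerIds :: J.Parser Int
customerIds = "customers" .: J.arrayOf ("id" .: J.integer)

main :: IO ()
main = do
  -- The made-up document is split mid-key and mid-object on purpose.
  let chunks = map BC.pack ["{\"custo", "mers\":[{\"id\":1},{\"i", "d\":2}]}"]
  print (feedChunks customerIds chunks)
```

Because the parser keeps its state between chunks, the split points should not matter: the two ids are expected to come out the same way as if the document had been passed in one piece.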

For this example I used stack and LTS 17.7. Relevant dependencies are:

  aeson >=1.5.6.0
  , base >=4.7 && <5
  , bytestring >=0.10.12.0
  , http-client >=0.6.4.1
  , http-client-tls >=0.3.5.3
  , json-stream >= 0.4.2.4
  , text >=1.2.4.1

This approach worked pretty well for me.

Andrii Serhiienko

Sweden, Stockholm