An interesting thing happened while pasting shorter poems into various AI art generators (specifically Midjourney, Dall-E, and Stable Diffusion). I don’t quite know what to make of the recent results, as I only superficially understand how such algorithms work. I decided to use Langston Hughes’s poem “Harlem” because I wanted to see if the word “Harlem” returned specific architectural characteristics. My hunch was that these algorithms would “scrape” various photos and maps of Harlem, then morph and render them. I also wanted to see how a specific geographical and cultural reference point might mesh with concepts like “Maybe it just sags / like a heavy load.”
The renderings provided by the various engines were remarkably different. As a warning, I think you will recognize one set of images as being drastically different from the others. For the sake of experimentation, I ran one set that included the title/word “Harlem” in the initiating prompt, and a second where I subtracted the title. What I believe is important here is that the second renderings, without the word “Harlem,” remove any reference to a specific place or even to the specific population that might live there. Minus the title, there is nothing to suggest a human presence beyond the word “dream.” Even if humans are dreamers, that doesn’t mean we always have people in our dreams.
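For anyone curious about reproducing the comparison outside the web interfaces: of the three tools, only Stable Diffusion has openly released weights, so below is a minimal sketch of how the same with-and-without-title test could be run against them using Hugging Face’s diffusers library. I ran everything in this post through the hosted tools, so the checkpoint name, seed, and settings here are my assumptions, not what actually produced the images that follow.

```python
# A rough sketch, not what produced the images in this post. It assumes the
# openly released Stable Diffusion v1.5 checkpoint and the Hugging Face
# diffusers library; the hosted tools I actually used run their own pipelines.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# The body of the poem, without its title. (Note: the text encoder truncates
# prompts to 77 tokens, so the poem's later lines may carry little weight.)
poem_body = "What happens to a dream deferred? ..."

prompts = {
    "with_title": "Harlem. " + poem_body,
    "without_title": poem_body,
}

for label, prompt in prompts.items():
    # Reuse the same seed so the only difference between runs is the prompt.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"{label}.png")
```

Keeping the seed fixed means the only thing that changes between the two runs is the prompt itself, which is the closest a local setup can come to the comparison I was trying to make.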
First, we have Midjourney, which continues to return (to my eye at least) the most aesthetically pleasing images. If I had to guess, I’d say “Harlem” was the trigger for the urban rendering, while its absence gave more weight to phrases like “in the sun” and “syrupy,” hence the rural scene.


Dall-E is fun in that its renderings are more literal, clear, and photographic on a first pass (I could prompt for a certain style, say “Impressionist”; I chose not to). Not surprisingly, we get literal raisins! (The first set below included the word “Harlem.”)



Dall-E, minus the word “Harlem” in the prompt, initially returned the four images below…

And then we get to Stable Diffusion. It is hard for me to comment on what I see, so below are the images that were rendered both with and without “Harlem” in the prompt. All returns included human figures.


When the word “Harlem” was subtracted, the earlier examples from Midjourney and Dall-E shifted to rural images or still-life renditions of objects. The Stable Diffusion engine responded differently.

Overall, when the Hughes poem was entered as a prompt into Stable Diffusion, the rendered images almost always contained human-like figures, whether or not the word “Harlem” was present. That being said, Stable Diffusion did return several images that were largely text-based, filled with nonsense language, like the one below.
I am not sure if I am correct about the following, but I wonder if Stable Diffusion recognized the prompt as an existing text, or even as the specific poem by Langston Hughes. It was the only AI that returned text-heavy images, and this could be the result of my simply not following the conventional prompt format (I am certainly not using the language for prompts “properly”). What is clear is that whether the word “Harlem” was included or not, non-white human figures were rendered in the returned images, most often in a “deformed” fashion. Historically, we are accustomed to deformed human representation as a vehicle for racism (take a quick peek at the Jim Crow Museum of Racist Memorabilia). However, almost all of the most “realistic” human forms rendered by these three platforms have been deformed in some way (maybe to deliberately prevent recognition?). Here is a rendering I posted earlier, of Ezra Pound’s “In a Station of the Metro.” Take a look at the nightmare that is these faces in the crowd:
I don’t have extensive enough knowledge to discuss “intent” or motivation. However, having done some reading on algorithms and how we should expect them to encode the biases of their human creators, I want to pursue these results further. I just don’t know how yet. Maybe my own biases are what is at issue. I look at the deformed white faces in the Pound rendering and think, “haunting.” I look at the renderings of the non-white humans and immediately flinch and think of racist imagery. In fact, I didn’t want to post any of the images here, in this online space, for fear of how they would be received. Are these images any more ridiculous than the “self-portraits” described in an earlier post?
It’s entirely possible that there are obvious explanations that a deeper understanding of algorithmic programming might unlock. Maybe what strikes me most here is that I can easily recognize racism in human structures, systems, and individuals. With the algorithms, I am confused by the distance between human intention and the results. But maybe this is my own failing, since the algorithms themselves are human-built structures. It’s their “newness” and assumed complexity that clouds my ability to interpret them.